« April 2003  2003 June 2003 »


I've arrived in Maine for a long weekend, visiting my mother in the aptly-named town of Friendship. The peepers are croaking up a storm outside, but if you listen carefully, you can hear raindrops.

An indignant letter from the insurance company is on the kitchen table. Apparently, this house has something called 'knob and tube wiring'. In the eyes of the insurance company, this is on…


Microsoft Headhunting

A curious surprise - it seems Microsoft is recruiting bloggers!

I just got a headhunting letter from one Kat Morrell, inviting me to apply for a job with the MSN Search people. From the letter, it sounds like they're preparing the Anti-Google - indexing the entire Internet to create "a search engine that will leapfrog over current technologies".

To which I say, leapfrog over my …


Search Patents

Software patents are the kiss of death for innovation, because under current practice you can get the most obvious 'method' for doing just about anything patented. Have you ever swung sideways on a swing? Patent in…


Thai Bloggers

The Thais are blogging! What a beautiful script they have.

Can anyone point me to other sources of weblogs in Southeast Asia? How about Africa?…


Language Barriers in Blogging

For a while now, I've been interested in how language barriers affect our ability to communicate online. With some real blog census data now coming in (and with the better half gone to her sister's graduation, and so unable to keep me from wasting a perfectly good Saturday) I spent today trying to measure how high those barriers are.

As I write t…



I saw a bear!

I was driving home along Route 100, which for many miles is in a narrow V-shaped valley with forest on both sides of the narrow road. And there it was, a black bear, who must have just crossed the road and was lumbering up the slope, westbound, to attend to his bear business up in the Green Mountains.

What always amazes me about bears is how sinuous their movements ar…


Crawl Update

After a dispiriting two days of zapping duplicate URLs, and watching the crawl count drop by tens of thousands of weblogs, it looks like we hit a rich pocket of ore in the site queue. Minutes ago, the census crawl nosed past Technorati, with about a third of a million weblog sites indexed. I believe this is now the largest general weblog list on the …


Markup Proposal

Part of the problem in indexing weblogs is finding them in the first place. and sites like it are a start, but there are plenty of weblogs that don't announce their updates anywhere. The only way to find them is by crawling.

Once you've found a weblog, you still have a problem. It's not easy to find dates, link lists, or boundaries between weblog posts. There are a zilli…


Fairvue and Technorati

Something funny happening on Technorati. A lot of people on Blogspot are using this as their default template. But all the sample links in the blogroll on that template point to a site called Fairvue. If you look at the Technorati Top 100 links, Fairv…


Poland: Global Power

Today's Wall Street Journal editorial page announces that Poland is now in the big leagues, and Instapundit broadcasts the happy news to the Internet:

Hard to believe, but Poland is now arguably a more consequential global power than either France or Germany. And the angry…



Today I owe a shoutout to the Waypath Project. Steve Nieker was kind enough to share his list of about a hundred thousand websites, and suddenly my crawler went from adding 200 blogs per hour to adding 11,000. At this rate, we might hit 200,000 weblogs indexed later in the night.

The sites being added now have the proportions of a martini, in w…


Blogging in Japan

A recent article in the Japan Media Review explains that most bloggers in Japan are actually lonely expats. It seems most Japanese are still reluctant to keep a weblog. And after one visit to the Japan Blogging Association site, I understand why.…


Drexler Keynote

There was a wonderful keynote presentation at the O'Reilly Emerging Tech conference, given by K. Eric Drexler, that has stuck in my mind and shows no signs of letting go. Since it hasn't received the same level of saturation coverage as some other talks, I figured it would be a perfect night to slug down a whisky sour and write about nan…


XML-RPC Interface

Inspired by the web services madness at BlogShares and Technorati, I've whipped up a quick XML-RPC interface to our own NITLE crawl database. You can get the language, authoring tool, and number of incoming and outgoing blog links for any blog URL we have listed (110K blogs and daily growin'). The micro-documentation is available right h…


Blog Language Rankings

A shocker in the language rankings, as Spanish moves in to bump Icelandic out of the top five!

That's thanks to Fernando Tricas over at Blogalia, who sent in a list of about 2,000 known Spanish blogs. Tricas and his group have been trying to measure the Spanish blogosphere…


More Blog Crawl

I owe a serious thank you to Seyed Razavi at BlogShares and Brigitte at Eatonweb, both of whom contributed an enormous list of blog URLs for the benefit of the blog crawler. Thanks to them, it looks like we'll be breaking through the 100,000 blog barrier sometime this afternoon, with hundreds of thousands of poten…


Distinguishing Farsi and Arabic

A handy Idle Words tip, from me to you:

You can distinguish Arabic writing from Farsi by looking to see if any of the words has a little triangle of dots underneath. Those subscript dots are used to represent the sounds "p" and "ch", neither of which exists in Arabic.

As Johnnie Cochran says, "If there's a triplet down below, then Arabic must go".…
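For anyone who would rather let a script do the squinting, here is a minimal sketch of the same tip in Python. It assumes the only letters worth checking are پ (U+067E) and چ (U+0686), the two Persian letters written with three dots below; the function name `looks_like_farsi` is made up for this example. Other languages written in the same script (Urdu and Kurdish, for instance) also use these letters, so treat a hit as a hint, not a verdict.

```python
# Letters written with a triangle of dots underneath.
# Assumption: these two code points are enough for a rough check.
TRIPLE_DOT_BELOW = {
    "\u067e",  # ARABIC LETTER PEH
    "\u0686",  # ARABIC LETTER TCHEH
}


def looks_like_farsi(text: str) -> bool:
    """Return True if any character in text has the triangle of dots below."""
    return any(ch in TRIPLE_DOT_BELOW for ch in text)


if __name__ == "__main__":
    print(looks_like_farsi("\u067e\u062f\u0631"))        # a word containing PEH
    print(looks_like_farsi("\u0643\u062a\u0627\u0628"))  # a word with no such letter
```

A short text may simply happen not to contain either letter, so a miss proves less than a hit.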



Via Heiko Hebig, a pointer to Der Weblogcheckup, the German I don't speak a word of German, so I'm transfixed by page names like Tools und Gimmicks. But my all-time favorite has to be Der Pingmacher…



I just uploaded WWW::Blog::Identify to the CPAN. It may be of use to you if you are indexing a large number of weblogs and want to figure out what flavor they are. Of course, in the best of all worlds, blog tool authors would make sure to include a line like this in their default template: <meta name="generator" con…
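When a template does carry a generator tag, reading it is the easy case. The sketch below uses only the Python standard library; it is not how WWW::Blog::Identify works, and the tool name `ExampleBlogTool` in the sample page is invented for illustration.

```python
from html.parser import HTMLParser


class GeneratorFinder(HTMLParser):
    """Remember the content of the first <meta name="generator"> tag seen."""

    def __init__(self):
        super().__init__()
        self.generator = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.generator is not None:
            return
        attr = dict(attrs)
        # Attribute values can be None when the attribute has no value.
        if (attr.get("name") or "").lower() == "generator":
            self.generator = attr.get("content")


def find_generator(html: str):
    """Return the generator string, or None if the page does not declare one."""
    finder = GeneratorFinder()
    finder.feed(html)
    return finder.generator


if __name__ == "__main__":
    # Hypothetical page; the tool name is made up.
    page = '<html><head><meta name="generator" content="ExampleBlogTool 1.0"></head></html>'
    print(find_generator(page))
```

Pages without the tag return `None`, which is exactly the case where a crawler has to fall back on guesswork.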


Lessig and Media Concentration

Lawrence Lessig posts a brilliant letter from an Aussie on the risks of media concentration. Australia is down to two media conglomerates now, which own pretty much everything, and aren't afraid to use their power.

Lessig has been campaigning against a 'deregulation' initiative that will lead to even …


Default English Blogging

The language detector is busy chugging its way through this morning's crop of 25,000 weblogs. It just told me it thinks the Drudge Report is in "Middle Frisian". That would explain everything!

As I peek at the foreign blogs accumulating in the crawl index, it occurs to me that life can be full of annoyances if you write a weblog in a languag…



Spring in Vermont - such a carnival! It feels like every plant, animal, and insect just got let out of prison, and is madly making up for lost time. It's also the time of year when the days get palpably longer, and already dusk stretches out towards nine o'clock. Spring is Vermont's underhanded way of trying to make me forget about the winter, but I won't be suckered. I still have those…


Blog Crawl Update

The results on the crawl statistics page are now separated (as befits a red-blooded American like me) into Us and The Rest of the World. That is, English language blogs, and all others. There's also a handy bar chart of language distribution so you can tell just which wily foreigners are blogging the most.

As time goes on, I…


auto lang

I've been playing with automatic language identification on the big weblog list, using a great little Perl script by Gertjan van Noord. The program reads the text and guesses the language based on letter clusters, and I've been running it on the blogs in my database all afternoon. A few minutes ago, I took a peek at the results, and foun…
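To give a feel for what guessing from letter clusters means, here is a toy sketch in Python. It is not Gertjan van Noord's script: the scoring is a deliberately simple overlap count, the function names are invented, and the two one-sentence language samples are assumptions chosen for the demo. A real identifier would be trained on far more text per language.

```python
from collections import Counter


def letter_clusters(text, n_max=3):
    """Count every cluster of 1..n_max characters, padding each word with '_'."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return counts


def overlap_score(doc_counts, known_clusters):
    """Fraction of the document's clusters that also occur in the language sample."""
    total = sum(doc_counts.values())
    if total == 0:
        return 0.0
    shared = sum(c for gram, c in doc_counts.items() if gram in known_clusters)
    return shared / total


def guess_language(text, samples):
    """Return the name of the sample whose clusters cover the text best."""
    doc_counts = letter_clusters(text)
    profiles = {lang: set(letter_clusters(s)) for lang, s in samples.items()}
    return max(profiles, key=lambda lang: overlap_score(doc_counts, profiles[lang]))


# Tiny illustrative samples (assumption: one sentence per language).
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and then the dog runs after the fox",
    "spanish": "el rápido zorro marrón salta sobre el perro perezoso y luego el perro corre tras el zorro",
}

if __name__ == "__main__":
    print(guess_language("the dog and the fox", SAMPLES))
    print(guess_language("el perro y el zorro", SAMPLES))
```

With samples this small the guesser is easy to fool, which is part of why odd results turn up when this kind of tool meets pages that are mostly names, links, and markup.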



An exciting assignment at work* - the boss says to me "go forth unto the Internet, and find me every weblog you can get your hands on!". It seems we need a large, live collection to prove our search algorithms on. Not exactly a Mt. Everest of data, but something more than the little molehills of documents we've conquered so far. So I have dutifully started crawling the Web, as well as asking…


Walrus Graphs

A beautiful gallery of graph visualizations from a program called Walrus. Most of the graphs are plain old workhorse data sets (large Web collections, directory trees, code files in a CVS repository), but the resulting images are breathtaking.

Walrus …


Drexler Keynote

One of the best talks at the Emerging Technology conference was a keynote presentation given by K. Eric Drexler, who coined the term "nanotechnology" and has been thinking about its implications for about twenty years.

Nanotechnology has been suffering from a mild buzzword hangover recently, c…


Let jobs

Who let the jobs out? Bush! Bush! Bush!…


Search CG

If you're the kind of person reading my blog late on a Friday night, you just might be interested in Search::ContextGraph. Pour yourself a cold one, hit the link, and let the party begin!…

Greatest Hits

The Alameda-Weehawken Burrito Tunnel
The story of America's most awesome infrastructure project.

Argentina on Two Steaks A Day
Eating the happiest cows in the world

Scott and Scurvy
Why did 19th century explorers forget the simple cure for scurvy?

No Evidence of Disease
A cancer story with an unfortunate complication.

Controlled Tango Into Terrain
Trying to learn how to dance in Argentina

Dabblers and Blowhards
Calling out Paul Graham for a silly essay about painting

Attacked By Thugs
Warsaw police hijinks

Dating Without Kundera
Practical alternatives to the Slavic Dave Matthews

A Rocket To Nowhere
A Space Shuttle rant

Best Practices For Time Travelers
The story of John Titor, visitor from the future

100 Years Of Turbulence
The Wright Brothers and the harmful effects of patent law


Your Host

Maciej Cegłowski


Please ask permission before reprinting full-text posts or I will crush you.