Friendship
I've arrived in Maine for a long weekend, visiting my mother in the aptly named town of Friendship. The peepers are croaking up a storm outside, but if you listen carefully, you can hear raindrops.
An indignant letter from the insurance company is on the kitchen table. Apparently, this house has something called 'knob and tube wiring'. In the eyes of the insurance company, this is on…
Microsoft Headhunting
A curious surprise - it seems Microsoft is recruiting bloggers!
I just got a headhunting letter from one Kat Morrell, inviting me to apply for a job with the MSN Search people. From the letter, it sounds like they're preparing the Anti-Google - indexing the entire Internet to create "a search engine that will leapfrog over current technologies".
To which I say, leapfrog over my …
Search Patents
Software patents are the kiss of death for innovation, because under current practice you can get the most obvious 'method' for doing just about anything patented. Have you ever swung sideways on a swing? Patent in…
Thai Bloggers
The Thais are blogging! What a beautiful script they have.
Can anyone point me to other sources of weblogs in Southeast Asia? How about Africa?…
Language Barriers in Blogging
For a while now, I've been interested in how language barriers affect our ability to communicate online. With some real blog census data now coming in (and with the better half gone to her sister's graduation, and so unable to keep me from wasting a perfectly good Saturday) I spent today trying to measure how high those barriers are.
As I write t…
Bear!
I saw a bear!
I was driving home along Route 100, which for many miles is in a narrow V-shaped valley with forest on both sides of the narrow road. And there it was, a black bear, who must have just crossed the road and was lumbering up the slope, westbound, to attend to his bear business up in the Green Mountains.
What always amazes me about bears is how sinuous their movements ar…
Crawl Update
After a dispiriting two days of zapping duplicate URLs, and watching the crawl count drop by tens of thousands of weblogs, it looks like we hit a rich pocket of ore in the site queue. Minutes ago, the census crawl nosed past Technorati, with about a third of a million weblog sites indexed. I believe this is now the largest general weblog list on the …
Markup Proposal
Part of the problem in indexing weblogs is finding them in the first place. Weblogs.com and sites like it are a start, but there are plenty of weblogs that don't announce their updates anywhere. The only way to find them is by crawling.
Once you've found a weblog, you still have a problem. It's not easy to find dates, link lists, or boundaries between weblog posts. There are a zilli…
Fairvue and Technorati
Something funny is happening on Technorati. A lot of people on Blogspot are using this as their default template. But all the sample links in the blogroll on that template point to a site called Fairvue. If you look at the Technorati Top 100 links, Fairv…
Poland: Global Power
Today's Wall Street Journal editorial page announces that Poland is now in the big leagues, and Instapundit broadcasts the happy news to the Internet:
Hard to believe, but Poland is now arguably a more consequential global power than either France or Germany. And the angry…
Waypath
Today I owe a shoutout to the Waypath Project. Steve Nieker was kind enough to share his list of about a hundred thousand websites, and suddenly my crawler went from adding 200 blogs per hour to adding 11,000. At this rate, we might hit 200,000 weblogs indexed later in the night.
The sites being added now have the proportions of a martini, in w…
Blogging in Japan
A recent article in the Japan Media Review explains that most bloggers in Japan are actually lonely expats. It seems most Japanese are still reluctant to keep a weblog. And after one visit to the Japan Blogging Association site, I understand why.…
Drexler Keynote
There was a wonderful keynote presentation at the O'Reilly Emerging Tech conference, given by K. Eric Drexler, that has stuck in my mind and shows no signs of letting go. Since it hasn't received the same level of saturation coverage as some other talks, I figured it would be a perfect night to slug down a whisky sour and write about nan…
XML-RPC Interface
Inspired by the web services madness at BlogShares and Technorati, I've whipped up a quick XML-RPC interface to our own NITLE crawl database. You can get the language, authoring tool, and number of incoming and outgoing blog links for any blog URL we have listed (110K blogs and daily growin'). The micro-documentation is available right h…
Blog Language Rankings
A shocker in the language rankings, as Spanish moves in to bump Icelandic out of the top five!
That's thanks to Fernando Tricas over at Blogalia, who sent in a list of about 2,000 known Spanish blogs. Tricas and his group have been trying to measure the Spanish blogosphere…
More Blog Crawl
I owe a serious thank you to Seyed Razavi at BlogShares and Brigitte at Eatonweb, both of whom contributed an enormous list of blog URLs for the benefit of the blog crawler. Thanks to them, it looks like we'll be breaking through the 100,000 blog barrier sometime this afternoon, with hundreds of thousands of poten…
Distinguishing Farsi and Arabic
A handy Idle Words tip, from me to you:
You can distinguish Arabic writing from Farsi by looking to see if any of the words has a little triangle of dots underneath. Those subscript dots are used to represent the sounds "ch" and "p", neither of which exists in Arabic.
As Johnnie Cochran says, "If there's a triplet down below, then Arabic must go".…
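For anyone who would rather let a script do the squinting, here is a minimal Python sketch of the same tip. It assumes the "triangle of dots underneath" corresponds to two Unicode letters, peh (U+067E) and tcheh (U+0686); the function name `looks_like_farsi` is made up for illustration, and other languages that use these letters (Urdu, Kurdish) would trip it too.

```python
# Letters written with three dots below. These two code points are an
# assumption about what the tip means, not an exhaustive list.
THREE_DOTS_BELOW = {
    "\u067e",  # ARABIC LETTER PEH
    "\u0686",  # ARABIC LETTER TCHEH
}


def looks_like_farsi(text):
    """Return True if any character in text is one of the three-dot letters."""
    return any(ch in THREE_DOTS_BELOW for ch in text)


farsi_word = "\u067e\u062f\u0631"         # peh, dal, reh
arabic_word = "\u0643\u062a\u0627\u0628"  # kaf, teh, alef, beh

print(looks_like_farsi(farsi_word))
print(looks_like_farsi(arabic_word))
```

A single hit is enough to flag the text, which matches the spirit of the tip: the absence of these letters proves nothing, but their presence rules out standard Arabic.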
dd
Via Heiko Hebig, a pointer to Der Weblogcheckup, the German weblogs.com. I don't speak a word of German, so I'm transfixed by page names like Tools und Gimmicks. But my all-time favorite has to be Der Pingmacher…
Blog::Identify
I just uploaded WWW::Blog::Identify to the CPAN. It may be of use to you if you are indexing a large number of weblogs and want to figure out what flavor they are. Of course, in the best of all worlds, blog tool authors would make sure to include a line like this in their default template: <meta name="generator" con…
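As a rough illustration of why a generator tag makes life easier for an indexer, here is a small Python sketch that pulls the generator value out of a page. It is not the code in WWW::Blog::Identify; the class and function names are hypothetical, the sample page and its tool name "ExampleBlogTool 1.0" are invented, and it assumes the value lives in the standard HTML `content` attribute.

```python
from html.parser import HTMLParser


class GeneratorFinder(HTMLParser):
    """Remember the content of the first <meta name="generator"> tag seen."""

    def __init__(self):
        super().__init__()
        self.generator = None

    def handle_starttag(self, tag, attrs):
        # Ignore everything except meta tags, and stop after the first match.
        if tag != "meta" or self.generator is not None:
            return
        attributes = dict(attrs)
        # Attribute values can be None, so fall back to an empty string.
        if (attributes.get("name") or "").lower() == "generator":
            self.generator = attributes.get("content")


def find_generator(html):
    """Return the generator string from an HTML page, or None if absent."""
    finder = GeneratorFinder()
    finder.feed(html)
    return finder.generator


# Invented sample page; the tool name is a placeholder.
page = (
    '<html><head><meta name="generator" content="ExampleBlogTool 1.0">'
    "</head><body></body></html>"
)
print(find_generator(page))
```

When the tag is missing, `find_generator` returns `None`, which is the case where an indexer has to fall back on guessing from the page's structure.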
Lessig and Media Concentration
Lawrence Lessig posts a brilliant letter from an Aussie on the risks of media concentration. Australia is down to two media conglomerates now, which own pretty much everything, and aren't afraid to use their power.
Lessig has been campaigning against a 'deregulation' initiative that will lead to even …
Default English Blogging
The language detector is busy chugging its way through this morning's crop of 25,000 weblogs. It just told me it thinks the Drudge Report is in "Middle Frisian". That would explain everything!
As I peek at the foreign blogs accumulating in the crawl index, it occurs to me that life can be full of annoyances if you write a weblog in a languag…
Spring
Spring in Vermont - such a carnival! It feels like every plant, animal, and insect just got let out of prison, and is madly making up for lost time. It's also the time of year when the days get palpably longer, and already dusk stretches out towards nine o'clock. Spring is Vermont's underhanded way of trying to make me forget about the winter, but I won't be suckered. I still have those…
Blog Crawl Update
The results on the crawl statistics page are now separated (as befits a red-blooded American like me) into Us and The Rest of the World. That is, English language blogs, and all others. There's also a handy bar chart of language distribution so you can tell just which wily foreigners are blogging the most.
As time goes on, I…
auto lang
I've been playing with automatic language identification on the big weblog list, using a great little Perl script by Gertjan van Noord. The program reads the text and guesses the language based on letter clusters, and I've been running it on the blogs in my database all afternoon. A few minutes ago, I took a peek at the results, and foun…
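To give a feel for guessing a language from letter clusters, here is a toy Python sketch that compares ranked character n-gram profiles. This is one common way to approach the problem, and I am not claiming it is what van Noord's script does internally. The function names are made up, and the two one-sentence training samples are far too small for real use.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top=300):
    """Return the text's most frequent character n-grams, most common first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(top)]


def out_of_place(doc_profile, lang_profile):
    """Sum how far each n-gram's rank in the document is from its rank
    in the language profile; n-grams missing from it get a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else penalty
        for i, gram in enumerate(doc_profile)
    )


def guess_language(text, profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Tiny illustrative training samples (assumptions, not real corpora).
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and then "
               "it runs through the forest with the other animals",
    "spanish": "el rápido zorro marrón salta sobre el perro perezoso y "
               "luego corre por el bosque con los otros animales",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}

print(guess_language(SAMPLES["spanish"], PROFILES))
```

With profiles built from a few sentences, short or unusual inputs will be misclassified freely, which is roughly how a detector ends up with surprising guesses on pages that contain little running text.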
assign
An exciting assignment at work* - the boss says to me, "Go forth unto the Internet, and find me every weblog you can get your hands on!" It seems we need a large, live collection to prove our search algorithms on. Not exactly a Mt. Everest of data, but something more than the little molehills of documents we've conquered so far. So I have dutifully started crawling the Web, as well as asking…
Walrus Graphs
A beautiful gallery of graph visualizations from a program called Walrus. Most of the graphs are plain old workhorse data sets (large Web collections, directory trees, code files in a CVS repository), but the resulting images are breathtaking.
Walrus …
Drexler Keynote
One of the best talks at the Emerging Technology conference was a keynote presentation given by K. Eric Drexler, who coined the term "nanotechnology" and has been thinking about its implications for about twenty years.
Nanotechnology has been suffering from a mild buzzword hangover recently, c…
Search CG
If you're the kind of person reading my blog late on a Friday night, you just might be interested in Search::ContextGraph. Pour yourself a cold one, hit the link, and let the party begin!…
brevity is for the weak
Greatest Hits
The Alameda-Weehawken Burrito Tunnel
The story of America's most awesome infrastructure project.
Argentina on Two Steaks A Day
Eating the happiest cows in the world
Scott and Scurvy
Why did 19th century explorers forget the simple cure for scurvy?
No Evidence of Disease
A cancer story with an unfortunate complication.
Controlled Tango Into Terrain
Trying to learn how to dance in Argentina
Dabblers and Blowhards
Calling out Paul Graham for a silly essay about painting
Attacked By Thugs
Warsaw police hijinks
Dating Without Kundera
Practical alternatives to the Slavic Dave Matthews
A Rocket To Nowhere
A Space Shuttle rant
Best Practices For Time Travelers
The story of John Titor, visitor from the future
100 Years Of Turbulence
The Wright Brothers and the harmful effects of patent law
Every Damn Thing
Your Host
Maciej Cegłowski
maciej @ ceglowski.com
Threat
Please ask permission before reprinting full-text posts or I will crush you.