* The Internet’s Major Search Engines’ And Directories’ Dirty Little Secret

Their databases and directories are out-of-date and incomplete.

By Erik J. Heels

First published 4/2/1997; Law Practice Management magazine, “nothing.but.net” column; American Bar Association

The itsy bitsy spider (or robot or crawler) went up the water spout (or to the to-be-indexed Web site). Down came the rain (and the flood of flat-rate AOL users) and washed the spider out (the indexing task incomplete). Out came the Sun (and Microsoft, HotBot, UUNet, and others) and dried up all the rain (promising better and faster indexing in exchange for advertising revenue). Then the itsy bitsy spider (et al.) went up the spout again (but, tut-tut, it looks like rain).

The Internet’s major search engines and directories are sharing a dirty little secret. Their databases and directories are out-of-date and incomplete. Here’s a test you can run yourself. Edit your bookmarks file to include an entry for Yahoo’s pick-a-random-URL link (http://random.yahoo.com/bin/ryl/). Or better yet, make it your browser’s default starting page. (If you don’t understand why you need to edit the bookmarks file, try bookmarking the Yahoo random URL first. I made it my default starting page, reluctantly at first, because I preferred my old setup where my bookmarks file was my default starting page. If you have a relatively short bookmarks file, it can be quite convenient to view it as an HTML file. To maintain this functionality, and to overcome my initial reluctance, I simply added my bookmarks file to my bookmarks file. I love Netscape!)

Select that link 100 times and count how many errors you get, including bad server names, bad file names, and “we’ve moved” links (where the served page tells you the address of the new location of what you’re looking for). Here’s the result of my test. Of 100 randomly selected URLs from Yahoo’s directory, 25 were bad. Of those, 20 were either bad server names or bad file names, and five were of the we’ve-moved-here’s-our-new-URL variety. Assuming that Yahoo’s random feature actually selects a random URL from the Yahoo directory (a nontrivial assumption), my test has a ten percent margin of error, which means that anywhere from 15 to 35 percent of Yahoo’s directory is out of date. Yahoo? Yikes!
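If clicking a random link 100 times sounds tedious, the test can be scripted. Purely as an illustration (and in modern Python, which is obviously not what anyone was running in 1997), here is a minimal sketch that assumes the random-URL address above still answers; it counts dead servers and bad file names, although it cannot spot the “we’ve moved” pages that a human reader would.

```python
# A rough sketch of automating the random-URL test described above.
# Assumption: the random-URL address redirects to a randomly chosen
# directory entry each time it is requested.

import urllib.request
import urllib.error

RANDOM_URL = "http://random.yahoo.com/bin/ryl/"
SAMPLES = 100

bad = 0
for _ in range(SAMPLES):
    try:
        # urlopen follows the redirect to the randomly selected site;
        # if that site answers at all, count the URL as good.
        with urllib.request.urlopen(RANDOM_URL, timeout=10) as response:
            response.read(512)
    except (urllib.error.URLError, OSError):
        # Bad server names, bad file names, and timeouts all count as bad.
        bad += 1

print(f"{bad} of {SAMPLES} randomly selected URLs were bad ({100 * bad // SAMPLES}%).")
```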

The major search engines (Infoseek, Alta Vista, Lycos, HotBot, WebCrawler, and Excite) are also out of date. Each search engine claims that newly submitted URLs are added anywhere from immediately to within two weeks. But test this for yourself sometime. My experience has been that it takes at least six weeks for a site’s home page to get added, and even longer for all of the pages from that site to be indexed. If you believe (as I do) that Net years are the equivalent of dog years (roughly seven and a half weeks), then the search engines are approximately one full Net year out of date!

When I contacted the search engine Webmasters about this problem, their replies were eye-opening. “We have encountered unexpected delays,” said one. The estimate on our Web page “was a little too optimistic,” said another. A third replied, “It’s a big Net.” And (my personal favorite), “All you can do is submit your URL and pray.” My e-mail to webmaster@altavista.digital.com bounced, but that’s fodder for another column.

So browser beware. If you are using the search engines and directories while working the Web, you may be dealing with information that is out of date. I predict that various specialty search engines – indexing fewer sites better and faster – will spring up to fill the need for searchable, current, reliable information. Imagine a search engine that just searched Web-based magazines! We could call it … Lexis! Already, there are directory services (such as FindLaw) that cover the legal/Internet market better and faster than general purpose directories (such as Yahoo). Subject-oriented search engines are a logical extension of subject-oriented directories.

File Not Found

The problem of Web pages being 404 (i.e., not found; 404 is the error code returned by most servers for bogus filenames) is not limited to huge directories like Yahoo. The problem is everywhere. Files that are bookmarked or linked to suddenly disappear. And the 404 error message is about as useful as the DOS “Abort, Retry, Fail?” error message. An old house can be charming. But an old Web site can be downright annoying.

The solution to the 404 error message problem is for Web designers to make Web sites – old and new – more user friendly. When you enter an incorrect key sequence into a software program, it should react in a predictable and friendly way. The same is true for Web sites. If I enter http://www.yahoo.com/bogus-filename.html, I expect to receive a helpful error message.
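What might a helpful error message look like from the server’s side? Here is a minimal sketch, again in modern Python and using its standard http.server module purely as a stand-in for whatever server software your site actually runs, of a server that answers a bogus filename with an apology and a pointer home instead of a bare “404.”

```python
# A sketch of a "friendly 404": when a requested file does not exist,
# send a short, helpful page instead of the server's stock error.

from http.server import HTTPServer, SimpleHTTPRequestHandler

FRIENDLY_404 = b"""<html><body>
<h1>Sorry, that page isn't here.</h1>
<p>The file you asked for may have moved or been renamed.</p>
<p>Try the <a href="/">home page</a> instead.</p>
</body></html>"""

class FriendlyHandler(SimpleHTTPRequestHandler):
    def send_error(self, code, message=None, explain=None):
        if code == 404:
            self.send_response(404)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(FRIENDLY_404)))
            self.end_headers()
            if self.command != "HEAD":  # HEAD responses carry no body
                self.wfile.write(FRIENDLY_404)
        else:
            super().send_error(code, message, explain)

if __name__ == "__main__":
    # Serves the current directory on port 8000; request any
    # nonexistent file name to see the friendly page.
    HTTPServer(("", 8000), FriendlyHandler).serve_forever()
```

The point is not the particular server software; it is that the error path deserves the same design attention as the home page.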

Yahoo may not have conquered the problem of keeping its links current, but it has figured out that Web sites can be user friendly. By programming its Web server software to return something intelligent (rather than simply “404”), Yahoo has incorporated software user interface design techniques into the design of its Web site. Yes, Virginia, designing good Web sites does take more than a passing knowledge of HTML.

In fact, whenever I hear of a Web site that professes to be the final word on Web site design, I append “bogus-filename.html” to the end of its URL. Sites that have failed the test include Microsoft (http://www.microsoft.com/bogus-filename.html), Netscape (http://www.netscape.com/bogus-filename.html), Killer Web Sites (http://www.killersites.com/bogus-filename.html), and Web Pages That Suck (http://www.webpagesthatsuck.com/bogus-filename.html). Sites that have passed the test include Yahoo and Apple (http://www.apple.com/bogus-filename.html). Is anybody surprised that Apple has created a user-friendly Web site? That Microsoft hasn’t?
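For readers who want to run the bogus-filename test in bulk, here is a rough sketch. The site list is just the examples above, and the “does this look like a custom page” guess is my own crude heuristic, not any kind of standard.

```python
# Append a nonexistent file name to each home page and report what
# comes back. A site "passes" if its answer looks like more than the
# server's stock error page.

import urllib.request
import urllib.error

SITES = [
    "http://www.microsoft.com/",
    "http://www.netscape.com/",
    "http://www.yahoo.com/",
    "http://www.apple.com/",
]

for site in SITES:
    url = site + "bogus-filename.html"
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            status, body = response.status, response.read(2048)
    except urllib.error.HTTPError as err:
        status, body = err.code, err.read(2048)
    except (urllib.error.URLError, OSError):
        print(f"{site}: no answer at all")
        continue
    text = body.decode("latin-1", "replace").lower()
    # Crude heuristic: stock error pages are short and terse; custom
    # pages tend to apologize and offer links or a search box.
    verdict = "custom error page" if len(text) > 500 or "search" in text else "stock 404"
    print(f"{site}: HTTP {status}, looks like a {verdict}")
```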

While searching for documentation about the “404” problem, I found a few more interesting examples. The online magazine “Error 404” has it almost figured out (http://cban.worldgate.edmonton.ab.ca/error404/bogus-filename.html). The Grand Rapids Toastmasters Club Number 404 (http://www.iserv.net/~grtoast/bogus-filename.html) doesn’t. And I especially liked the 404 page from a Web site in Norway (http://www.mimer.no/error/404.html) because the shades-wearing thug-like cartoon dudes appear to be suggesting that the problem is your fault. (I don’t understand Norwegian; I could be wrong.)

The Gerber Effect

The problem with running the bogus-filename test on your site and then comparing your site to your competition’s is that sooner or later your competition will figure it out. I call this the Gerber effect. Gerber, long the leader in baby food, found itself being attacked and challenged by upstart Beech-Nut. On its packaging, Beech-Nut claimed that its fruit baby food was 100% fruit, while Gerber’s was 40-something percent. I don’t know the exact number, because the “40-something” on my box had been carefully covered up with a “54%” sticker. Most (if not all) of Gerber’s fruit baby foods are also now 100% fruit, and Beech-Nut is spending its spare time (no doubt at the urging of some judge somewhere) adding stickers to its packaging. (Can you tell that I have young children?)

Tools You Can Use

So what’s a writer to do? If you write an article for a magazine and it appears on the Web (like this), should you edit your article – after the fact – to update your URLs because somebody else moved or deleted files on their servers? What would George Orwell do? A partial solution is avoidance and denial: only include top-level links to Web sites (i.e., those with top-level URLs such as http://www.abanet.org/) and not Web pages at all. Another solution (and one that I follow for nothing.but.net) is to leave the browsed page exactly as the printed page (for historical accuracy) and to edit the underlying HTML code to point to the new location. If you find a bad URL in this column, please let me know!

There are also tools that can help you diagnose and fix HTML and other problems with your Web site. One of my favorites is Doctor HTML (http://www2.imagiware.com/RxHTML/), which lets you analyze your (or somebody else’s) Web site for HTML problems. WebLint (http://www.khoros.unm.edu/staff/neilb/weblint/gateways.html) is a noncommercial version of the same. Another useful tool is URL-minder (http://www.netmind.com/URL-minder/), a service that sends you e-mail whenever a particular URL that you’ve entered into its database has been updated. Come to think of it, I should use that service to check this column!
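If you are curious what such a tool does under the hood, a bare-bones link checker is not much code. Here is a sketch, once more in modern Python, that fetches a single page, collects its links, and reports the ones that no longer answer; the starting address is a placeholder to replace with your own page.

```python
# A bare-bones link checker in the spirit of Doctor HTML and WebLint:
# fetch one page, pull out its links, and report the dead ones.

import urllib.error
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def check_links(page_url):
    with urllib.request.urlopen(page_url, timeout=10) as response:
        parser = LinkCollector()
        parser.feed(response.read().decode("utf-8", "replace"))
    for href in parser.links:
        target = urljoin(page_url, href)
        if not target.startswith("http"):
            continue  # skip mailto:, ftp:, and friends
        try:
            urllib.request.urlopen(target, timeout=10).close()
            print(f"OK    {target}")
        except urllib.error.HTTPError as err:
            print(f"{err.code}   {target}")
        except (urllib.error.URLError, OSError):
            print(f"DEAD  {target}")

if __name__ == "__main__":
    check_links("http://www.example.com/")  # replace with your own page
```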

As the Web becomes more interactive, new tools will be needed to test and verify the functionality of Web pages. And new special-purpose search engines will be developed as an alternative to the overloaded general-purpose search engines. Java applets and Shockwave plug-ins are increasingly being used to add functionality and interactivity to Web pages. In fact, as the Web becomes more dependent on programming, and less on static HTML, we may not even refer to “Web pages” anymore. And we may find ourselves yearning for the good old days of the Web … a couple of months ago!