In the course of writing about this, he incidentally makes an interesting assertion:
Now, I'm not sure that's Google's only advantage -- they have their fingers in a lot of pies, not least of which is ad delivery. But I'm more interested in the question of whether Google is or is not the best search engine. I did some testing recently, motivated by a James Fallows article about a study of search engines financed by Dogpile.com (a search engine aggregator). The conclusion reached by the study was that no individual search engine is providing complete results, so you need a search engine aggregator to get the full picture.
I don't think the conclusions are correct, because the study's methodology was based on unique URLs rather than on whether the results were actually useful to a human being. I sent the following email to Fallows (the spreadsheet referred to in the text is here -- sorry about the awful MS-generated HTML; I didn't have time to redo it properly).
There are a couple of big problems with the story:
1. If you run a search on Google, Yahoo, Ask and Live and then run the same search on Dogpile, the Dogpile results do not actually replicate what shows up in the search engines it claims to include.
2. Dogpile returns a lot of bogus search results.
Attached is a spreadsheet that tallies up what's going on for "antiquarian music," a search term of interest to a client of mine (I am their webmaster, programmer and IT support person). What it shows:
1. Eight of the 20 results on Dogpile's first page are IRRELEVANT to the sought-after results. All 10 results on the first page of each of the four individual search engines are relevant (though not all equally so).
2. None of the Ask.com results are included, despite the fact that Dogpile's search page claims that it's searching Ask.com.
Now, about the individual search results:
Google is by far the most relevant. While it doubles up for two sites, all the other results are relevant, being legitimate antiquarian music dealers. The only exception is the last entry, from the Ex Libris mailing list (antiquarian librarians), which is actually an announcement of a catalog by the dealer listed at #9 -- so if #9 is relevant, I think that one is, too; it certainly gives you information directing you to an antiquarian music dealer.
Yahoo includes two links to Harvard Library pages that are not useful (they aren't selling anything), as well as a link to Theodore Front and Schott, both of whom are music publishers/distributors that no longer sell any antiquarian music materials. It also includes the Antiquarian Funks, a Dutch musical group, which obviously doesn't belong, though it takes more than simple computer knowledge to understand that (though Google seems smart enough to figure it out!).
Ask.com adds Katzbichler, a music antiquarian in Munich who doesn't appear in the top 10 results of others, but also includes a worthless link to antiquarian music books on toplivemusic.com, which has nothing at all on it that is relevant to the search. It also gives top billing to Schott, who really offers no significant antiquarian music materials. It also includes a link to a republication of the Open Directory's (DMoz.org) listing for antiquarian music. These listings are republished all over the net and basically just replicate links already found in the main listings.
Live.com also includes the Schott link, as well as two links to the American Antiquarian Society's page on sheet music. These may or may not be relevant, depending on the individual user: I doubt someone looking for antiquarian sheet music would omit the term "sheet music" from a search, and someone looking for antiquarian music dealers would not be helped by these links. It also includes the Antiquarian Funks and the unhelpful Open Directory category listing.
So, in short, for this particular search:
1. Dogpile
a. misrepresents its results (it doesn't include what it says it does), and
b. does a worse job than any of the individual search engines at providing useful links.
2. Of the individual search engines, Google provides clearly superior results, as it filters out several links that don't belong (e.g., Schott, Theodore Front, Antiquarian Funks), though how Google knows such complex information is tough to say.
So, for this particular search, the conclusions of the cited study do not apply. I would expect that there are a number of such searches for which that is the case.
I cannot tell from the description of methodology what could cause this kind of discrepancy, but I am bothered by this on p. 11 of the PDF about the study:
When the display URL on one engine exactly matched the display URL from one or more engines of the other engines a duplicate match was recorded for that keyword.
The problem with that is that it treats any two different URLs as distinct results, even when they are equivalent to a searcher. A deep-linked page might be just as useful as a link to the home page of the entire website -- it depends on the type of website and the type of search. In my spreadsheet, I counted all links to any one website as equivalent, no matter which page was linked, because for this particular search, that's how a human being would treat them.
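To make the point concrete, here is a minimal sketch (using made-up URLs and a deliberately simplified "same website" rule) of the difference between the study's exact display-URL matching and the site-level equivalence I applied in my spreadsheet:

```python
from urllib.parse import urlparse

def same_site(url_a, url_b):
    """Treat two results as equivalent if they point at the same website,
    regardless of which page within the site is linked."""
    def host(url):
        h = urlparse(url).netloc.lower()
        return h[4:] if h.startswith("www.") else h
    return host(url_a) == host(url_b)

# Hypothetical results: a dealer's home page and a deep-linked catalog page.
home = "http://www.example-dealer.com/"
deep = "http://example-dealer.com/catalogs/antiquarian.html"

print(home == deep)           # exact display-URL match: counted as two unique results
print(same_site(home, deep))  # site-level match: counted as one equivalent result
```

Whether site-level equivalence is the right rule depends on the search, as noted above -- for a newspaper article it would be too coarse -- but for a "find me a dealer" search it matches how a person reads the results page.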
So, I would say that this emphasis on unique URLs is going to skew the results for certain classes of websites and for certain types of searches. Yes, a search that takes you to a specific article on the Washington Post's website is going to be much more helpful than a link to the paper's home page, but for searches like my example, that's just not the case.
Secondly, the emphasis on unique URLs also fails to reflect the different search engines' methods of eliminating duplicates. There can be more than one path to the same information, and if the search engines do not all choose the same path, the study's mechanism would count those URLs as different even though they lead to the exact same information.
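A rough sketch of that second problem, again with hypothetical URLs: trivially different paths to the same page (scheme, a leading "www.", a trailing slash) all read as "unique" under exact matching, but collapse to one result under even a crude normalization:

```python
from urllib.parse import urlparse

def normalize(url):
    """Collapse common 'different paths to the same page' variants:
    scheme, a leading 'www.', and a trailing slash."""
    p = urlparse(url)
    host = p.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = p.path.rstrip("/") or "/"
    return host + path

# Hypothetical variants of one page, as different engines might display it.
variants = [
    "http://www.example.com/catalog/",
    "https://example.com/catalog",
    "http://example.com/catalog/",
]
print(len(set(variants)))                     # 3 "unique" URLs by exact matching
print(len({normalize(u) for u in variants}))  # 1 page after normalization
```

Real duplicate elimination is messier than this (session IDs, mirrors, redirects), but even these trivial variants are enough to inflate a unique-URL count.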
This study was designed in a way that was guaranteed to make a meta search engine like Dogpile appear to be better. But that conclusion is a statistical ghost produced by the methodology: over-reliance on computer-based determination of URL identity, instead of evaluating, from a human being's point of view, whether different URLs have equivalent value.