Google wants to explore the deep web

by Patrick Altoft on February 23, 2009

The NY Times has an article today discussing how Google and other search engines are trying to index the “deep web” – databases and other content that were previously invisible.

Last year Google started entering random keywords into millions of search forms on the web to try to expose the databases of results that lie behind them.

To extract meaningful data from the Deep Web, search engines have to analyze users’ search terms and figure out how to broker those queries to particular databases. For example, if a user types in “Rembrandt,” the search engine needs to know which databases are most likely to contain information about art (say, museum catalogs or auction houses), and what kinds of queries those databases will accept.

That approach may sound straightforward in theory, but in practice the vast variety of database structures and possible search terms poses a thorny computational challenge.

“This is the most interesting data integration problem imaginable,” says Alon Halevy, a former computer science professor at the University of Washington who is now leading a team at Google that is trying to solve the Deep Web conundrum.

Google’s Deep Web search strategy involves sending out a program to analyze the contents of every database it encounters. For example, if the search engine finds a page with a form related to fine art, it starts guessing likely search terms — “Rembrandt,” “Picasso,” “Vermeer” and so on — until one of those terms returns a match. The search engine then analyzes the results and develops a predictive model of what the database contains.
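To make the guessing idea concrete, here is a minimal Python sketch of probing a search form with candidate terms. It is not Google's actual crawler: the catalog, the `fake_search` stand-in for a site's search form, and the helper names `probe_form` and `summarize` are all hypothetical, and the "predictive model" is reduced to a simple word count. Unlike the description above, it tries every candidate rather than stopping at the first match.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List

# Hypothetical catalog sitting behind a search form (illustration only).
CATALOG = [
    "Rembrandt self portrait oil painting",
    "Vermeer interior oil painting",
    "Rembrandt etching landscape",
]


def fake_search(term: str) -> List[str]:
    """Stand-in for submitting a term to a site's search form."""
    return [row for row in CATALOG if term.lower() in row.lower()]


def probe_form(
    search: Callable[[str], List[str]], candidates: Iterable[str]
) -> Dict[str, List[str]]:
    """Submit each candidate term and keep only the ones that return results."""
    hits: Dict[str, List[str]] = {}
    for term in candidates:
        results = search(term)
        if results:
            hits[term] = results
    return hits


def summarize(hits: Dict[str, List[str]]) -> Counter:
    """Count words across the returned records as a crude model of the contents."""
    counts: Counter = Counter()
    for results in hits.values():
        for record in results:
            counts.update(word.lower() for word in record.split())
    return counts


hits = probe_form(fake_search, ["Rembrandt", "Picasso", "Vermeer"])
model = summarize(hits)
print(sorted(hits))          # terms that matched something
print(model.most_common(3))  # most frequent words in the matched records
```

A real crawler would also have to rate-limit its requests and decide which result pages are worth indexing, which this sketch leaves out.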

Patrick Altoft is Director of Search at Leeds-based digital & SEO agency Branded3. Patrick also runs Blogstorm.

{ 4 comments }

Andy 23 Feb 2009 at 6:34 pm

Hmm, yes. Great plan that was.

The old search app and product database we used to have on one of our sites got hit hundreds of thousands of times, and now that we’ve removed it and replaced it with another, all those spidered search results are coming up as dead links in Webmaster Tools…

The de-index request in Webmaster Tools doesn’t allow the use of wildcards to remove dynamic URLs indexed in this manner, so, much as I’d like to, I can’t just get the whole lot dropped.

Hopefully the problem will be mitigated by a 301 redirect I added at the weekend when a complete traffic wipe-out alerted us to trouble… though whether it was this trouble or some other trouble I know not.

Nick Stamoulis 23 Feb 2009 at 10:18 pm

I think we sometimes forget just how fresh and new the internet really is. It has really only grown in popularity over the last ten years or so, and I think even Google is always fine-tuning and making it much more reliable.

Matthew Theobald 24 Feb 2009 at 4:36 am

For a complete deep web solution, the Internet Search Environment Number has developed an ID and cataloging system that will soon be ready for prime time.
Hope you review what materials we have on isen.org and blog.isen.org and take an interest!

Thanks!

-m@

Smith 06 Mar 2009 at 11:26 am

The internet is growing so vast that we can’t imagine how big it has become. Google’s Deep Web search strategy involves sending out a program to analyze all the contents of every database it encounters. The Keysearch Analytics blog has tips, tricks, advice and case studies for the search marketing industry.
