Google wants to explore the deep web

by Patrick Altoft

The NY Times has an article today discussing how Google and other search engines are trying to index the “deep web” – databases and other content that were previously invisible.

Last year Google started entering random keywords into millions of search forms on the web to try to expose the databases of results that lie behind them.

To extract meaningful data from the Deep Web, search engines have to analyze users’ search terms and figure out how to broker those queries to particular databases. For example, if a user types in “Rembrandt,” the search engine needs to know which databases are most likely to contain information about art (say, museum catalogs or auction houses), and what kinds of queries those databases will accept.
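To make the brokering idea concrete, here is a minimal sketch in Python. It assumes each database has already been summarised as a set of terms it is known to contain; the database names, the `PROFILES` table and the `rank_databases` function are all hypothetical and are not taken from the article or from Google. It simply ranks databases by how many query terms they share.

```python
def rank_databases(query: str, profiles: dict[str, set[str]]) -> list[str]:
    """Return database names ordered by how many query terms each one contains."""
    terms = set(query.lower().split())
    # Score each database by the overlap between the query and its known terms.
    scores = {name: len(terms & vocab) for name, vocab in profiles.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Drop databases that share no terms with the query at all.
    return [name for name, score in ranked if score > 0]


# Hypothetical term profiles, invented purely for illustration.
PROFILES = {
    "museum_catalog": {"rembrandt", "vermeer", "painting"},
    "auction_house": {"rembrandt", "lot", "estimate"},
    "used_cars": {"sedan", "mileage"},
}

print(rank_databases("Rembrandt painting", PROFILES))
```

A real system would presumably weight terms and also track which query formats each database accepts; plain set overlap is only the simplest stand-in for that.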

That approach may sound straightforward in theory, but in practice the vast variety of database structures and possible search terms poses a thorny computational challenge.

“This is the most interesting data integration problem imaginable,” says Alon Halevy, a former computer science professor at the University of Washington who is now leading a team at Google that is trying to solve the Deep Web conundrum.

Google’s Deep Web search strategy involves sending out a program to analyze the contents of every database it encounters. For example, if the search engine finds a page with a form related to fine art, it starts guessing likely search terms — “Rembrandt,” “Picasso,” “Vermeer” and so on — until one of those terms returns a match. The search engine then analyzes the results and develops a predictive model of what the database contains.
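The probing step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Google's actual crawler: `fake_museum_form` and `CATALOG` are hypothetical stand-ins for a real search form and the database behind it, and the "predictive model" is reduced to a simple word count over the results. Unlike the description, the sketch keeps probing after the first match so that the profile covers more of the database.

```python
from collections import Counter
from typing import Callable, Iterable


def probe_form(
    submit: Callable[[str], list[str]],
    seed_terms: Iterable[str],
    max_probes: int = 10,
) -> Counter:
    """Submit guessed terms to a form and count the words seen in the results."""
    vocabulary: Counter = Counter()
    for probes, term in enumerate(seed_terms):
        if probes >= max_probes:
            break  # stay polite: cap the number of requests per form
        results = submit(term)
        if not results:
            continue  # this guess returned no match, try the next one
        for row in results:
            vocabulary.update(word.lower() for word in row.split())
    return vocabulary


# Hypothetical database sitting behind a fine-art search form.
CATALOG = [
    "Rembrandt The Night Watch 1642",
    "Vermeer Girl with a Pearl Earring 1665",
    "Rembrandt Self-Portrait 1659",
]


def fake_museum_form(term: str) -> list[str]:
    """Stand-in for an HTTP form submission: return rows containing the term."""
    return [row for row in CATALOG if term.lower() in row.lower()]


profile = probe_form(fake_museum_form, ["Picasso", "Rembrandt", "Vermeer"])
print(profile.most_common(3))
```

The `max_probes` cap is a design choice worth noting: as the first comment below shows, unbounded probing can hit a site's search app hundreds of thousands of times.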

Patrick Altoft is Director of Search at Branded3, a Leeds SEO & Digital Agency specialising in SEO, Web Design, Development & Social Media.

Comments

February 23, 2009 at 6:34pm

Hmm, yes. Great plan that was.

The old search app and product database we used to have on one of our sites got hit hundreds of thousands of times. Now that we’ve removed it and replaced it with another, all those spidered search results are coming up as dead links in Webmaster Tools…

The de-index request in Webmaster Tools doesn’t allow the use of wildcards to remove dynamic URLs indexed in this manner so, much as I’d like to, I can’t just get the whole lot dropped.

Hopefully the problem will be mitigated by a 301 redirect I added at the weekend when a complete traffic wipe-out alerted us to trouble… though whether it was this trouble or some other trouble I know not.

February 23, 2009 at 10:18pm

I think we sometimes forget just how fresh and new the internet really is. It has only really grown in popularity over the last ten years or so, and I think even Google is still fine-tuning its search and making it much more reliable.

February 24, 2009 at 4:36am

For a complete deep web solution, the Internet Search Environment Number has developed an ID and cataloging system that will soon be ready for prime time.
We hope you will review the materials we have on isen.org and blog.isen.org and take an interest!

Thanks!

-m@

March 6, 2009 at 11:26am

The internet is growing so vast that we can’t imagine how big it has become. Google’s Deep Web search strategy involves sending out a program to analyze all the contents of every database it encounters. The Keysearch Analytics blog has tips, tricks, advice and case studies for the search marketing industry.
