Behind The Scenes At Google

SearchEngineWorld has a good article that delves into how Google indexes and caches sites (or books), although it is written primarily with Google Print in mind.

As far as I can understand, the index is a database of the words found in each page's content, stored along with useful information about each word (such as whether it appeared in a heading or was highlighted in bold). This makes matching people's queries against pages quicker and more effective (imagine Google having to search 8 billion pages on every search!).
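To make the idea concrete, here is a minimal sketch of that kind of word index in Python. It is only an illustration of the general technique (an inverted index), not Google's actual implementation, which isn't public. The names `build_index`, `search` and `EMPHASIS_TAGS` are my own, and the ranking rule (pages where the word is in a heading or bold come first) is an assumption made for the example.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Assumption for this sketch: these tags count as "heading or bold".
EMPHASIS_TAGS = {"h1", "h2", "h3", "b", "strong"}


class _WordCollector(HTMLParser):
    """Collects (word, emphasised) pairs from a page's HTML."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the most recently opened matching tag, if there is one.
        if tag in self.open_tags:
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx]

    def handle_data(self, data):
        emphasised = any(t in EMPHASIS_TAGS for t in self.open_tags)
        for word in re.findall(r"[a-z0-9]+", data.lower()):
            self.words.append((word, emphasised))


def build_index(pages):
    """Map each word to {url: was_emphasised_anywhere_on_that_page}."""
    index = defaultdict(dict)
    for url, html in pages.items():
        collector = _WordCollector()
        collector.feed(html)
        collector.close()
        for word, emphasised in collector.words:
            index[word][url] = index[word].get(url, False) or emphasised
    return index


def search(index, query):
    """Return pages containing every query word, emphasised matches first."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set(index.get(words[0], {}))
    for word in words[1:]:
        hits &= set(index.get(word, {}))
    # Score = number of query words that were emphasised on the page.
    return sorted(hits, key=lambda url: (-sum(index[w][url] for w in words), url))
```

The point is that a search only looks up the query words in the index and intersects the small lists of pages it finds, rather than rereading every page. For example, after `index = build_index({"a": "<h1>Google index</h1>", "b": "<p>google index</p>"})`, calling `search(index, "google index")` only touches the entries for "google" and "index".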

The cache is a copy of the entire page that can be brought up from Google’s servers if needed. If site owners are worried about a copy of their document sitting on Google’s servers, it seems it is possible to opt out.

Book publishers in particular were worried that Google Print would basically mean a complete copy of their books was available online. In fact, Google would be indexing the words of the book for people to search on and letting them know which book the words came from, but not saving an entire copy of the book for them to read online. It’s a legal grey area that will no doubt take some time to resolve.

This entry was posted on Monday, October 24th, 2005 at 12:18 pm and is filed under SEO, Google.
