Process of website indexing by Google & other Search Engines

Written by: Atul Gupta

Article Overview: There is a lot of speculation about how search engines index websites. The exact workings of the indexing process remain something of a mystery, since most search engines offer limited information about how they architect it. Webmasters get some clues from the crawler visits recorded in their log reports, but they cannot tell how the indexing happens or which pages of their website were really crawled.



While the speculation about the search engine indexing process may continue, here is a theory, based on experience, research and clues, about how they may go about indexing 8 to 10 billion web pages every so often, and why newly added pages are slow to show up in their index. This discussion is centered on Google, but we believe that most popular search engines, such as Yahoo and MSN, follow a similar pattern.

* Google runs from about 10 Internet Data Centers (IDCs), each having 1000 to 2000 Pentium-3 or
Pentium-4 servers running Linux OS.

* Google has over 200 (some think 'over 1000') crawlers / bots scanning the web each day. These
do not necessarily follow an exclusive pattern, which means different crawlers may visit the
same site on the same day, not knowing other crawlers have been there before. This is probably
what produces the 'daily visit' records in your traffic log reports, keeping webmasters very
happy about the frequent visits.

* Some crawlers' only job is to grab new URLs (let's call them 'URL Grabbers' for convenience).
The URL grabbers collect the links & URLs they detect on various websites (including links
pointing to your site) and the old/new URLs they detect on your site. They also capture the
'date stamp' of files when they visit your website, so that they can identify 'new content' or
'updated content' pages. The URL grabbers respect your robots.txt file & Robots Meta Tags so
that they can include / exclude URLs you want / do not want indexed. (Note: the same URL with
different session IDs is recorded as several different 'unique' URLs. For this reason, session
IDs are best avoided, otherwise the pages can be mistaken for duplicate content.) The URL
grabbers spend very little time & bandwidth on your website, since their job is rather simple.
However, just so you know, they need to scan 8 to 10 billion URLs on the web each month. Not a
petty job in itself, even for 1000 crawlers.

* The URL grabbers write the captured URLs, with their date stamps and other status details, to
a 'Master URL List' so that these can be deep-indexed by other special crawlers.

* The master list is then processed and classified into categories somewhat like these:
a) New URLs detected
b) Old URLs with new date stamp
c) 301 & 302 redirected URLs
d) Old URLs with old date stamp
e) 404 error URLs
f) Other URLs

* The real indexing is done by (what we're calling) 'Deep Crawlers'. A deep crawler’s job is to
pick up URLs from the master list and deep crawl each URL and capture all the content - text,
HTML, images, flash etc.

* Priority is given to ‘Old URLs with new date stamp’ as they relate to already indexed but
updated content. ‘301 & 302 redirected URLs’ come next in priority, followed by ‘New URLs
detected’. High priority is given to URLs whose links appear on several other sites. These are
classified as 'important' URLs. Sites and URLs whose date stamp and content change on a daily
or hourly basis are 'stamped' as 'News' sites, which are indexed hourly or even on a
minute-by-minute basis.

* ‘Old URLs with old date stamp’ and ‘404 error URLs’ are ignored altogether. There is no point
wasting resources indexing ‘Old URLs with old date stamp’, since the search engine already has
that content indexed and it has not been updated since. ‘404 error URLs’ are URLs collected
from various sites that turn out to be broken links or error pages. These URLs do not show any
content.

* The 'Other URLs' may include dynamic URLs, URLs with session IDs, PDF documents, Word
documents, PowerPoint presentations, multimedia files etc. Google needs to process these
further and assess which ones are worth indexing and to what depth. It perhaps allocates the
indexing of these to 'Special Crawlers'.

* When Google 'schedules' the 'Deep Crawlers' to index 'New URLs' and '301 & 302 redirected
URLs', just the URLs (not the descriptions) start appearing in the search engine results pages
when you run the search "site:www.domain.com" in Google. These are called 'supplemental
results', which means that the Deep Crawlers will index the content 'soon', when they get the
time to do so.

* Since Deep Crawlers need to crawl 'Billions' of web pages each month, they take as many as 4
to 8 weeks to index even updated content. New URLs may take longer to index.

* Once the Deep Crawlers index the content, it goes into their originating IDCs. The content is
then processed, sorted and replicated (synchronized) to the rest of the IDCs. A few years back,
when the data size was manageable, this data synchronization used to happen once a month and
last for 5 days, a period called the 'Google Dance'. Nowadays, the data synchronization happens
constantly, which some people call 'Everflux'.

* When you hit www.google.com from your browser, you can land at any of their 10 IDCs depending
upon their speed and availability. Since the data at any given time is slightly different at
each IDC, you may get different results at different times or on repeated searches of the same
term (Google Dance).

* The bottom line is that one needs to wait as long as 8 to 12 weeks to see full indexing in
Google. One should consider this as 'cooking time' in 'Google's kitchen'. Apart from increasing
the 'importance' of your web pages by getting several incoming links from good sites, there is
no way to speed up the indexing process, unless you personally know Sergey Brin & Larry Page
and have significant influence over them.

* Dynamic URLs may take longer to index (sometimes they do not get indexed at all), since even
a small amount of data can generate an unlimited number of URLs, which can clutter the Google
index with duplicate content.
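
The classification of the master list and the crawl priorities described above can be sketched in a few lines of Python. This is only an illustration of the theory in this article, not Google's actual code: the names UrlRecord, classify and deep_crawl_queue are hypothetical, the 'importance' of a URL is reduced to a simple count of inbound links, and using that count only to break ties within a category is our own assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Category(Enum):
    NEW = "New URLs detected"
    UPDATED = "Old URLs with new date stamp"
    REDIRECTED = "301 & 302 redirected URLs"
    UNCHANGED = "Old URLs with old date stamp"
    NOT_FOUND = "404 error URLs"
    OTHER = "Other URLs"


# Deep-crawl order as described in the article. Unchanged and 404 URLs are
# ignored, and 'Other' URLs are assumed to go to separate special crawlers.
CRAWL_ORDER = [Category.UPDATED, Category.REDIRECTED, Category.NEW]


@dataclass
class UrlRecord:
    url: str
    status: int                    # HTTP status seen by the URL grabber
    date_stamp: str                # date stamp on this visit, e.g. "2006-03-01"
    indexed_stamp: Optional[str]   # date stamp when last indexed, None if never
    inbound_links: int = 0         # hypothetical stand-in for 'importance'
    is_dynamic_or_document: bool = False  # session IDs, PDF, Word, media, ...


def classify(rec: UrlRecord) -> Category:
    """Place one master-list entry into one of the six categories."""
    if rec.status == 404:
        return Category.NOT_FOUND
    if rec.status in (301, 302):
        return Category.REDIRECTED
    if rec.is_dynamic_or_document:
        return Category.OTHER
    if rec.indexed_stamp is None:
        return Category.NEW
    # ISO-formatted dates compare correctly as plain strings.
    if rec.date_stamp > rec.indexed_stamp:
        return Category.UPDATED
    return Category.UNCHANGED


def deep_crawl_queue(master_list: List[UrlRecord]) -> List[UrlRecord]:
    """Return the URLs worth deep crawling, highest priority first."""
    queue = [r for r in master_list if classify(r) in CRAWL_ORDER]
    # Sort by category priority, then by inbound links (most links first).
    queue.sort(key=lambda r: (CRAWL_ORDER.index(classify(r)), -r.inbound_links))
    return queue
```

Passing a list of UrlRecord entries to deep_crawl_queue returns updated pages first, then redirected URLs, then new URLs, with better-linked pages ahead of the rest inside each group. Old unchanged pages and 404 URLs never enter the queue at all.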

Summary & Advice:

1. Ensure that you have cleared all roadblocks for crawlers and they can freely visit your site
and capture all URLs. Help crawlers by creating good interlinking and sitemaps on your
website.
2. Get lots of good incoming links to your pages from other websites to improve the 'importance'
of your web pages. There is no special need to submit your website to search engines. Links to
your website on other websites are sufficient.
3. Patiently wait for 4 to 12 weeks for the indexing to happen.
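
Two of the roadblocks mentioned in this article, a robots.txt file that blocks pages by mistake and session IDs in URLs, can be checked with Python's standard library. The sketch below is a rough aid, not a complete audit tool: the helper names strip_session_id and blocked_urls are hypothetical, the list of session parameter names is an assumption (use whatever your own platform emits), and only session IDs passed as query parameters are handled.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

# Assumed parameter names; adjust to the ones your own site produces.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}


def strip_session_id(url: str) -> str:
    """Return the URL without session-ID query parameters."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def blocked_urls(robots_txt: str, urls, user_agent: str = "Googlebot"):
    """List the URLs that the given robots.txt text disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if not parser.can_fetch(user_agent, u)]
```

Running blocked_urls over the pages you want indexed should return an empty list. If two URLs become identical after strip_session_id, a crawler that records them separately is seeing the same page twice.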

Disclaimer: The actual functioning and exact architecture of the search engines may vary but in essence, this is what we believe they do.


© Copyright 2006, RedAlkemi
