Feedback Form
Home Features Mastermind Forums About Advertise Blog Network Contact Be An Author

How Google Applies Science to Search

How Google Applies Science to Search

Dr. Craig Nevill-Manning is a New Zealander who joined Google in 2000 as a Senior Research Scientist to develop more precise search techniques. Previously, Craig was an assistant professor at the Computer Science Department of Rutgers University, where he conducted research in data compression, information retrieval and computational biology. Before that, he was a post-doctoral fellow in the Biochemistry Department of Stanford University, where he developed a software suite used by pharmaceutical research laboratories to identify the role of particular proteins within cells.

A scientist at heart, Craig is probably best known as the developer of Froogle (recently re-named Google Product Search) and the founder of Google's software engineering center in New York City. Google New York is responsible for developing products including Google Maps, Google Finance, Google Spreadsheets, and many important features in web search and advertising. This article is a summary of his presentation at Webstock 2008 (http://www.webstock.org.nz/).


Google’s Spelling Bee

Craig started his presentation by talking about one of his first challenges: Google's spelling correction tool. As the popularity of the search engine grew, Google needed to be able to spell-correct lots of obscure words. So his solution was to take a sampling of content from the entire web. Craig's team came up with a algorithmic model and ran it over the web. He discovered that there were several correct answers to the same question. For example, words like “kofee” could mean either the searcher is seeking a cup of java or information about Kofee/Kofi Anan.

To combat this, Craig came up with an interesting solution: the "Did you mean?" alternative spelling option, based on predictive examples of searcher spelling patterns. You can see this in action if you type in "kofee anan" in Google. Above the search results is a line that reads: "Did you mean: kofi annan" and links to the search results for this spelling variation too.

But the research went even further. Craig's team worked out how to take into account the context of the search query by studying the 2 or 3 other keywords surrounding the query, for example “kofee cup” or “kofee anan”. The research used the science of bigrams and trigrams to better understand how people search. Bigrams are groups of two written letters, two syllables, or two words, very commonly used as the basis for simple statistical analysis of text. So Craig and his team applied this knowledge to Google's spelling correction system and now, Google's algorithm can determine the searcher's intent with much more accuracy, based on the context of the search query.

As an example of the spelling challenges that Google face, Craig showed the audience the huge number of ways "Britney Spears" is misspelled on the web. He said it's encouraging to see that the most popular spelling is also the most correct one. Scale is important!


Google Maps Lead to Apps

The Google team wrote the code for Google Maps many years ago but the code was actually built into your browser. When Google maps first launched, people took the dense data-script and worked out how to reverse engineer it for their own use. Google engineers decided to release an API key to make these mash-ups easier after seeing so many people reverse engineer Google Maps without Google's help. Now people can mash-up Google maps within minutes to create their own applications.

To show how easy this is, Craig took the audience through the steps to create an interactive application with Google Maps. In the space of about 2 minutes, he signed up for an API key, grabbed the HTML code and pasted it into his page. He then hacked the map to show Wellington Town Hall (our location) and made the point how easy it is to create really useful tools out of technology that is already available.

As an example, Craig showed the audience Seattle Bus Monster http://www.busmonster.com/. This site used an API key for Google Maps to make Seattle bus data and tracking available 24/7. Anyone who needs to catch a bus can look online and instantly find their nearest bus location and run to the bus stop in time to catch it. It’s these type of interactive applications that add value to both corporate and government sites. Craig referenced Rodney Brooks from MIT whose provocative paper "Fast, Cheap and Out of Control" offered new logic and a completely different view of machines. The idea is that there is no center of control among robots so you should make lots of them; don't treat them so precious. Craig said developers should use this logic to create lots of small apps that you can replicate and tweak, rather than one big expensive app that can go horribly wrong. Scale trumps smarts every time!


Experiments in Scale That Have Impacted Google's Operations

Precision vs. Recall

Back in the early 90's, information retrieval on the web was limited to things like Lexus/Nexus. So at that stage, Google would take queries and apply it to the broadest possible search. This was great recall at the cost of precision. But Larry and Sergey wanted something better so they decided to use Boolean search. At the time it was heresy because everything was focused on recall. But the Google founders knew that things had to be super relevant so they developed an algorithm - the core algorithm. It was very simple and relied on Boolean search to determine relevancy.


Genomic Sequencing

In the mid 90's a large project - the Human Genome Project - was underway. The race was on to sequence the genome. Scientists decided to feed this out to a bunch of different people. They chopped up the genome for researchers everywhere and allowed it to replicate. The researchers mapped each chunk with genetic markers and computed a tiling path of tiny fragments.

Sequencing was very expensive, so the data was computed based on a minute number of chunks - very labor intensive. The sequencing took forever and reassembling was a long way off. But then a company came along that said they could do it faster. Sequencing becomes cheaper by automating the job using machines rather than individual people so this company used a clever computer algorithm to conduct the sequencing. This reduced the cost and the researchers were therefore able to reassemble more fragments and achieve a rough draft of the genome in 2000. This sequencing approach was the shotgun approach, where accuracy is lower, but the larger scale allowed the impossible to become possible.


Web Definitions

Google used to do a terrible job of defining terms. Craig noticed people were searching for "definition of...", or "what is a...." etc so he wanted the search engine to provide better results for these searches. He found lots of web pages that contained glossaries and definitions, so he hacked up a Perl script to get the glossary formats.

The first recall results were only 50 percent accurate. He wanted to improve this rate, so he did some experiments with the data. But he could never reach an accuracy level he was happy with. It was later he realized that most of the questions people actually needed answers to could be answered with his crappy little Perl script. He concluded that 100 percent accuracy is not important, that scale is much more important.

Now Google allows you to use the "definition:" query and the question format to get definitions from around the web. Type in "what is a blog?" and you'll get lots of results from Craig's original script.


Protein Sequencing

In biology, Craig says, you're constantly producing proteins. The proteins fold up with particular sequencing. Within computing, you can use this knowledge to do amazing things. You can conduct computations with this type of data but it's time consuming. Somebody at Stanford University noticed that proteins spend a lot of time moving about before folding into an alpha helix. So it was suggested they start the computations with lots of configurations. In this way you can parallelize the data by scale and one will be magically close to a folded protein. So they worked out a way to reduce the problem to a simple process based on mass scale. This is why Google uses maximum scale to conduct algorithmic computations.


Chess vs. Go

You can now compute the value of any potential move in chess. Based on that information, you can compute your projected probability of wining the game from any move. Chess grand masters put a lot of time into this knowledge. But the opposite is true for the game Go, because there is more randomness to the game play.

The smart way (Chess)

- study lots of past games
- compute the probability for each position
- compute far into the future

The stupid way (Go)

- pick moves at random
- re-create the Monte Carlo simulation (http://www.busmonster.com/)(a computational algorithm that relies on repeated random sampling for results)
- play like a human

Curiously, the stupid way works better for Go players because it's more logical to compute the data based on the game's inherent randomness.


How Google Applies the Lessons of Scale

So how does Google apply these lessons of scale? For starters, Google does not buy expensive hardware. PCs are unreliable, especially if you have thousands. However, they are cheap and fast. So what's Google's strategy? Craig says they exploit the processing power of off-the-shelf PC hardware and simply make the software more reliable.

Craig revealed that Google buys cheap hardware on a mass scale. The problem is that these cheap processors are notoriously unreliable because they are packed into datacenters by the thousands and they are running 24 hours a day so they get very hot. Commodity hardware therefore fails at an accelerated rate. Once you cope with that realization, you need to design recovery situations to deal with the problem. So Google's software understands that their data can fail at any moment and works harder to cope with that.

For every server at Google, there is another with exactly the same data on it, the same configuration, the same everything: a clone. Replication is needed for scalability so that if requested data isn't fetched instantly, the backup or clone computer is searched instead. The result is that failures don't hurt Google, they only reduce capacity. When hardware crashes or software hangs, there is a time out and a re-issue request. Google has a central control system in place to manage all this.

Cooling failures at Google can be exciting!. Craig recalls the time when the air conditioning failed entirely at one of the datacenters and the monitoring system recognized that the centre was heating up, so they were able to shut down remaining PCs at the datacenter within minutes. The fire brigade turned up and it was quite a big event internally. But the best thing was that nobody using Google even noticed! Because of Google's scalable solution, searchers were unaffected by the major hardware outage.

Craig says that once a week, a person at each data center has a list of all the failed hard disks and walks around the datacenter with a pile of hard drives, replacing them one at a time. Velcro is Google's secret weapon! All Google's hard disks are velcroed in. This allows super quick service and replacement time. So curiously, there is no downside to hardware failures at Google, because they are expected and managed via scale.


Google: The Startup

Craig showed the audience a photo of Google's original PC configuration put together by Larry and Sergey at google.stanford.edu. It consisted of three hard drives and a couple of monitors. Larry and Sergey used Lego to enclose the hard disks and when Lego became too expensive, they used cheap Lego knock-offs! He then showed a picture of Google's first office inside a residential garage and the hard drive racks that they built in a rented datacenter to save money. Larry and Sergey packed the racks together and used layers of cork between the motherboards so they wouldn't explode. Eventually they hired people who knew about safe wiring, but they still used floor fans in the datacenter to try to keep the PCs cool.


Google and the Brady Bunch

Google's Zeitgeist (http://www.google.com/press/zeitgeist.html) pulls together interesting search trends and patterns generated from the billions of searches conducted on Google. Craig is consistently fascinated by search trends and recalls a particular event that sticks in his mind. On the game show Who Wants to Be a Millionaire, the competitor got down to the final question for $1 million and it was: "On the TV show The Brady Bunch, what is Carol Brady's maiden name?" The competitor used his phone-a-friend lifeline and his friend was able to look it up live on Google and provide the competitor with the correct answer, earning him a million dollars.

The next day, out of interest, Google staff looked at the logs for "carol brady maiden name" and saw a huge spike in traffic when the show aired on the West Coast, then another spike when it aired on the East Coast and then a tiny spike when it aired a few hours later in Hawaii.

So Google Trends is a useful tool to study data patterns, but Google keep a bunch of statisticians on staff who check that random effects aren't making the data significant. Craig says that in the same vein, you should look at your site logs and react, but be careful about jumping to conclusions about what the trends say.

At the end of his presentation, I asked Craig whether he is concerned that Google's PageRank algorithm will gradually become less accurate due to the demands of scale. Craig acknowledged that as Google's indexed data grows, user input and search patterns will become increasingly important. He says PageRank will need to learn to become better at providing search results and scale up accordingly. But scale makes things interesting!





How Google Applies Science to Search - To learn more about this author, visit Kalena Jordan's Website.

Like this article? Share it with your friends

Article Feedback
 Article Feedback No article feedback found.
  Leave Your Feedback
article feedback

Article Feedback
John Alexander
John has taught keyword research and SEO skills to small groups of business owners and Webmasters from over 80 different countries world wide since 2002. John is also the Director of Search Engine Academy ; Co-director of Training at Search Engine Workshops offering live, SEO Workshops with his partner SEO educator Robin Nobles, author of the very first comprehensive online search engine marketing courses at SEO Training Online and the SEO Workshop Resource Center.
I look forward to hearing from you! - Visit John Alexander's Website

Stephanie Robey
Stephanie Robey is President and CoFounder of Pivot Positive, LLC - an Internet marketing business focused on helping people start work at home ventures. Previously, she was employed at The Search Agency with over 20 years experience in graphic design and 10 years experience in online marketing. She was responsible for launching the Conversion Path Optimization (CPO) unit where she and her team have conducted hundreds of optimization tests for online companies across multiple verticals.

She is a successful entrepreneur having started and sold 2 companies and remains on the board of directors of the third, PhotoSpin.com   Stephanie began her career in the direct marketing realm creating and producing direct mail for many of the major cable television companies and directly attributes her understanding of Internet marketing to those early offline experiences.  Stephanie is a graduate of San Diego State University with a BFA in Graphic Arts and also holds an Executive MBA from the Graziadio School of Business and Management at Pepperdine University.

Read Steph's Blog
Meet Steph and Dave
Sign up for our Free 7-Day BootCamp: Self Employed & Rich
- Visit Stephanie Robey's Website


To learn more about the Evan Elite Author Program please contact us.

About The Author


Kalena Jordan
(Visit Kalena's Website) Kalena Jordan is the search industry’s first agony aunt, here to answer your tricky questions about search engines. Kalena is Co-Founder and Director of Studies at Search Engine College, an online training institution offering instructor-led short courses and downloadable self-study courses in Search Engine Optimization and other Search Engine Marketing subjects. She’s been marketing websites online since 1996 and blogging about search since 2002. For her full bio, view Kalena’s Linked In Profile. When not working, Kalena likes to go geocaching, hiking or fishing. She also plays a mean darabuka.

Kalena Jordan is a Platinum author on EvanCarmichael.com
About The Author

View Author Blog
View Author Blog

View Author Video
View Author Video

Free Downloads


Kalena Jordan's

Complete
List Of
SEO
Articles

Name
Email
If you enjoyed this article, get Kalena Jordan's Complete List of SEO Articles For FREE!

More Kalena Jordan
How Compatible Is YOUR Site With Search Engines
How to Submit Your Site to Directories
How to Submit Your Site to Search Engines
Are You Targeting The Wrong Keywords
What Do Busy SEM Professionals Keep in Their Toolkit
Social Media The Instant Brand Killer
Web 2Overwhelming 22 Ways to Frustrate Your Site Visitors
How To Make Search Engines Happy In 3 Easy Steps
Why Online Businesses Should be Blogging
What SEOs Expect From Their Clients
Free Downloads


 
 
 


Evan Elite Authors
Kim Castle  
Staging Diva  
Casey Gollan  
Evan Elite Authors

Become An Author
Have you written articles that would be of value to entrepreneurs? Become an expert on our site by publishing them! Expose yourself to a wide audience, drive more traffic to your website and get more sales! Click Here for details.
Become An Author

Evan's Latest Video
Modeling the Masters: Learn the true secrets behind Walt Disney's business success factors & grow your company! Video produced by Phanta Media
Evan's Latest Video

Business Opportunities
"Learn straight from Evan how you can Make a Full Time Income (And More) from a Website"

How to Start An Online Business

Click Here To Learn More
Business Opportunities



Evan's Newsletter
Get advice & tips from famous business owners, new articles by entrepreneur experts, my latest website updates, & special sneak peaks at what's to come!
Name:
Email:
Evan`s Newsletter

Free Downloads
Employee Engagment Icon Employee Engagment
TeleSeminar Etiquette Icon TeleSeminar Etiquette
The Leader’s Edge Icon The Leader’s Edge
5 Step Action Plan Icon 5 Step Action Plan
Start-Up Capital Estimate Icon Start-Up Capital Estimate
Free Downloads - Complete List

Entrepreneur Tools and Guides
Top 50 Political Blogs
Top 50 Political Blogs
Top Political Blogs of 2009
 
Top 50 SEO Posts of the Year
Top 50 SEO Posts - 2008
Top SEO Posts of the Year
 
Entrepreneur Tools and Guides

SEO For Africa
SEO For Africa
Samuel Asimeh Koforidua, Ghana,
Samuel Asimeh
Koforidua, Ghana
SEO For Africa

If I Were A Startup...
Frank Cianciulli, $2.3 to $7.5 Mil in 2 years
Frank Cianciulli
$2.3 to $7.5 Mil in 2 years
Brian Scudamore, $200k to $8 Mil in 5 years
Brian Scudamore
$200k to $8 Mil in 5 years
If I Were A Startup... - Complete List

Famous Entrepreneurs
Oprah Winfrey, Harpo
Wayne Huizenga, Blockbuster
Wayne Huizenga
Blockbuster
Famous Entrepreneurs - Complete List

Entrepreneur Advice
Ask Michael Gerber, Reader Questions
Ask Michael Gerber
Reader Questions
Michael Gerber, The E Myth
Michael Gerber
The E Myth
Entrepreneur Advice - Complete List

Popular Articles
(Premium Authors)

     Logo Wow!
By Heather Mayoros
     Why Use an Advertising Agency
By Heather Mayoros
     Planning Applies to Advertising Spending Too.
By Heather Mayoros

Have A Suggestion?
Toronto Salsa Classes / Toronto Salsa Lessons Email us your ideas on how to make our website more valuable! Thank you Sharon from Toronto Salsa Lessons / Classes for your suggestions to make the newsletter look like the website and profile younger entrepreneurs like Jennifer Lopez and Sean Combs!
Have A Suggestion?

More Evan Carmichael
More Information