07 Jun 2011

Google Panda Update: Machine Learning and Tree Structures, oh my!

By Ninja Bonnie

Machine learning is the use of a computer to recognize patterns in data and then make predictions about new data, based on the patterns learned from previously chosen training datasets.

One of the ways Google uses machine learning in search is to analyze historical data from the logs it keeps, in order to predict likely future search behavior and how satisfied a searcher is after clicking a search result and landing on a given page. One of the most common methods search engines use to measure user satisfaction is to compare short clicks and long clicks (as explained by Jim). As Jim explains, a long click is when someone does a search, clicks on a search result link, goes to that page, and does not return to the search engine. This is a good signal to the search engine that the user found what they were looking for. A short click is when a user goes to a page from a search, then returns to the search engine and clicks on another result or does another search. This says that the user did not find what they were looking for on the first page. Yes, there are exceptions, but this is the norm.
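To make the short click / long click idea concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the `Click` record, the `returned_to_results` flag, and the example URLs are all made up, and real search logs surely carry far more detail than this.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Click:
    url: str                    # the result the searcher clicked (made-up examples below)
    returned_to_results: bool   # hypothetical flag: did they come back to the search engine?


def label_click(click):
    """Follow the definition above: coming back is a short click, staying away is a long click."""
    return "short" if click.returned_to_results else "long"


def long_click_share(clicks):
    """For each URL, return the fraction of its clicks that were long clicks."""
    totals = defaultdict(int)
    longs = defaultdict(int)
    for click in clicks:
        totals[click.url] += 1
        if label_click(click) == "long":
            longs[click.url] += 1
    return {url: longs[url] / totals[url] for url in totals}


log = [
    Click("example.com/a", returned_to_results=True),
    Click("example.com/a", returned_to_results=True),
    Click("example.com/a", returned_to_results=False),
    Click("example.com/b", returned_to_results=False),
]
shares = long_click_share(log)
print(shares)
```

In this toy log, example.com/a gets a long click share of one third and example.com/b gets 1.0, so under this simplified definition the second page would look like the more satisfying result.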

Classification Trees

Although the Google Panda Update is new, Google’s use of machine learning is not. Peter Norvig, Director of Research and Development at Google, has been implementing elements of machine learning in search since 2001. The 25 cent explanation of how classification trees are used in machine learning is simple: a classification tree is trained on a ‘training dataset’, usually either a historical dataset or an artificial one mimicking a real one, to ‘learn’ how to classify objects or phenomena. Google uses this kind of data to try to classify pages. By way of comparison, it’s like how a vending machine that accepts coins tells a quarter from a nickel by the diameter of the coin you put in. In the same way, Google wants to classify pages in search results.
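The vending machine comparison can be sketched as the smallest possible classification tree: a single split on coin diameter, learned from a training dataset. This is a toy under stated assumptions, not anything Google runs. The training measurements are invented (they hover around the real diameters of roughly 21.21 mm for a nickel and 24.26 mm for a quarter), and `learn_threshold` and `classify` are names I made up for the example.

```python
# Training dataset: (diameter in mm, label). The individual measurements are invented.
training = [
    (21.1, "nickel"), (21.2, "nickel"), (21.3, "nickel"),
    (24.2, "quarter"), (24.3, "quarter"), (24.4, "quarter"),
]


def learn_threshold(samples):
    """Try the midpoint between each pair of adjacent diameters and keep the one
    that misclassifies the fewest training coins (assumes quarters are the larger coin)."""
    ordered = sorted(samples)
    best_threshold, best_errors = None, len(ordered) + 1
    for (d1, _), (d2, _) in zip(ordered, ordered[1:]):
        if d1 == d2:
            continue
        threshold = (d1 + d2) / 2
        errors = sum((d > threshold) != (label == "quarter") for d, label in ordered)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def classify(diameter, threshold):
    """The learned 'tree': one question about one predictor variable."""
    return "quarter" if diameter > threshold else "nickel"


threshold = learn_threshold(training)
print(threshold, classify(24.26, threshold), classify(21.21, threshold))
```

A real classification tree repeats this kind of split search over many predictor variables and many levels, but the principle of learning the cut points from labeled examples is the same.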

The picture does get a little more complex, because a classification tree has an importantly different decision procedure from a discriminant analysis. I’ll first explain the difference, and then explain why I think it matters for understanding what the goals of the Panda update were.

Discriminant Analysis: A linear discriminant analysis produces a set of coefficients defining a single linear combination of the predictor variables. The score on the linear discriminant function is computed as a composite, considering the scores for all predictor variables simultaneously.

A classification tree, although it also involves decision rules and coefficients just like discriminant analysis, is importantly different. A classification tree has a hierarchical decision procedure: to put it loosely and very simply, things are not all calculated at the same time. Instead, different calculations happen as the algorithm works its way hierarchically through the predictor variables.
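Here is a small Python sketch of that contrast. Everything in it is hypothetical: the predictor names `x1`, `x2`, `x3`, the coefficients, the split values, and the class labels are placeholders I picked to show the two decision procedures side by side.

```python
def discriminant_score(features, coefficients, intercept=0.0):
    """Linear discriminant: every predictor feeds one composite score at the same time."""
    return intercept + sum(coefficients[name] * value for name, value in features.items())


def tree_classify(features):
    """Classification tree: predictors are checked one at a time, and which check
    comes next (if any) depends on the answer to the previous one."""
    examined = ["x1"]
    if features["x1"] <= 0.5:
        return "class A", examined          # stops here; x2 and x3 are never looked at
    examined.append("x2")
    if features["x2"] <= 2.0:
        return "class B", examined
    examined.append("x3")
    label = "class B" if features["x3"] <= 1.0 else "class C"
    return label, examined


coefficients = {"x1": 1.0, "x2": -0.5, "x3": 2.0}   # made-up coefficients
case = {"x1": 0.2, "x2": 4.0, "x3": 1.0}            # made-up observation

score = discriminant_score(case, coefficients)
label, examined = tree_classify(case)
print(score, label, examined)
```

For this observation the discriminant score uses all three predictors, while the tree returns "class A" after looking at `x1` alone. That early exit is the hierarchical behavior I come back to later in the post.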

Contemporary use of Tree Structures in Search Generally

Now, there’s an academic paper co-authored by Biswanath Panda (yes, the same Panda that the Panda update was named after) called “PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce”. This paper discusses classification and regression tree learning on massive datasets, a common data mining task in search broadly, and more specifically presents PLANET as “a scalable tree learner with accuracy comparable to a traditional in-memory algorithm but capable of handling much more training data.”

Of course Google would want a tree learner that can handle its massive search datasets.

There are two important things to note about this paper by Biswanath Panda. First, the authors refer specifically to classification tree structures as part of their study. Second, and I thought this was very interesting, to test the effectiveness of their heavy-duty tree learner PLANET, they tried it on the bounce rate prediction problem. According to Panda and his team:

“We measure the performance of PLANET on bounce rate prediction problem [22, 23]. A click on an sponsored search advertisement is called a bounce if the click is immediately followed by the user returning to the search engine. Ads with high bounce rates are indicative of poor user experience and provide a strong signal of advertising quality. The training dataset (ADCORPUS) for predicting bounce rates is derived from all clicks on search ads from the Google search engine in a particular period. Each record represents a click labeled with whether it was a bounce. A wide variety of features are considered for each click…”

So what this means is that, using PLANET, they were able to successfully scale these complex tree-learning calculations for the bounce rate prediction problem.
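To give a feel for why MapReduce suits tree learning, here is a toy Python sketch of choosing one tree split from click records that are spread across several data partitions. To be clear, this is my own simplified illustration and not the PLANET algorithm from the paper: the single feature (a hypothetical landing page load time in seconds), the candidate thresholds, the records, and the use of Gini impurity are all assumptions.

```python
from collections import Counter


def map_partition(records, thresholds):
    """Map step: for one partition, count clicks and bounces on each side of each threshold."""
    counts = Counter()
    for value, bounced in records:
        for t in thresholds:
            side = "left" if value <= t else "right"
            counts[(t, side, "n")] += 1
            counts[(t, side, "bounces")] += int(bounced)
    return counts


def reduce_counts(partials):
    """Reduce step: add the per-partition counts together."""
    total = Counter()
    for partial in partials:
        total.update(partial)   # Counter.update adds counts
    return total


def gini(n, positives):
    """Gini impurity of a group with `positives` bounces out of `n` clicks."""
    if n == 0:
        return 0.0
    p = positives / n
    return 2 * p * (1 - p)


def best_split(total, thresholds):
    """Pick the threshold whose two sides are purest, weighted by size."""
    def weighted_impurity(t):
        nl, bl = total[(t, "left", "n")], total[(t, "left", "bounces")]
        nr, br = total[(t, "right", "n")], total[(t, "right", "bounces")]
        return (nl * gini(nl, bl) + nr * gini(nr, br)) / (nl + nr)
    return min(thresholds, key=weighted_impurity)


# Each record: (hypothetical load time in seconds, was the click a bounce?)
partitions = [
    [(1.0, False), (1.5, False), (4.0, True)],
    [(2.0, False), (5.0, True), (6.0, True)],
]
thresholds = [1.5, 3.0, 5.0]

total = reduce_counts(map_partition(part, thresholds) for part in partitions)
split = best_split(total, thresholds)
print(split)
```

The point of the design is that no single machine ever needs all the click records in memory: each partition only reports small count tables, and the split decision is made from their sum.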

In his post on the topic, Searching Google for Big Panda and Finding Decision Trees, Bill Slawski connects the previously mentioned paper with the Panda update, stating that

“….while the authors are focusing upon problems in sponsored search with their experimentation, they expect to be able to achieve similarly effective results while working on other problems involving large scale learning problems….the Farmer/Panda update does appear to be one where a large number of websites were classified based upon the quality of content on the pages within those sites.”

My Thoughts On Machine Learning in Organic Search & Panda Update

It sounds like the training dataset was put together by the Google quality raters. In the initial interview with Matt and Amit, Matt said that there was an engineer who came up with a rigorous set of questions, everything from “Do you consider this site to be authoritative? Would it be okay if this was in a magazine? Does this site have excessive ads?” Questions along those lines.

The raters’ results for these pages/sites were probably combined with another group of mixed pages/sites into a training dataset. After probably A LOT of testing and tweaking, the algorithm was let loose on the world to see how accurately it predicted low quality pages and sites. Amit later did offer his 23 guidelines for building a high quality site, which may very well have been some of the questions asked in the real dataset, but this document is still all “concepts”. It is almost as if they are giving us their goals for what they want to algorithmically measure, but they still do not give any specific measurements or tell us any of the real variables involved.

Remember that one feature of classification trees, relative to discriminant analysis, is that they process hierarchically instead of simultaneously. I would conclude that Google is probably using classification trees to check for one set of factors first, then split and check for another set of factors. Depending on how the math works out, the process either goes on to further processing or stops. So, if my prior evaluation is correct, there is a reasonable possibility that the Panda Update includes some kind of classification tree, with a hierarchical decision structure.

Does this mean that some Panda factors are necessarily more important than others? Does this mean that there is an initial combination of factors that exposes your site to further scrutiny, which leads charmingly to getting pooped on by the Panda? What are those factors? It is hard to say exactly without some speculation. My best guess is:

Some % of clickbacks to Google, for phrase a, on a site of at least x size, with y amount of content on the page [content sub-variables t, w], and, on a page level, the page having properties p, q, l in some combination.

The best first step, I think, although by this time I expect everyone hit by Panda has already taken action, is to identify your lowest performing pages that ALSO used to rank well for long tail phrases but then dropped. Once these pages are identified, there are two options: either ax the pages OR change the pages entirely. My point is this, though: there are definitely a small number of factors that came together to bring the wrath of the Panda update on your site. Being proactive by getting rid of the really bad stuff or rethinking the content of your pages is the key to bringing you back. I don’t mean that you suddenly have to become the Stanford Encyclopedia of Philosophy, but do just enough to bring you back. That’s the game, isn’t it? The little changes are everything.

A little after I initially finished this post, I stumbled on this great post, You&A with Matt Cutts. One of the things that popped out at me was this quote, in reply to a question about site usability as a partial ranking factor, with Cutts saying, “Panda is trying to mimic our understanding of whether a site is a good experience or not. Usability can be a key part of that. They haven’t written code to detect the usability of a site, but if you make a site more usable, that’s good to do anyway.”

My big lesson from the research and thinking I published here, and from Matt Cutts’s concession above, is that Google, and search in general, is limited in what it can hope to measure. It is almost philosophical: can a machine really understand what it means to be ‘quality’ from data mining?

*Google Panda Update – Google’s Content guidance and Jim’s Take – Here is a great post decoding the 23 guidelines

Additional Resources:

  • Peter Norvig’s Website: I am a Peter Norvig super fan, check out his website. Brilliant man but not so brilliant at graphic design hehe