How to Parse the Google Results Page using Regular Expressions Even if You’re a Total PHP n00b

 Google Scraping

There comes a time in the course of every SEO’s life where they find themselves wishing to scrape their own search rankings data rather than use one of the fine commercial tools available for the task.  Fortunately for me, not every SEO is also a hacker, so I do a lot of work with professional online marketing consultants in need of a web programmer who speaks fluent SEO to turn their geeky daydreams into working apps.  But what if you can’t afford me?


Well, there are myriad reasons to write your own script to parse Google search result pages, but it does bring up the sticky issue of how you obtained them.  According to Google’s Terms of Service:

5.3 You agree not to access (or attempt to access) any of the Services by any means other than through the interface that is provided by Google, unless you have been specifically allowed to do so in a separate agreement with Google. You specifically agree not to access (or attempt to access) any of the Services through any automated means (including use of scripts or web crawlers) and shall ensure that you comply with the instructions set out in any robots.txt file present on the Services.

Just look at the SeoQuake and RankQuest SEO tool plugins for a few examples of public, popular, in-your-face violations of this TOS clause and you’ll quickly realize how moot this rule can be, or at least how arbitrarily it may be enforced by Google.  Further, their IP-blocking policies limit the rate at which one IP address (presumably some distinct machine, either server or workstation) can perform search queries, which is a kind of tacit approval of a certain volume of automated queries, don’t you think?  But really, I’m not about to condone violating Google’s TOS, so…

WARNING: The following instructions are for tobacco use only.

Now, assuming that either you or your college intern (or your undocumented computer-literate immigrant, or your Amazon Mechanical Turksters) have meticulously loaded and saved the HTML source of the search engine results pages (SERPs) for each and every term you’re tracking, how do you then mine all that gloriously-raw organic search intelligence into usable numerical rankings, URL lists, trend graphs, competitor discovery and low-hanging-fruit reports, without having to pay and trust your Google-TOS-abiding wage slaves to do it with no margin for error?

Parsing!

Yes, that’s right.  The same technology your browser uses to visually render your search results can also be used to determine rank positions and pull out the URLs and domains from all the tags that make it look pretty, simply by studying the patterns those tags form and modeling it in the elegant voodoo language of regular expressions in order to describe the boundaries around each kind of data such that you can reliably cut out just the information you need.

A Google SERP has only a few key areas of interest to most SEOs doing rank tracking.  Let’s start off with the easiest part, the index count, which occurs only once per page:

 

which is made from these simple ingredients:

<p id=resultStats> Results <b>1</b> - <b>10</b> of about <b>1,730,000</b> <b>English</b> pages for <b>seo programmer</b>. (<b>0.27</b> seconds) </div>

In PHP, using the preg_match() function from PHP’s Perl-Compatible Regular Expression library, you would parse the above like this:

preg_match('@Results (.+?) \- (.+?) of(?: about)? (\S+)@',$html,$matches);

The at signs (@) mark the start and end points of the actual regular expression pattern.  You can use almost any punctuation character as the delimiter, but if it also appears inside the pattern you have to escape it there with a backslash, so pick one the pattern doesn’t use.  I’ve found the at sign to be an excellent choice for this.

Anything in (parentheses) is a subpattern.  These are basically the crosshairs that identify the item you’re actually trying to parse out, and each string matching a subpattern lands in its own array element within $matches for easy access by your script.

The first subpattern above defines the number of the first rank position displayed on this page, so if you’re on page 2 with the default 10-per-page result set, that number will be 11.  Dig?

In the first two subpatterns, “(.+?)”, the period (.) means to match “any one character” and the plus (+) means “one or more times”, so in the case of the first subpattern, it’s looking for one or more characters of any kind appearing after the space following the word “Results”.

Left to its own devices, “.+” is greedy and will swallow as much of the line after “Results ” as it can, so we put a question mark (?) after the plus to say “but don’t be greedy” or “stop at the first occurrence of whatever is to the right of this subpattern” to be exact.  Which is good, because we want to separate the From number from the To number, which are delimited by a space, a dash, and a space (e.g., 1 - 10).  The dash character (-) is a reserved symbol in regular expressions, but only inside square brackets, where it indicates a range of matching characters such as a-z, A-Z or 0-9.  Outside of brackets it’s just a dash, so “escaping” it with a backslash (\), as in “ \- ”, is harmless but not strictly required.

Now in some cases such as with very specific queries, the result set may be small enough for Google to know precisely how many results are indexed, so it may just say “of X” rather than “of about X”.  So how do we account for this variation?  You can say “ about*” but this means “definitely ‘ abou’, then zero or more ‘t’s” because the asterisk refers to the thing it follows, and in this case that’s a ‘t’.  Square brackets won’t help either: “[ about]*” means “any run of spaces, a’s, b’s, o’s, u’s and t’s in any order”, which is not the same thing at all.  You must put the whole optional string in parentheses so the quantifier refers to the whole string, and not just the character before it.  Starting the group with “?:” keeps it from cluttering up $matches, and a question mark after it means “zero or one time”, hence “(?: about)?”.  Oh, the things you learn…

One last wrinkle: the third subpattern is “(\S+)” rather than “(.+?)”.  A lazy subpattern stops as soon as whatever follows it can match, and at the very end of the pattern nothing follows it, so “(.+?)” there would give up after a single character.  “\S+” means “one or more non-whitespace characters”, which grabs the whole count, tags and all, and stops at the next space.
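If you’d like to see these quantifiers misbehave for yourself, here is a minimal sketch you can run from the command line.  The sample text and the variable names are made up for illustration; only preg_match() itself comes from PHP.  It compares a greedy subpattern with a lazy one, shows what a lazy subpattern does when nothing follows it, and makes a whole word optional with a non-capturing group:

```php
<?php
// Made-up sample text with two " - " separators, purely for illustration.
$text = 'Results <b>1</b> - <b>10</b> of about <b>1,730,000</b> pages - more text';

// Greedy: .+ takes as much as it can, so it runs up to the LAST " - ".
preg_match('@Results (.+) - @', $text, $greedy);

// Lazy: .+? takes as little as it can, so it stops at the FIRST " - ".
preg_match('@Results (.+?) - @', $text, $lazy);

// Lazy with nothing after it: a single character satisfies the pattern.
preg_match('@of about (.+?)@', $text, $tail);

// Optional word: (?: about)? matches " about" zero or one time,
// so both phrasings yield the same capture.
preg_match('@of(?: about)? (\S+)@', 'of about <b>5</b>', $a);
preg_match('@of(?: about)? (\S+)@', 'of <b>5</b>', $b);

echo $greedy[1], "\n";          // <b>1</b> - <b>10</b> of about <b>1,730,000</b> pages
echo $lazy[1], "\n";            // <b>1</b>
echo $tail[1], "\n";            // <
echo $a[1], ' / ', $b[1], "\n"; // <b>5</b> / <b>5</b>
```

The third case is the one that bites people: a lazy subpattern always needs something to its right to tell it where to stop.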

The third subpattern is our index count, which may be useful to track over time to show the progress of Google deep-indexing a large site, or to follow the changes to the index due to the evolution of your SEO efforts.

After preg_match() finishes, you’ll have a variable named $matches which looks like this after you var_dump() it:

array(4) {
[0]=> string(54) "Results <b>1</b> - <b>10</b> of about <b>1,730,000</b>"
[1]=> string(8) "<b>1</b>"
[2]=> string(9) "<b>10</b>"
[3]=> string(16) "<b>1,730,000</b>"
}

Now the only thing left to do is strip those <b> tags before you assign these matches to variables in your parser script, which is probably going to build a report, draw a graph, or insert these numbers into a database.   We could have put the <b> tags right in the regular expression, but what if Google decides to subtly alter the SERP markup?  I would assume the phrasing of “Results 1 - 10 of about 1,000,000” is more likely to remain intact long-term than the formatting markup they put around the numbers, so a quick pass through PHP’s strip_tags() function gives you clean-shaven data every time:

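$firstResultPosition = strip_tags($matches[1]); # "1"
$lastResultPosition = strip_tags($matches[2]);  # "10" (this would be "20" on page 2)
$resultsPerPage = $lastResultPosition - $firstResultPosition + 1; # 10
$indexCount = strip_tags($matches[3]);          # "1,730,000"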
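Pulled together, the index-count step might look something like the sketch below.  The function names parseResultStats() and toInt() and the array keys are my own inventions for illustration, and the pattern ends in (\S+) so that the last subpattern has a clear place to stop (at the next space).  Treat it as one possible arrangement under those assumptions rather than the definitive parser:

```php
<?php
// The result-stats markup quoted earlier in the article.
$html = '<p id=resultStats> Results <b>1</b> - <b>10</b> of about '
      . '<b>1,730,000</b> <b>English</b> pages for <b>seo programmer</b>. '
      . '(<b>0.27</b> seconds) </div>';

// Hypothetical helper: strip tags and thousands separators, return an integer.
function toInt($fragment) {
    return (int) str_replace(',', '', strip_tags($fragment));
}

// Hypothetical helper: returns the three numbers, or null when the
// "Results X - Y of [about] Z" phrase is not found in the page.
function parseResultStats($html) {
    if (!preg_match('@Results (.+?) - (.+?) of(?: about)? (\S+)@', $html, $m)) {
        return null;
    }
    return array(
        'first'      => toInt($m[1]), // first rank position on this page
        'last'       => toInt($m[2]), // last rank position on this page
        'indexCount' => toInt($m[3]), // estimated number of indexed results
    );
}

$stats = parseResultStats($html);
echo $stats['first'], ' to ', $stats['last'], ' of ', $stats['indexCount'], "\n";
// 1 to 10 of 1730000
```

Returning null when the phrase is missing gives the calling script a cheap way to notice that Google changed the wording, instead of quietly storing zeros.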

That was easy.  But what about the rankings?

Well, I’m glad I asked.  You see, parsing Google results looks like it should be pretty easy, but they do periodically change the SERP markup to accommodate new features or possibly just to mess with our heads, I don’t know.  The point is, it’s considerably more complicated than it might look when you’re just starting out, thus the following parser is based on a series of nervous breakdowns, er, enlightening encounters with the uncommon conditions which trigger various conditional markup in the SERP.

At any rate, here’s a brief run-down, using the following current #1 Google ranking for “seo programmer”:

[Screenshot: the top-ranking Google result for “seo programmer”]

Which looks like this if you have brains full of silicon and/or Futurama:

<h3 class=r><a href="http://hackingsearch.com/" class=l onmousedown="return clk(this.href,'','','res','1','','0CA4QFjAA')"><em>SEO Programmer</em> Ryan Smith Technical Search Engine Optimization <b>...</b></a></h3><div class="s">Search industry news, <em>SEO</em> research and open-source <em>SEO</em> tools, from the ubergeek perspective.<br><cite>hackingsearch.com/ - </cite><span class=gl><a href="http://74.125.93.132/search?q=cache:uINYMuuiPA0J:hackingsearch.com/+seo+programmer&cd=1&hl=en&ct=clnk&gl=us" onmousedown="return clk(this.href,'','','clnk','1','')">Cached</a> - <a href="/search?hl=en&safe=off&q=related:hackingsearch.com/">Similar</a></span></div>

Fortunately my listing is a simple one, so let’s use this as an example, shall we?  The last time I checked, the following code parsed Google search results just fine:

preg_match_all('@class="?r"?>.+?href="(.+?)".*?>(.+?)<\/a>.+?class="?s"?>(.+?)<cite>.+?class="?gl"?><a href="(.+?)"><\/div><[li|\/ol]@m',$html,$matches,PREG_SET_ORDER);

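Update: the SERP markup changed since publication, so give this version a try instead.  It reads from the same $html variable as before, and the trailing s modifier lets the dot match newlines in case a result block wraps across lines in your saved source:

preg_match_all('@class="g w0"><.*? class="?r"?>.+?href="(.+?)".*?>(.+?)<\/a>.+?class="?s"?>(.+?)<cite>.+?class="?gl"?><a href="(.+?)".+?<\/div>@s',$html,$matches,PREG_SET_ORDER);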

It may be easier to read this if you squint a little, to somehow blur out the syntax and better appreciate the progression.

Ok, basically each organic result block begins with a <div class="g w0">, followed by some tag with class="r" (or class=r) that wraps the title link, whose href is the result URL.  The snippet (and possibly some sitelinks, which we won’t get into here for the sake of simplicity) falls between a tag with class="s" (or class=s) and a <cite> tag.  After that comes a tag with class="gl" (or class=gl), which holds the Cached and Similar links and so marks where the title and snippet end.  The PREG_SET_ORDER constant switches the output format so each result found by preg_match_all() gets its own array element.  Catch all that?

No matter, the code above will generate a $matches array you can loop over like so:

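$results = array();  # start from an empty list
foreach ($matches as $offset => $thisResult) {
     $results[$offset]['url'] = $thisResult[1];                   # href of the title link
     $results[$offset]['title'] = strip_tags($thisResult[2]);     # title text, minus <em> and <b>
     $results[$offset]['snippet'] = strip_tags($thisResult[3]);   # snippet text, minus markup
     $results[$offset]['cacheUrl'] = $thisResult[4];              # href of the Cached link
}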

Now in this, $results[0] is the #1 ranking position, and $results[9] is the 10th position.  Thanks to immutable laws of physics and entropy and warp speeds and such, computers start counting at zero.  Deal with it.
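Here is the whole thing in one runnable piece, as a rough sketch.  The two-result HTML fragment is a trimmed-down, made-up stand-in for a saved SERP (the second listing and both cache URLs are invented), and findRank() is a hypothetical helper of mine that turns the zero-based offset into a human-friendly rank for a given hostname:

```php
<?php
// Made-up, trimmed-down SERP fragment modeled on the listing shown above.
$html = <<<'HTML'
<div class="g w0"><h3 class=r><a href="http://hackingsearch.com/" class=l><em>SEO Programmer</em> Ryan Smith</a></h3><div class="s">Search industry news and <em>SEO</em> tools.<br><cite>hackingsearch.com/ - </cite><span class=gl><a href="http://example.com/cache?q=1">Cached</a> - <a href="/search?q=related:hackingsearch.com/">Similar</a></span></div></div>
<div class="g w0"><h3 class=r><a href="http://www.example.org/seo" class=l>Example <em>SEO</em> Shop</a></h3><div class="s">A second, made-up listing.<br><cite>www.example.org/seo - </cite><span class=gl><a href="http://example.com/cache?q=2">Cached</a></span></div></div>
HTML;

$pattern = '@class="g w0"><.*? class="?r"?>.+?href="(.+?)".*?>(.+?)<\/a>'
         . '.+?class="?s"?>(.+?)<cite>.+?class="?gl"?><a href="(.+?)".+?<\/div>@s';
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$results = array();
foreach ($matches as $offset => $thisResult) {
    $results[$offset] = array(
        'url'      => $thisResult[1],
        'title'    => strip_tags($thisResult[2]),
        'snippet'  => strip_tags($thisResult[3]),
        'cacheUrl' => $thisResult[4],
    );
}

// Hypothetical helper: 1-based rank of the first result whose hostname
// equals $host exactly, or null if that host is not on this page.
function findRank(array $results, $host) {
    foreach ($results as $offset => $result) {
        if (parse_url($result['url'], PHP_URL_HOST) === $host) {
            return $offset + 1; // offsets start at 0, rankings start at 1
        }
    }
    return null;
}

echo findRank($results, 'www.example.org'), "\n"; // 2
```

The exact hostname comparison is deliberately strict, so “example.org” and “www.example.org” count as different sites; loosen it if your reports should treat them as one.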

Unfortunately for you, the above code example does not parse out blended results such as news, images and video, since Google typically encases them all in a table, which requires slightly smarter logic than you see above.  Smarter and more expensive —  I can’t go giving everything away for free, now can I?  So you’re going to get an empty $results element for each blended result and like it. Oh, and sitelinks will appear in your snippet result element, so you might want to parse them out before you strip_tags() the whole thing, that is, if you care about either snippets or sitelinks.

Now it’s up to you to do something useful with $results, like looping over it with a foreach() and applying the information to a database or to some type of data display, which I’m also not going to get into here.   If you know how to do that you’re probably just skimming for the monospace-fonted text anyway…

Are rankings still relevant?

In the emerging age of personalized search results, which remove the guarantee that all users are seeing your site at similar rank positions for the same searches, the debate rages on among SEOs as to whether organic search rankings are still a relevant metric for gauging your site’s search performance.

Despite this likely growth-trend in personalization of results, I firmly hold that unpersonalized organic search rankings, especially when based on the average of multiple, geographically-diverse unpersonalized results, still yield valuable, actionable intelligence when understood as a baseline upon which personalization will factor. If you’re ranking in the top 1-2 positions, aren’t you still that much more likely to show up in a personalized result set for that keyword, for the majority of searchers, than if you were ranking 5-6?

It also wouldn’t hurt to compare rankings (and conversions) between geographic areas, or based on the average position given by various data centers, nor would it sting too badly to compare unpersonalized rankings with personalized results of various search personas, but gathering any rankings data on targeted and well-converting terms upon which to gauge SEO performance relatively over time remains, in my opinion, absolutely necessary for many search professionals.

Happy hacking!
