Recently I had some thoughts about how searching for a site's position in search results should work, and since I couldn't stop thinking about it, I decided to come up with what I believe is a good and accurate solution. I wrote most of this class on the plane to Denver, CO on my trusty tablet, using an app called WebMaster's HTML Editor and another called View Web Source to inspect the source of the results pages. Overall, writing a PHP class on an Android tablet works, but I found out that it takes a lot more time than it should because of the lack of a keyboard and a decent copy/paste solution.
Long story short, I created a simple SERP tracker class, and after cleaning up the code written on the plane, I decided to share my experience with you. Before we move on, let me start with some theory:
What is a SERP?
Thanks to Wikipedia I have an easy answer for this:
A search engine results page (SERP) is the listing of web pages returned by a search engine in response to a keyword query. The results normally include a list of web pages with titles, a link to the page, and a short description showing where the keywords have matched content within the page. A SERP may refer to a single page of links returned, or to the set of all links returned for a search query.
What is a SERP Tracker?
The tracker crawls through every page of the search results for a specific keyword and looks for the first appearance of your site, effectively replacing the need to do this by hand.
Ok, so now that we know this, let's start creating the class itself. Every class like this needs to perform at least three basic functions: crawl, parse and find. Below you will find a description of each:
Parse
Gets the array of URLs for the keywords being searched from the crawl() method, fetches each one and passes the resulting HTML back to crawl().
Crawl
Gets the HTML, sends it to the find() method and waits for the result. Based on that result, it decides whether it should request another set of URLs.
Find
Looks through the provided HTML for a specific string (a website URL in our case) and gives the result back to crawl(), which continues searching or stops, depending on the result. This method processes the given HTML differently for each search engine, but always returns the same kind of result: the position of the site (if found), or FALSE. Because of its engine-specific nature, this method is abstract in the parent class.
These functions are generic and are used for every search engine, so we need an abstract class with all the shared requirements, which is later extended for each specific search engine. You can see them in the parent abstract class:
<?php
/**
 * Simple SERP Tracker class
 *
 * http://avoev.com/simple-serp-tracker-php-class
 *
 * @copyright Andrey Voev 2011
 *
 * This program is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program. If not, see <http://www.gnu.org/licenses/>.
 *
 * @author Andrey Voev <andreyvoev@gmail.com>
 * @version 1.0
 */
abstract class Tracker
{
    // the url that we will use as a base for our search
    protected $baseurl;

    // the site that we are searching for
    protected $site;

    // the keywords for the search
    protected $keywords;

    // the current page the crawler is on
    protected $current;

    // starting time of the search
    protected $time_start;

    // debug info array
    protected $debug;

    // the limit of the search results
    protected $limit;

    // proxy file value
    protected $proxy;

    public $found;

    /**
     * Constructor function for all new tracker instances.
     *
     * @param Array  $keywords
     * @param String $site
     * @param Int    $limit OPTIONAL: number of results to search
     * @return Tracker
     */
    function __construct(array $keywords, $site, $limit = 100)
    {
        // the keywords we are searching for
        $this->keywords = $keywords;

        // the url of the site we are checking the position of
        $this->site = $site;

        // set the maximum results we will search through
        $this->limit = $limit;

        // set up the array for the results
        $this->found = array();

        // starting position
        $this->current = 0;

        // start benchmarking
        $this->time_start = microtime(true);

        // set the time limit of the script execution - default is 6 min.
        set_time_limit(360);

        // check if all the required parameters are set
        $this->initial_check();
    }

    /**
     * Initial check if the base url is a string and if it has the required
     * "keyword" and "position" placeholders.
     */
    protected function initial_check()
    {
        // get the model url from the extension class
        $url = $this->set_baseurl();

        // check if the url is a string
        if(!is_string($url)) die("The url must be a string");

        // check if the url has the keyword and position placeholders in it
        $k = strpos($url, 'keyword');
        $p = strpos($url, 'position');
        if($k === FALSE || $p === FALSE) die("Missing keyword or position parameter in URL");
    }

    /**
     * Set up the proxy if used.
     *
     * @param String $file OPTIONAL: if filename is not provided, the proxy will be turned off.
     */
    public function use_proxy($file = FALSE)
    {
        // the name of the proxy txt file if any
        $this->proxy = $file;
        if($this->proxy != FALSE)
        {
            if(file_exists($this->proxy))
            {
                // get a proxy from a supplied file
                $proxies = file($this->proxy);

                // select a random proxy from the list
                $this->proxy = $proxies[array_rand($proxies)];
            }
            else
            {
                die("The proxy file doesn't exist");
            }
        }
    }

    /**
     * Fetch every URL in parallel with cURL and return the resulting HTML
     * to the crawler.
     *
     * @param Array $single_url OPTIONAL: override the default url array
     * @return Array $result
     */
    protected function parse(array $single_url = NULL)
    {
        // array of curl handles
        $curl_handles = array();

        // data to be returned
        $result = array();

        // multi handle
        $mh = curl_multi_init();

        // check if another set of URLs is supplied
        $urls = ($single_url == NULL) ? $this->baseurl : $single_url;

        // loop through the urls, create curl handles and add them to the multi-handle
        foreach($urls as $id => $d)
        {
            $curl_handles[$id] = curl_init();
            $url = (is_array($d) && !empty($d['url'])) ? $d['url'] : $d;
            curl_setopt($curl_handles[$id], CURLOPT_URL, $url);
            curl_setopt($curl_handles[$id], CURLOPT_HEADER, 0);
            curl_setopt($curl_handles[$id], CURLOPT_RETURNTRANSFER, 1);

            if($this->proxy != FALSE)
            {
                // use the selected proxy
                curl_setopt($curl_handles[$id], CURLOPT_HTTPPROXYTUNNEL, 0);
                curl_setopt($curl_handles[$id], CURLOPT_PROXY, $this->proxy);
            }

            // is it post?
            if(is_array($d) && !empty($d['post']))
            {
                curl_setopt($curl_handles[$id], CURLOPT_POST, 1);
                curl_setopt($curl_handles[$id], CURLOPT_POSTFIELDS, $d['post']);
            }

            curl_multi_add_handle($mh, $curl_handles[$id]);
        }

        // execute the handles
        $running = null;
        do
        {
            curl_multi_exec($mh, $running);
        } while($running > 0);

        // get content and remove handles
        foreach($curl_handles as $id => $c)
        {
            $result[$id] = curl_multi_getcontent($c);
            curl_multi_remove_handle($mh, $c);
        }

        // close curl
        curl_multi_close($mh);

        // return the resulting html
        return $result;
    }

    /**
     * Crawl through every page and pass the result to the find function
     * until all the keywords are processed or the limit is reached.
     */
    protected function crawl()
    {
        $this->setup();
        $html = $this->parse();
        $i = 0;
        foreach($html as $single)
        {
            $result = $this->find($single);
            if($result !== FALSE)
            {
                if(!isset($this->found[$this->keywords[$i]]))
                {
                    $this->found[$this->keywords[$i]] = $this->current + $result;

                    // save the time it took to find the result with this keyword
                    $this->debug['time'][$this->keywords[$i]] = number_format(microtime(true) - $this->time_start, 3);
                }

                // remove the keyword from the haystack
                unset($this->keywords[$i]);
            }
            $i++;
        }

        if(!empty($this->keywords) && $this->current <= $this->limit)
        {
            // move on to the next results page and search again
            $this->current += 10;
            $this->crawl();
        }
    }

    /**
     * Prepare the array of URLs for every run, one per remaining keyword.
     */
    protected function setup()
    {
        // re-index the remaining keywords so they line up with the new url array
        $this->keywords = array_values($this->keywords);

        // prepare the url array for the new loop
        unset($this->baseurl);
        foreach($this->keywords as $keyword)
        {
            $url = $this->set_baseurl();
            $url = str_replace("keyword", $keyword, $url);
            $url = str_replace("position", $this->current, $url);
            $this->baseurl[] = $url;
        }
    }

    /**
     * Start the crawl/search process.
     */
    function run()
    {
        $this->crawl();
    }

    /**
     * Return the results from the search.
     *
     * @return Array $this->found
     */
    function get_results()
    {
        return $this->found;
    }

    /**
     * Return the debug information - time taken, etc.
     *
     * @return Array $this->debug
     */
    function get_debug_info()
    {
        return $this->debug;
    }

    /**
     * Set up the base url for the specific search engine, using "keyword" and
     * "position" as placeholders in the template.
     *
     * @return String $baseurl
     */
    abstract function set_baseurl();

    /**
     * Find the occurrence of the site in the results page. Specific for every search engine.
     *
     * @param String $html the results page to search through
     * @return Mixed the position within the page, or FALSE if not found
     */
    abstract function find($html);
}
?>
Let me briefly describe each of the functions in this class:
- __construct()
It sets the basic parameters such as the keywords, the URL of the site we are searching for, the limit of results to search through and the start time of the execution. At the end, it runs initial_check().
- initial_check()
Makes sure that the URL template supplied by the child class via the abstract method set_baseurl() contains the required placeholders "keyword" and "position". This URL is used as a template to generate the actual URLs for the crawl() method. If the requirements are not met, it stops execution.
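For example, with the Google template used later in this article, the first string below passes the check, while the second (a hypothetical broken template) stops the script:

<?php
// contains both "keyword" and "position": passes initial_check()
$ok = "http://www.google.com/search?q=keyword&start=position";

// hypothetical template missing the "position" placeholder:
// initial_check() dies with "Missing keyword or position parameter in URL"
$bad = "http://www.google.com/search?q=keyword";
?>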
- use_proxy($file = FALSE)
Making a lot of requests to a search engine raises a red flag, so eventually you will get a 302 redirect (Google redirects the user to a CAPTCHA page to make sure the user is not a bot). One of the most effective ways to combat this is to use a proxy. If you call this method, supplying a txt file with proxy IPs, the class will pick a random line from it before making the request to the search engine.
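The proxy file is read with file(), so it should simply hold one proxy per line. A quick sketch, using the GoogleTracker subclass defined later in this article and placeholder addresses:

<?php
// proxy.txt is expected to contain one proxy per line, e.g.:
//   192.0.2.10:8080
//   192.0.2.11:3128
// (the addresses above are documentation placeholders, not real proxies)
$test = new GoogleTracker(array('git'), 'www.kernel.org', 50);
$test->use_proxy('proxy.txt'); // picks a random line from the file
$test->run();
?>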
- parse(array $single_url = NULL)
One of the most important functions in the class: it initializes a cURL multi handle, which lets us perform multiple requests to the search engine in parallel. It uses $this->baseurl, which contains an array of pre-built URLs, one for every keyword supplied in the constructor, and returns an array with the HTML of every result page. We can override $this->baseurl by supplying $single_url as an argument.
- crawl()
Another important method, mentioned earlier: it takes the resulting array of the parse() method and passes every HTML string from it to the find() method in the child class. Based on the result, it either ends the search for a specific keyword and removes it from the keyword list, or grabs the next HTML string from the parse() result and feeds it to find(). It calls itself recursively, advancing the current page of the search, until it finds all the keywords or hits the limit of results set in the constructor as $limit.
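The absolute position it records is simply the offset of the current results page plus the in-page position returned by find(); a quick illustration of that bookkeeping:

<?php
// how crawl() computes the recorded position:
$current = 20;            // offset of the third results page (0, 10, 20, ...)
$in_page = 4;             // find() located the site as the 4th result on it
echo $current + $in_page; // 24: the overall position stored in $this->found
?>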
- setup()
All this method does is build the array of URLs for the current run. The initial array is built from the keywords passed to the constructor; later in the process, only the keywords that have not been found yet are used for each run of the crawl() method.
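As a quick illustration of the substitution (the Google template is from later in the article; the second keyword is made up):

<?php
// how setup() turns the template into concrete URLs for one run:
$template = "http://www.google.com/search?q=keyword&start=position";
$current  = 10; // second results page
foreach(array('git', 'linux') as $keyword)
{
    $url = str_replace("keyword", $keyword, $template);
    $url = str_replace("position", $current, $url);
    echo $url . "\n";
}
// http://www.google.com/search?q=git&start=10
// http://www.google.com/search?q=linux&start=10
?>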
- run()
This simply starts the crawl() process. One of the few public methods in the class.
- get_results()
Returns the array with the results from the search.
- get_debug_info()
Returns an array with some debug info; in this case, the time it took for each keyword to be found.
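With made-up values, the two arrays are shaped like this (assuming a finished tracker instance $test):

<?php
print_r($test->get_results());
// Array ( [git] => 3 )
// the key is the keyword, the value is the absolute position

print_r($test->get_debug_info());
// Array ( [time] => Array ( [git] => 1.234 ) )
// seconds from the start of the run until the keyword was found
?>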
We also have two abstract methods:
- set_baseurl()
The URL for every search engine is different, and so is the syntax of the search parameters. To keep the class generic, this method should return a string containing the two placeholders "keyword" and "position", which are later replaced with the actual values in setup() to build each specific URL.
- find($html)
Every search engine returns its results differently, so this method takes a generic HTML string and looks for the specific URL of the site. Sometimes it is better to use a regex, sometimes it is better to traverse the DOM.
Whatever the case, for every search engine the outcome is the same: either the site is found on the page or it is not. This method should return FALSE (if nothing is found) or the position of the site within the current HTML page.
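For instance, a regex-based find() for an engine that wraps each result URL in a <cite> tag (an assumption you would need to verify against the actual markup) could look like this inside a concrete subclass:

<?php
// a regex-based find() sketch for a Tracker subclass
function find($html)
{
    // collect the contents of every <cite> tag on the page, in order
    if(!preg_match_all('#<cite[^>]*>(.*?)</cite>#is', $html, $matches))
    {
        return FALSE;
    }
    $current = 1;
    foreach($matches[1] as $cite)
    {
        // keep only the domain portion of the cite text
        $parts = preg_split('#[\s/]#', strip_tags($cite));
        if($parts[0] == $this->site)
        {
            return $current; // 1-based position within the current page
        }
        $current++;
    }
    return FALSE;
}
?>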
So let's say I want to create a SERP tracker for Google: all I need to do is extend the abstract class and implement the two methods, set_baseurl() and find(). The class will do the rest:
<?php
class GoogleTracker extends Tracker
{
    function set_baseurl()
    {
        // use "keyword" and "position" to mark the position of the variables in the url
        $baseurl = "http://www.google.com/search?q=keyword&start=position";
        return $baseurl;
    }

    function find($html)
    {
        // process the html and return either the numeric position of the site
        // on the current page or FALSE
        $dom = new DOMDocument();
        @$dom->loadHTML($html);
        $nodes = $dom->getElementsByTagName('cite');

        // start counting the results from the first result on the page
        $current = 1;
        foreach($nodes as $node)
        {
            $node = $node->nodeValue;

            // breadcrumb-style results look like: cmsreport.com › Blogs › Bryan's blog
            if(preg_match('/\s/', $node))
            {
                $site = explode(' ', $node);
            }
            else
            {
                $site = explode('/', $node);
            }

            // stop at the first appearance of the site
            if($site[0] == $this->site)
            {
                return $current;
            }
            $current++;
        }
        return FALSE;
    }
}
?>
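Supporting another engine is just as short. A hypothetical Bing variant might look like the sketch below; note that both the URL parameters and the <cite> markup here are assumptions that you would need to verify before using it:

<?php
class BingTracker extends Tracker
{
    function set_baseurl()
    {
        // "first" is assumed to be Bing's result-offset parameter
        return "http://www.bing.com/search?q=keyword&first=position";
    }

    function find($html)
    {
        $dom = new DOMDocument();
        @$dom->loadHTML($html);

        // assumes Bing also wraps result URLs in <cite> tags
        $nodes = $dom->getElementsByTagName('cite');
        $current = 1;
        foreach($nodes as $node)
        {
            $site = explode('/', $node->nodeValue);
            if($site[0] == $this->site)
            {
                return $current;
            }
            $current++;
        }
        return FALSE;
    }
}
?>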
And you can use the GoogleTracker like this:
<?php
$test = new GoogleTracker(array('git'), 'www.kernel.org', 50);
//$test->use_proxy('proxy.txt');
$test->run();

print_r($test->get_results());
echo "================<br>";
print_r($test->get_debug_info());
?>
This will look for 'git' in the first 50 results on Google and report the position and the time it took to find it. I hope this article helps you. If you can think of ways to improve it, please leave a comment; or if you'd like to lend a hand, simply fork my repository below, hack away and contact me when you'd like to merge something.