Thursday, 16 October 2008

Scraping websites with Zend_Dom_Query

Today I stumbled upon an interesting scenario worth reporting: I had to extract information from the weekly Drum and Bass charts published by BBC 1Xtra. As this information currently isn't available in any consumer-friendly format such as an RSS feed, I had to go the scraping route but didn't want to hassle with a regex approach. The Zend_Dom_Query component was added to the framework in version 1.6.0, mainly to support functional testing of MVC applications, but it can also be used for rolling custom website scrapers in a snap. Woot, perfect match!

The following code snippets show the Bbc_DnbCharts_Scraper class I came up with and an example of its usage. The class uses curl to read the website holding the desired data, which is then passed to Zend_Dom_Query so that queries can be executed against it. The loaded XHTML Document Object Model can be queried with either XPath or CSS selectors. So I had to pick my poison, and decided to go with CSS selectors, as they were best suited to the document at hand and will be more familiar to most jQuery or Prototype users. The query returns a result set of all matching DOMElements, which are then unpuzzled by a private helper method that returns just the desired charts data, as shown in the closing listing. As you can see, the scraping can be implemented with a minimum of effort, and these are exactly the moments I love the Zend Framework for.

<?php
require_once 'Zend/Dom/Query.php';

/**
 * Scraper for the Drum and Bass charts on the BBC 1Xtra website.
 */
class Bbc_DnbCharts_Scraper
{
    private $_url = null;
    private $_xhtml = null;

    /**
     * @param string $url
     */
    public function __construct($url)
    {
        $this->_url = $url;
    }

    /**
     * Scrapes off the drum and bass charts content from the BBC 1Xtra website.
     *
     * @return array
     * @throws Exception
     */
    public function scrape()
    {
        // Exceptions thrown while fetching the page simply bubble up
        $dom = new Zend_Dom_Query($this->_getXhtml());
        $results = $dom->query('div.chart div');
        $chartDetails = array();
        foreach ($results as $index => $result) {
            /* @var $result DOMElement */
            if ($result->nodeValue !== '') { // filter out <br /> element
                $chartDetails[] = $result->nodeValue;
            }
        }
        return $this->_unpuzzleChartDetails($chartDetails, true);
    }

    /**
     * Unpuzzles the chart details and groups them by their chart position,
     * if desired with associative keys.
     *
     * @param array $details
     * @param boolean $associative
     * @return array
     */
    private function _unpuzzleChartDetails(array $details, $associative = false)
    {
        if (0 === count($details)) {
            return array();
        }
        // Index of the last detail (label) belonging to the current chart entry
        $nextChartRank = 2;
        $charts = array();
        $groupedChartDetails = array();

        foreach ($details as $index => $chartDetail) {
            if ($index <= $nextChartRank) {
                $groupedChartDetails[] = $chartDetail;
            }
            if ($index == $nextChartRank) {
                $nextChartRank += 3;
                $charts[] = $groupedChartDetails;
                // Start a fresh group for the next chart entry
                $groupedChartDetails = array();
            }
        }
        if ($associative) {
            $associatives = array('artist', 'tune', 'label');
            foreach ($charts as $chartsIndex => $chart) {
                unset($charts[$chartsIndex]);
                foreach ($chart as $chartIndex => $chartDetails) {
                    $charts[$chartsIndex][$associatives[$chartIndex]] =
                        $chartDetails;
                }
            }
        }
        return $charts;
    }

    /**
     * Gets the XHTML document via curl
     *
     * @return string
     * @throws Exception
     */
    private function _getXhtml()
    {
        $curl = curl_init();
        if (!$curl) {
            // No handle available here, so curl_error() cannot be asked
            throw new Exception('Unable to init curl.');
        }
        curl_setopt($curl, CURLOPT_URL, $this->_url);
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
        // Faking user agent
        curl_setopt($curl, CURLOPT_USERAGENT,
            'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)');
        $xhtml = curl_exec($curl);
        if (!$xhtml) {
            // Read the error before closing the handle
            $error = curl_error($curl);
            curl_close($curl);
            throw new Exception('Unable to read XHTML. ' . $error);
        }
        curl_close($curl);
        return $xhtml;
    }
}

// Usage demo
$scraper = new Bbc_DnbCharts_Scraper('http://www.bbc.co.uk/1xtra/drumbass/chart/');
$charts = $scraper->scrape();
print_r($charts);
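
As a side note, the grouping that _unpuzzleChartDetails() performs by hand could probably also be expressed with PHP's array_chunk() and array_combine(). The following is only a minimal sketch under that assumption, not the code used in the class above; groupChartDetails() is a hypothetical helper name, and the sample values are simply borrowed from the chart extract in this post.

```php
<?php
/**
 * Groups a flat list of chart details into rows of artist, tune and label.
 * groupChartDetails() is a hypothetical helper, not part of the original class.
 *
 * @param array $details
 * @return array
 */
function groupChartDetails(array $details)
{
    $keys = array('artist', 'tune', 'label');
    $charts = array();
    // array_chunk() splits the flat list into groups of three details
    foreach (array_chunk($details, count($keys)) as $group) {
        // An incomplete trailing group is skipped
        if (count($group) === count($keys)) {
            $charts[] = array_combine($keys, $group);
        }
    }
    return $charts;
}

// Sample values only, laid out the way the scraper collects them
$charts = groupChartDetails(array(
    'Chase & Status Ft Plan B', 'Pieces', 'Ram Records',
    'Zen', 'Full Effect', 'Flipmode Audio',
));
```

Each chunk of three details becomes one chart entry keyed by artist, tune and label, which is the same shape the scraper returns in associative mode.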

The closing listing shows an extract of the Drum and Bass charts from BBC 1Xtra, scraped around 16 October 2008.

Array
(
    [0] => Array
        (
            [artist] => Chase & Status Ft Plan B
            [tune] => Pieces
            [label] => Ram Records
        )

    ...

    [9] => Array
        (
            [artist] => Zen
            [tune] => Full Effect
            [label] => Flipmode Audio
        )

)
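
For comparison, here is a rough idea of what the XPath route might look like. This is only a sketch: it uses PHP's built-in DOMDocument and DOMXPath instead of Zend_Dom_Query so that it runs without the framework, and the markup is a hypothetical stand-in that merely mimics the div.chart structure assumed above, not the real BBC page. Zend_Dom_Query offers a queryXpath() method that should accept a similar expression.

```php
<?php
// Hypothetical markup mimicking the structure the scraper assumes
// (a div.chart containing one div per detail); not the real BBC page.
$xhtml = '<html><body><div class="chart">'
       . '<div>Chase &amp; Status Ft Plan B</div>'
       . '<div>Pieces</div>'
       . '<div>Ram Records</div><br />'
       . '</div></body></html>';

$document = new DOMDocument();
$document->loadHTML($xhtml);

$xpath = new DOMXPath($document);
// Roughly what the CSS selector 'div.chart div' expresses: every div
// below a div whose class attribute contains the token "chart".
$expression = '//div[contains(concat(" ", normalize-space(@class), " "), " chart ")]//div';

$chartDetails = array();
foreach ($xpath->query($expression) as $node) {
    /* @var $node DOMElement */
    $chartDetails[] = $node->nodeValue;
}
```

The class test is the verbose part; the CSS selector hides exactly this kind of token matching, which is one reason it reads more easily for this document.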

16 comments:

Toby said...

Why not make it the easy way and use XPath?

Raphael Stolt said...

I chose CSS selectors out of the habit of using the jQuery or Prototype library on a frequent basis, and cuz I was too lazy to look into XPath again for the described purpose; but you are right, there's also an XPath way to go.

Wil Sinclair said...

I know of at least one other person who is using Zend_Dom_Query for screen scraping. It seems to be a great option of last resort for getting data from a website. BTW, BBC is standardizing on Zend Framework for their front-end code. Some of the sites have already been ported to ZF. I wonder if you were using Zend_Dom to scrape a site rendered by ZF. :D

,Wil

Raphael Stolt said...

Yeah I know, I think Federico Cargnelutti blogged about it a while ago. I remember even asking them to provide the charts in an RSS feed, but I guess it wasn't enough of a value-adder to get implemented; so I had to go the 'hard' way.

Dougal said...

I would suggest trying this project; http://simplehtmldom.sourceforge.net/

I did a quick write-up of it here;
http://blog.dougalmatthews.com/2008/08/html-dom-and-easy-screen-scraping-in-php/

It's amazing. jQuery like DOM selectors!

Raphael Stolt said...

Looks like another interesting and considerable approach, but I guess as a Zend Framework 'feen' I will stick to the described approach; but kudos for sharing.

Nevan said...

Thanks for this code. I was having problems with curl downloading the BBC page, so I saved a local copy of the page and queried my own localhost.

$scraper = new Bbc_DnbCharts_Scraper('http://localhost/php/zend/dnb.html');

I later found that if I added an extra curl_opt, the BBC page came through OK:

curl_setopt($curl, CURLOPT_HEADER, 0);

Thanks again for this useful code, I always use PHP with regex to scrape pages, I'll try this approach next time.

Fuller said...

I have implemented a web scraper as a Firefox extension because I want to make full use of the power of the browser. XSLT, XPath, DOM and Mozilla specific technologies are used to transform Web pages and to extract data snippets.

There are some articles at the site:
http://www.gooseeker.com/en/node/product/front

Anonymous said...

Thanks for the nice tutorial.
The results are in plain text, which is fine but i also need to preserve some html tags from the website, for example the and tags
Can you tell me how to achieve that? Is it possible at all? I only need to keep parts of the XHTML and reformat it. Maybe in that case regexes are more appropriate? Thank you

Raphael Stolt said...

Thanks for stopping by and the kudos.

Sorry for the late response. I've quickly tried to get a specific HTML part with Zend_Dom_Query, but I couldn't get it to work (yet). If Ruby is an option for you, have a look at Hpricot, which provides this feature; otherwise I guess you'll have to take that 'old' regex approach.

Elazar said...

As a ZF 'feen,' I'm a bit surprised you resorted to using cURL rather than Zend_Http_Client. :P I find the latter a lot more pleasant to use than the former. Good post though.

Raphael Stolt said...

Hi Elazar,

You're right, the use of Zend_Http_Client would be the consistent approach for a Zend Framework feen; but as I wanted to put the spotlight on Zend_Dom_Query I can live with it.

Anonymous said...

Interesting points on web scrapers, For simple stuff i use python to get or simplify data, data extraction can be a time consuming process but for other projects that include documents, the web, or files i tried "web scraper" which worked great, they build quick custom screen scrapers, web scrapers, and data parsing programs

Term Papers said...

I have been visiting various blogs for my term papers writing research. I have found your blog to be quite useful. Keep updating your blog with valuable information... Regards

pancy1 said...

Hi there, I have approximately the same task and I was wondering where to put the above code you wrote: in a Controller? Model? View?

I am new in Zend Framework...

Raphael Stolt said...

Hi pancy1,

I would put it in the model, or better said, in a Service providing the scraping functionality; definitely not in the view or controller.