Friday 4 May 2007

Transforming data centered XML into SQL statements

A canny data import technique that emerged from praxis, while working on the import of data-centered XML resources, is utilitizing the abilities of Xslt. The generation of the required SQL statements actually only needs a simple Xsl stylesheet which might import for an PHP XSLTProcessor object or pass to the xsltproc command line tool. Both further described approaches are based upon the libxslt™ library and are assuming the use of XSLT 1.0.

We start off with a basic Xsl stylesheet for generating SQL DELETE and INSERT statements for the data import, followed by an example XML import file.

<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" encoding="UTF-8"/>
<xsl:template match="/">
<xsl:for-each select="partnercrate/partner">
<xsl:variable name="partnerId"><xsl:value-of select="@id"/></xsl:variable>
<xsl:variable name="partnerName"><xsl:value-of select="@name"/></xsl:variable>
DELETE FROM partner_stores WHERE partner_id = '<xsl:value-of select="$partnerId"/>';
<xsl:for-each select="service">
INSERT INTO partner_stores(id, partner_id, name, url_en, service_id, source, timestamp)
VALUES (0, '<xsl:value-of select="$partnerId"/>', '<xsl:value-of select="$partnerName"/>', ... );
</xsl:for-each>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>

<?xml version="1.0" encoding="UTF-8"?>
<partnercrate>
<partner id="100005" name="Fat beat's records">
<website>http://www.fatbeats.com/</website>
<service id="ePCservice:00623">
<url lang="en">http://www.fatbeats.com/rss/fresh-releases</url>
</service>

...

<service id="ePCservice:00274">
<url lang="en">http://www.fatbeats.com/rss/back-instock</url>
<url lang="de"></url>
</service>
</partner>
<partner id="100006" name="Dig a little deeper records">
<website>http://www.digalittledeeper.net/</website>
<service id="ePCservice:00623">
<url lang="en">http://www.digalittledeeper.net/rss/en/new-releases</url>
</service>

...

<service id="ePCservice:00274">
<url lang="en">http://www.digalittledeeper.net/rss/en/back-in</url>
</service>
</partner>
</partnercrate>
Note: The INSERT statement in the above XSL stylesheet has to be an one-liner and is omitted. It's just wrapped and omitted to preserve the page layout.

As indicated there are several ways to apply this Xsl stylesheet and use the transformation results in PHP. One approach is using the XSLTProcessor class of PHP5 as outlined in the next code snippet.
<?php

...

public function __construct() {

$this->_xslFile = 'partnerCrateToSql.xsl';

}

...

/**
* @return string A single SQL query string.
*/
private function transformXmlPartnerImportToSql($xmlFile) {

if(!$this->importFileIsValid($xmlFile)) {

throw new Exception("Invalid import file {$xmlFile} provided");

}

if(false === ($xml = DOMDocument::load($xmlFile))) {

throw new Exception("Unable to load {$xmlFile}");

}

if(false === ($stylesheet = DOMDocument::load($this->_xslFile))) {

throw new Exception("Unable to load {$this->_xslFile}");

}

$processor = new XSLTProcessor;
$processor->importStylesheet($stylesheet);
$processor->registerPHPFunctions();

if(false === ($transformationResult = $processor->transformToXML($xml))) {

throw new Exception("Transformation of {$xmlFile} failed");

}

if(!is_string($transformationResult) || strlen($transformationResult) == 0) {

throw new Exception("Xsl transformation for {$xmlFile} returned no result");

}

// SQL statements for queries against the targeted database
return $transformationResult;

}

...

Another possible option is to rely on available and suitable command line tools to get the job done, according to the Pragmatic Programmer suggestion to "Harness the power of basic tools" from Andrew Hunt and David Thomas. In this case xsltproc comes in very handy to spark off the transformation and to provide the generated SQL statements to the caller. This solution saves some typing, but also builds in a tool and Xslt function dependency.

So here's the code digest showing the xsltproc approach.
<?php

...

public function __construct() {

$this->_xslFile = 'partnerCrateToSql.xsl';

}

...

/**
* @return mixed A collection of SQL queries.
*/
private function transformXmlPartnerImportToSql($xmlFile) {


if(!$this->importFileIsValid($xmlFile)) {

throw new Exception("Invalid import file {$xmlFile} provided");

}

$xsltprocCall = "xsltproc --nonet --novalid $this->_xslFile $xmlFile";
exec(escapeshellcmd($xsltprocCall), $transformationResult, $consoleFeedback);

if($consoleFeedback > 0) {

throw new Exception("Xsltproc failed for {$xmlFile} with error return code {$consoleFeedback}");

}

if(!is_array($transformationResult) || count($transformationResult) == 0) {

throw new Exception("Xsl transformation for {$xmlFile} returned no result");

}

// SQL statements for queries against the targeted database swept against empty values
// due to stylesheet indentation
return array_values(array_filter($transformationResult))

}

...

Both spotted solutions can be applied when the underlying data-centered XML structure is almost fixed, best described and validated through a XML Schema or DTD, and you don't have to build SQL INSERT statements that are heavily dependent from conditionals to resolve within the transfomation process. I guess this is where it gets nasty to maintain and reproduce those conditionals in the XSL stylesheet and a last uncovered option, XML parsing i.e. via SimpleXML, might come into operation.

Furthermore both approaches return the generated SQL in a different data type/format:

+ The XSLTProcessor class approach returns a single SQL string which must be exploded to retrieve the single SQL statements and commited through a execute loop to the database. This is necessary due to the lack of executing a query stack through a single MYSQL_PDO method call. Any comments or hints on this issue are very welcome.

+ The xsltproc approach returns an array already containing all the generated SQL statements, making further processing more comfortable.

Another crux is to quote import values with single or double quotes to avoid the corrosion of the generated SQL statements. This issue is also solved differential, depending on the chosen approach and the supported XSLT version:

+ The XSLTProcessor class approach is using the ability to use PHP functions as XSLT functions. This results in a slightly modification of the Xsl stylesheet and a call of the registerPHPFunctions method on the XSLTProcessor object.

This is the slightly modified Xsl stylesheet for the XSLTProcessor class approach.
<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:php="http://php.net/xsl">
<xsl:output method="text" encoding="UTF-8"/>
<xsl:template match="/">
<xsl:for-each select="partnercrate/partner">
<xsl:variable name="partnerId"><xsl:value-of select="@id"/></xsl:variable>
<xsl:variable name="partnerName"><xsl:value-of select="@name"/></xsl:variable>
DELETE FROM partner_stores WHERE partner_id = '<xsl:value-of select="$partnerId"/>';
<xsl:for-each select="service">
INSERT INTO partner_stores(id, partner_id, name, url_en, service_id, source, timestamp)
VALUES (0, ... , '<xsl:value-of select="php:function('addslashes', string($partnerName))', ... );
</xsl:for-each>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
+ The xsltproc approach makes use of the EXSLT, an extension subset for XSTL, strings module. To make the string XSLT functions available you have to get the string module package and modify the Xsl stylesheet.

This is the modified Xsl stylesheet for the xsltproc approach.
<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:str="http://exslt.org/strings"
extension-element-prefixes="str">
<xsl:import href="str.xsl" />
<xsl:output method="text" encoding="UTF-8"/>
<xsl:template match="/">
<xsl:call-template name="str:replace">
<xsl:with-param name="string" select="string" />
<xsl:with-param name="search" select="object" />
<xsl:with-param name="replace" select="object" />
</xsl:call-template>

<xsl:for-each select="partnercrate/partner">
<xsl:variable name="partnerId"><xsl:value-of select="@id"/></xsl:variable>
<xsl:variable name="partnerName"><xsl:value-of select="@name"/></xsl:variable>
<xsl:variable name="apos">'</xsl:variable>
<xsl:variable name="aposQuoted">\'</xsl:variable>
DELETE FROM partner_stores WHERE partner_id = '<xsl:value-of select="$partnerId"/>';
<xsl:for-each select="service">
INSERT INTO partner_stores(id, partner_id, name, url_en, service_id, source, timestamp)
VALUES ( ... , '<xsl:value-of select="str:replace($partnerName, $apos , $aposQuoted)"/>', ... );
</xsl:for-each>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
+ By the time XSLT 2.0 is more widely supported, it will be possible to use a purely XSLT approach by using the new string functions i.e. fn:replace to achieve quoting or the like.

Regarding boosting and profiling larger dimensioned Xslt transformations, as in the used example, I'd like to suggest the posts on the Liip Blog.

3 comments:

Ambush Commander said...

Be sure to use mysql_real_escape_string() rather than addslashes: you may want to use a custom PHP function so you can easily swap things out based on your DB.

Anonymous said...

Hey, Raphael..

Just came in to thank you for this little experiment.. I had trouble with the XSLT php callback functions .. apparently, I needed to put php-namespace in the declaration..


thanks, man..

Anonymous said...

hey Raphael,

thank you for the informative article. It has provided me with a very clear guideline.
nice work :)