3 Steps to Master XML Migrations
Every once in a while, a non-standard data migration can come down the pipeline and absorb a few days work. If this has happened to you, you know how crucial formatting and primary keys are a successful migration. Recently, my team and I tackled an XML file migration that was not only large, but lacked primary keys to link up each of the separate elements.
At first glance, processing a 3 million-line file and adding primary keys without developer support seemed like an impossible task. But nothing is impossible my friends, and we found a non-code solution to the problem.
Below are 3 steps to a successful XML migration:
- Use Microsoft Access
To process an XML into CSV format we first need to use Microsoft Access. Access has the capabilities to parse out the tables within an XML file and export them into other file formats.
- First, upload the file directly into Access with the “Data and Structure”. This way we can diagnose any import errors.
- If there are import errors, reconfigure the field types in Design Mode for the culprit table. Here, it can be handy to change everything to “Long Text” so that any values that can be, will be pulled into the database.
- Finally, clear out the tables, and re-import the file with (hopefully) no import errors.
- Transform the XML
The biggest problem arises when you have multiple nodes within the XML file. When Access breaks these out into separate tables, it does not provide a Primary Key or relationship to connect one record to another. We must transform the XML to represent this relationship before we insert it into Access.To do this we use a transformation style sheet called an “xslt” file and use it when importing the data. A simple “xslt” file to add a primary key to a “Comments” table looks like this:Note: This can be simply copy and pasted into a word processor and edited to fit the needs of whatever transform. Save the file with the file extension “.xslt”.
<xsl:stylesheet version=”1.0″ xmlns_xsl=”http://www.w3.org/1999/XSL/Transform”>
<xsl:strip-space elements=”*”/>
<xsl:template match=”@*|node()”>
<xsl:copy>
<xsl:apply-templates select=”@*|node()”/>
</xsl:copy>
</xsl:template>
<xsl:template match=”comment”>
<xsl:copy>
<xsl:apply-templates select=”@*”/>
<xsl:copy-of select=”ancestor::ticket/ticket-id”/>
<xsl:apply-templates select=”node()”/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
While this looks like a lot of code, the foundation is simple and the results can be debugged by checking for errors in the data output. Let’s break it down:
- The first three lines initialize the style sheet and are standard for this type of transform.
- After this, the first xsl:template tag-set tells Access to copy all of the data from each element.
- When the transformation reaches the second xsl:template set, it matches any “comment” elements, copies all of the data, then adds a new field based on the ancestor element. In this case, the ancestor element has a primary key called “ticket-id” that will be pulled into the Comments table.
- Apply the new stylesheet to the original XML file
To apply this new stylesheet to the original XML file we first have to clear out the Access database again. This time, when importing the file, we will use the transform button to select our stylesheet. This processing could take some time, and, if the file is too large, with fail due to insufficient memory. To overcome this additional issue, you can split up the file using any myriad of free XML splitters. I’ve used CodeProject or GitHub.
Then, simply split the file and import the parts into Access using the same stylesheet. Export the resulting table(s) into Excel format and finish doing any other data cleansing or preparation as normal.
If you have any questions, let me know by commenting below or contact OpFocus for help.