MSXML, It's Not Just for VB Programmers Anymore

My co-workers cringe when I tell them the truth. What XML parser are you using? MSXML? With Perl? You’ve gotta be crazy.

Yes, it’s true, but I couldn’t help myself. After test driving MSXML in a Visual Basic application, I had to ask: “I wonder if Perl can use MSXML?”

I have been using MSXML to do my XML parsing in Perl, and the truth is that Perl is excellent for working with Microsoft’s MSXML parser on the Win32 platform. If you use Perl on Win32, give MSXML a try from the comfort of your favorite text editor.

Grab the MSXML Parser

You Grab It

Go to Microsoft’s MSDN site to download the latest version of MSXML, which is the 3.0 Release. Run the installation program and restart your machine. You have now installed the latest version in side-by-side mode, so none of your other Microsoft applications that use previous versions of MSXML will be affected.

Now Let Perl at It

Perl can control the MSXML parser using OLE. As with almost everything in Perl, the difficult part has been done for us: the “kind people at Hip and ActiveWare (ActiveState)” have already provided the Win32::OLE module. The only thing that we need to know is the progID for the MSXML parser. A progID is a string that uniquely identifies an OLE automation class in the Windows registry. MSXML offers version-dependent and version-independent progIDs, depending on how it was installed. Since we have installed MSXML in side-by-side mode, we need to use the version-dependent progID.

Creating an OLE Instance of MSXML.DOMDocument

I begin by using the Win32::OLE module.

use Win32::OLE qw(in with);  # make sure you import 'in' (and 'with')!!
                             # we will need 'in' later.

Now I am ready to use OLE to create an instance of the MSXML parser or, more correctly, an OLE instance of MSXML2.DOMDocument.3.0, which I will simply call a DOMDocument.

# Version dependent method - this is what we want -
  my $DOM_document = Win32::OLE->new('MSXML2.DOMDocument.3.0') 
    or die "couldn't create";

# Version independent method - Assumes MSXML was installed in Replace Mode
# if you get errors with the above example - try using this example.
  my $DOM_document = Win32::OLE->new('MSXML2.DOMDocument') 
    or die "couldn't create";
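
If you would rather not edit the script by hand when the first progID fails, the two progIDs above can be tried in order. The following is only a sketch: create_first_available is a hypothetical helper (it is not part of Win32::OLE), and the Win32-only part is guarded so that the file still compiles on other platforms.

```perl
use strict;
use warnings;

# Hypothetical helper: hand each progID to a factory and return the first
# object it manages to build, together with the progID that worked.
# Returns an empty list when every progID fails.
sub create_first_available {
    my ($factory, @progids) = @_;
    for my $progid (@progids) {
        my $object = $factory->($progid);
        return ($object, $progid) if $object;
    }
    return;
}

# On Win32 the factory is simply Win32::OLE->new.
if ($^O eq 'MSWin32') {
    require Win32::OLE;
    my ($DOM_document, $progid) = create_first_available(
        sub { Win32::OLE->new($_[0]) },
        'MSXML2.DOMDocument.3.0',   # version dependent
        'MSXML2.DOMDocument',       # version independent
    );
    die "couldn't create a DOMDocument: ", Win32::OLE->LastError(), "\n"
        unless $DOM_document;
    print "created $progid\n";
}
```

Win32::OLE->new returns undef when a progID cannot be instantiated, which is what lets the loop move on to the next candidate.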

Parsing the XML

Since I am a swim coach, I keep all kinds of records, times and scores on hand. One of my favorite things to track is records, so I maintain an XML document that contains the school’s top 10 times for each event. Below is what toptimes.xml looks like:

<TOP_TEN_TIMES>
   <EVENT NAME="200 Freestyle">
      <SWIMMER NUMBER="1" TIME="1:51.49" DATE="2/21/98" NAME="Chris Miller"/>
      <SWIMMER NUMBER="2" TIME="1:54.19" DATE="2/17/01" NAME="Peter Myers"/>
 ...
      <SWIMMER NUMBER="10" TIME="2:19.31" DATE="12/8/00" NAME="Andrew Johnson"/>
   </EVENT>
   <EVENT NAME="200 IM">
 ... 
   </EVENT>
   ...
</TOP_TEN_TIMES>

My backstrokers and butterfliers don’t like XML very much, so I want to parse the XML document and print out the top 10 times for the 100 backstroke and the 100 butterfly. I begin by loading toptimes.xml using the DOMDocument object that I have already created. The Load method is where the XML document is actually parsed into its respective pieces, such as Nodes and NodeLists. Validation also occurs at this point. Load returns a boolean that I can use to test whether my document loaded properly. I want to validate my document, so I turn on the validateOnParse property; I also turn off async so that Load does not return until parsing has finished.

 $DOM_document->{async} = 0;            # disable asynchronous loading
 $DOM_document->{validateOnParse} = 1;  # validate while parsing
 my $boolean_Load = $DOM_document->Load("toptimes.xml");
 if (!$boolean_Load) 
 {
   die "toptimes.xml did not load";
 }
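
When Load fails, the DOMDocument's parseError property says why. The sketch below assumes the MSXML parseError object with its errorCode, reason, line and linepos properties; describe_parse_error is a hypothetical helper name, and it relies only on the hash-style property access that Win32::OLE objects support.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a document's parseError into a one-line message.
# Returns an empty string when errorCode is 0 (no error recorded).
sub describe_parse_error {
    my ($document) = @_;
    my $error = $document->{parseError};
    return '' unless $error && $error->{errorCode};

    my $reason = $error->{reason};
    $reason = '' unless defined $reason;
    $reason =~ s/\s+\z//;    # drop the trailing line break, if any

    return sprintf 'line %d, column %d: %s',
        $error->{line}, $error->{linepos}, $reason;
}

# Usage with the DOMDocument created earlier (Win32 only):
#   $DOM_document->Load("toptimes.xml")
#       or die "toptimes.xml did not load: ",
#              describe_parse_error($DOM_document), "\n";
```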

Iterating Through the XML Document

Now that I have successfully loaded the document, I need a way to iterate through all of the document Nodes. To do that, I first need to find the root Node. In this example, the root Node is <TOP_TEN_TIMES>, so I will define $Top_Ten_Times to be the root Node of the XML document as follows:

my $Top_Ten_Times = $DOM_document->DocumentElement();  # assign the root node

Next, I want to find all of the child Nodes of <TOP_TEN_TIMES>. $Events will be all of the root’s child Nodes. In this example, $Events refers to every <EVENT> Node.

my $Events = $Top_Ten_Times->childNodes();      # all of the root's child nodes

$Events is now a NodeList (which is an OLE collection object) that I can use to iterate through each <EVENT> node in the XML document. Veteran Perl programmers will recognize the iteration code as being very similar to iterating through the elements of an array. The only difference is the little keyword 'in' that I mentioned earlier when we loaded Win32::OLE. 'in' is a function exported by Win32::OLE that returns the members of an OLE collection object as a Perl list, which foreach can then walk like any array.

I now iterate over each <EVENT> Node in the document checking each time to see whether I have one of the events that I need. When I arrive at one of the desired events, I will print the NAME Attribute of the current <EVENT> Node and create a new NodeList called $Swimmers. I will then iterate over each <SWIMMER> Node and print the TIME Attribute.

foreach my $Event (in $Events) # make sure you include the 'in'
{
   if ( ($Event->Attributes->getNamedItem("NAME")->{Text} eq "100 Backstroke") ||
        ($Event->Attributes->getNamedItem("NAME")->{Text} eq "100 Butterfly") )
   {
      # print the event name stored in the NAME attribute
      print $Event->Attributes->getNamedItem("NAME")->{Text}, "\n";
      my $Swimmers = $Event->childNodes();   # $Swimmers is now a NodeList collection
      foreach my $Swimmer (in $Swimmers)     # iterate through all swimmers
      {
         print $Swimmer->Attributes->getNamedItem("TIME")->{Text}, "\n";  # print the time
      }
   }
}
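
The loop above repeats the same Attributes->getNamedItem(...)->{Text} chain three times, and the chain dies if a node lacks the attribute or, like a comment node, has no attributes at all. One possible way to tidy this up is a small wrapper. attribute_text is a hypothetical helper, not part of Win32::OLE or MSXML, and this is only a sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text of a named attribute. Returns nothing
# (undef in scalar context) when the node has no attributes or the named
# attribute is missing.
sub attribute_text {
    my ($node, $name) = @_;
    my $attributes = $node->Attributes or return;
    my $item = $attributes->getNamedItem($name) or return;
    return $item->{Text};
}

# Usage inside the loop from the article (needs Win32::OLE's 'in' and a
# loaded document):
#   foreach my $Event (in $Events) {
#       my $name = attribute_text($Event, "NAME");
#       next unless defined $name
#           && ($name eq "100 Backstroke" || $name eq "100 Butterfly");
#       print "$name\n";
#       foreach my $Swimmer (in $Event->childNodes()) {
#           my $time = attribute_text($Swimmer, "TIME");
#           print "$time\n" if defined $time;
#       }
#   }
```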

Transforming the XML

Now that I have satisfied the butterfliers and backstrokers on my team, I am beginning to realize that the design of my XML syntax is less than desirable. Most of my actual data is stored as attribute data, and I would really like it to be element data. I am going to perform a transformation that will place all of my actual data into element data. My goal is to make the XML document look like the following.

<TOP_TEN_TIMES>
   <EVENT>
      <EVENT_NAME>200 Freestyle</EVENT_NAME>
      <SWIM>
         <SWIMMER>Chris Miller</SWIMMER>
         <TIME>1:51.49</TIME>
         <DATE>2/21/98</DATE>
      </SWIM>
      <SWIM>
         <SWIMMER>Peter Myers</SWIMMER>
         <TIME>1:54.19</TIME>
         <DATE>2/17/01</DATE>
      </SWIM>
      ...
   </EVENT> 
   <EVENT>
      ...
   </EVENT>
   ...
</TOP_TEN_TIMES>

After a little work, I come up with the following stylesheet to do the transformation.

<?xml version="1.0" encoding="ISO-8859-1"?> 
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="xml" indent="yes" encoding="ISO-8859-1"/>
<xsl:template match="/">
<TOP_TEN_TIMES>
<xsl:for-each select="TOP_TEN_TIMES">
  <xsl:for-each select="EVENT">
  <EVENT>
     <EVENT_NAME><xsl:value-of select="@NAME"/></EVENT_NAME>
     <xsl:for-each select="SWIMMER">
     <SWIM>
        <SWIMMER><xsl:value-of select="@NAME"/></SWIMMER>
        <TIME><xsl:value-of select="@TIME"/></TIME>
        <DATE><xsl:value-of select="@DATE"/></DATE>
     </SWIM>
     </xsl:for-each>
  </EVENT>
  </xsl:for-each>
</xsl:for-each>
</TOP_TEN_TIMES>
</xsl:template>
</xsl:stylesheet>

To perform the transformation, I will need to create three OLE instances of DOMDocument. The first instance loads the top-times document and the second instance loads the stylesheet from above. The third instance will be used as the result of the transformation. I have created a subroutine that uses what we have already covered to create the three DOMDocument instances, and to load the top-times and stylesheet documents.

sub Transform {
 # Assign the File Names
  my $xml_doc_file     = shift;
  my $stylesheet_file  = shift;
  my $new_xml_doc_file = shift;
  my $boolean_Load;

 # Create the three OLE DOM instances
  my $doc_to_transform = Win32::OLE->new('MSXML2.DOMDocument.3.0');  
  my $style_sheet_doc  = Win32::OLE->new('MSXML2.DOMDocument.3.0');
  my $transformed_doc  = Win32::OLE->new('MSXML2.DOMDocument.3.0');

 # Load the Top Times document - just like above
  $doc_to_transform->{async} = 0;            # load synchronously
  $doc_to_transform->{validateOnParse} = 1;  # validate while parsing
  $boolean_Load = $doc_to_transform->Load($xml_doc_file);
  if(!$boolean_Load)
  {
      die "The Top Times did not load\n";
  }

 # Load the Stylesheet - just like above
  $style_sheet_doc->{async} = 0;            # load synchronously
  $style_sheet_doc->{validateOnParse} = 1;  # validate while parsing
  $boolean_Load = $style_sheet_doc->Load($stylesheet_file);
  if(!$boolean_Load)
  {
      die "The Stylesheet did not load\n";
  }

 #Perform the transformation and save the resulting DOM object
  $doc_to_transform->transformNodeToObject($style_sheet_doc, $transformed_doc);
  $transformed_doc->save($new_xml_doc_file);
}

The transformNodeToObject method is where the magic happens. I use the top-times DOMDocument instance to invoke the transformNodeToObject method and I pass the stylesheet and transformation-result instances as arguments. After the method returns, the result of the transformation is stored in $transformed_doc, which is strictly in memory. We simply call the save method and write the XML document to disk.

Now we can perform transformations using any stylesheet and XML document that we want (as long as the stylesheet matches the structure of the XML document). We just need to pass the subroutine three file names: the name of the document to transform, the name of the stylesheet and the name of the new document. For our example, I will call the subroutine as follows.

Transform("toptimes.xml", "toptimes.xsl", "newtoptimes.xml");

After this code has executed, I have a brand new XML document newtoptimes.xml that conforms to my new XML syntax.

Updating the XML Document

After all this great work, one of my butterfliers informs me that I have been spelling his name wrong all season. I guess that there are two ‘e’s in Meyers, not one. No problem. I can fix this easily enough.

I can’t use the exact code from above because the document structure has changed. Since the swimmers’ names are pretty deep in the new structure, it would be too painful to find the root node and create a slew of nested loops (not to mention expensive for the processor). This is exactly what XPath is for. I will create a query to find all occurrences of “Peter Myers” and change them to “Peter Meyers”. Once again, I create an instance of DOMDocument and load the XML document, this time newtoptimes.xml. However, I now call a new method, selectNodes, directly against the DOMDocument. As an argument, I supply an XPath query. The method returns a NodeList of all the Nodes that matched the XPath query. I can then iterate through the NodeList just like above and update the element data as I go.

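# create a DOMDocument and load the transformed document, just like above
my $new_DOM_document = Win32::OLE->new('MSXML2.DOMDocument.3.0')
     or die "couldn't create";
$new_DOM_document->{async} = 0;
$new_DOM_document->Load("newtoptimes.xml")
     or die "newtoptimes.xml did not load";

my $Peter_Nodes = 
     $new_DOM_document->selectNodes('TOP_TEN_TIMES/EVENT/SWIM/SWIMMER[. = "Peter Myers"]');
foreach my $Peter (in $Peter_Nodes)
{
   $Peter->{nodeTypedValue} = "Peter Meyers";  # update the Value
}
$new_DOM_document->save("newertoptimes.xml"); # save the changes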
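
If the name to look for comes from a variable rather than being typed into the query, it has to be quoted safely, because XPath 1.0 string literals have no escape character. The sketch below uses two hypothetical helpers, xpath_literal and swimmer_query. The concat() fallback assumes that the document's selection language has been switched to XPath with $new_DOM_document->setProperty("SelectionLanguage", "XPath"), since MSXML 3.0 defaults to the older XSLPattern syntax.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap arbitrary text as an XPath 1.0 string literal.
sub xpath_literal {
    my ($text) = @_;
    return qq{"$text"} if index($text, '"') < 0;   # no double quotes inside
    return qq{'$text'} if index($text, "'") < 0;   # no single quotes inside
    # Both kinds of quote appear: split on the double quotes and glue the
    # pieces back together with concat().
    my @parts = map { qq{"$_"} } split /"/, $text, -1;
    return 'concat(' . join(q{, '"', }, @parts) . ')';
}

# Hypothetical helper: the query from the article, with the name supplied
# at run time.
sub swimmer_query {
    my ($name) = @_;
    return 'TOP_TEN_TIMES/EVENT/SWIM/SWIMMER[. = ' . xpath_literal($name) . ']';
}

print swimmer_query('Peter Myers'), "\n";
# prints: TOP_TEN_TIMES/EVENT/SWIM/SWIMMER[. = "Peter Myers"]
```

The result of swimmer_query can be passed straight to selectNodes in place of the hand-written string.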

Conclusion

MSXML isn’t just for Visual Basic and Visual C++ programmers. The Win32::OLE module allows Perl programmers to take advantage of Microsoft’s XML parser from the comfort of their favorite text editor. And now that everyone on the team is happy with the top ten times, I can put away my XML parser until next season.
