XmlDiff TUTORIAL
XmlDiff TUTORIAL
1. Contents
  • Link or reference ("synopsis") to an inexistant target. Synopsis
  • Link or reference ("detailed-example") to an inexistant target. Detailed example
  • Link or reference ("warnings") to an inexistant target. Warnings
  • Link or reference ("references") to an inexistant target. References
  • Link or reference ("feedback") to an inexistant target. Feedback
2. Synopsis
xmldiff [Options] from_file to_file xmldiff [Options] [-r] from_directory to_directory Options: -h, --help display this help message and exit. -V, --version display version number and exit -H, --html input files are HTML instead of XML -r, --recursive when comparing directories, recursively compare any subdirectories found. -x, --xupdate display output following the Xupdate xml specification (see http://www.xmldb.org/xupdate/xupdate-wd.html#N19b1de). -e encoding, --encoding=encoding specify the encoding to use for output. Default is UTF-8 -n, --not-normalize-spaces do not normalize spaces and new lines in text and comment nodes. -c, --exclude-comments do not process comment nodes -g, --ext-ges include all external general (text) entities. -p, --ext-pes include all external parameter entities, including the external DTD subset. --profile=file display an execution profile (run slower with this option), profile saved to file (binarie form).
3. Detailed example
if you process two files file1 and file2 which respectively contain:
<memory> <mailbox path="/var/spool/mail/almaster"/> <server-socket port="7776" recipe="pia.PDA"/> <server-socket port="7777" recipe="proxy.Web proxy"/> <email_addr mine="yes">almaster@logilab.org</email_addr> <junkbuster-method value="18" /> <spoken-languages> <language name="italian" code="it" /> <language name="english" code="fr" /> <language name="english" code="en" /> </spoken-languages> </memory>
and
<memory> <box path="/var/spool/mail/almaster"/> <server-socket port="7776" recipe="pia.PDA"/> <server-socket port="7797" recipe="proxy.Web proxy"/> <email_addr mine="yes">syt@logilab.org</email_addr> <junkbuster-method val="18"> <newson/> </junkbuster-method> <spoken-languages new="new attribute"> <language name="english" code="fr" /> <language code="it" name="italian" /> </spoken-languages> <test> <!-- this is an append test --> hoye! </test> </memory>
executing xmldiff file1 file2 will give the following result:
rename_node, /memory[1]/mailbox[1], box] [insert-after, /memory[1]/junkbuster-method[1], <spoken-languages new="new attribute"> <language code="it" name="italian"/> </spoken-languages> ] [insert-after, /memory[1]/spoken-languages[1], <test> <!-- this is an append test --> hoye! </test> ] [update, /memory[1]/email_addr[1]/text()[1], syt@logilab.org] [rename_node, /memory[1]/junkbuster-method[1]@value, val] [append-first, /memory[1]/junkbuster-method[1], <newson/> ] [move-first, /memory[1]/spoken-languages[2]/language[2], /memory[1]/spoken-languages[1]] [update, /memory[1]/server-socket[2]@port, 7797] [remove, /memory[1]/spoken-languages[2]]
This give you a list of primitives to apply on file1 to obtain file2 (you should obtain file2 after the execution of all this script!). See [4] and [5] for more information. The script above tell you the 9 actions to apply on file1:
  • insert after the node /memory/spoken-languages[0] the below xml subtree:
    <test> <!-- this is an append test --> hoye! </test>
  • rename node /memory/mailbox[0] to "box"
  • append a node <newson> to the node /memory[0]/junkbuster-method[0]
  • append an attribute named "new" with value "new attribute" to the node /memory/spoken-languages[0]
  • update attribute /memory/server-socket[1]@port value to "7797"
  • update text /memory/email_addr/text()[0] to "syt@logilab.org"
  • rename attribute /memory/junkbuster-method[0]@value to "val"
  • move the attributes "code" and "name" from /memory[0]/spoken-languages[0]/language[1] to /memory[0]/spoken-languages[0]/language[0] and rename them to LogilabXmldiffTmpAttr:code and LogilabXmldiffTmpAttr:name
  • move the attributes "code" and "name" from /memory[0]/spoken-languages[0]/language[0] to /memory[0]/spoken-languages[0]/language[1] and rename them to LogilabXmldiffTmpAttr:code and LogilabXmldiffTmpAttr:name
  • remove node /memory/spoken-languages/language[2]
  • rename attributes LogilabXmldiffTmpAttr:code and LogilabXmldiffTmpAttr:name of /memory/spoken-languages/language[0] to name and code
  • rename attributes LogilabXmldiffTmpAttr:code and LogilabXmldiffTmpAttr:name of /memory/spoken-languages/language[1] to name and code
Note all xpath are relative to the file1 with previous steps applied.
if you would have typed "xmldiff -x file1 file2", you would have obtained the same thing described as an Xupdate output (see [3]).
<?xml version="1.0"?> <xupdate:modifications version="1.0" xmlns:xupdate="http://www.xmldb.org/xupdate"> <xupdate:rename name="/memory[1]/mailbox[1]" > box </xupdate:rename> <xupdate:insert-after select="/memory[1]/junkbuster-method[1]" > <xupdate:element name="spoken-languages"> <xupdate:attribute name="new"> new attribute </xupdate:attribute> <language code="it" name="italian"/> </xupdate:element> </xupdate:insert-after> <xupdate:insert-after select="/memory[1]/spoken-languages[1]" > <xupdate:element name="test"> <!-- this is an append test --> hoye! </xupdate:element> </xupdate:insert-after> <xupdate:update select="/memory[1]/email_addr[1]/text()[1]" > syt@logilab.org </xupdate:update> <xupdate:rename name="/memory[1]/junkbuster-method[1]@value" > val </xupdate:rename> <xupdate:append select="/memory[1]/junkbuster-method[1]" child="first()" > <xupdate:element name="newson"> </xupdate:element> </xupdate:append> <xupdate:remove select="/memory[1]/spoken-languages[2]/language[2]" /> <xupdate:append select="/memory[1]/spoken-languages[1]" > <xupdate:element name="language"> <xupdate:attribute name="code"> fr </xupdate:attribute> <xupdate:attribute name="name"> english </xupdate:attribute> </xupdate:element> </xupdate:append> <xupdate:update select="/memory[1]/server-socket[2]@port" > 7797 </xupdate:update> <xupdate:remove select="/memory[1]/spoken-languages[2]" /> </xupdate:modifications>
4. Warnings
  • This version of xmldiff doesn't process the DTD, CDATA and PROCESSING INSTRUCTIONS nodes, so if there is a difference between two document in one of those nodes, xmldiff won't see it.
  • Furthermore, xml namespaces are disabled: <xsl:transform xmlns:xsl="..."/> and <xslt:transform xmlns:xslt="..."/> are seen as different nodes
  • Comparing document bigger than 200Ko can take a few minutes (during tests, it took at about 25 seconds to diff two versions of a 130Ko document on a Celeron 533 box with 256Mo RAM)
  • The execution time is scaled to the number of differences between the documents to compare
  • Finally, a few assumptions have been made to obtain the faster algorithm:
    • there is an ordering <_l on the labels in the shema such that a node with a label l1 can appear as the descendent of a node with a label l2 only if l1 <_l l2
    • for any leaf x from T1, there is at most one leaf y from T2 which can be mapped with x (internally, 2 node may be mapped together if their lcs (longest common subsequence) ratio is greater than 0.6)
5. References
  1. "Tree-to-tree correction for document trees" by D.T. Barnard, G. Clarke, N. Duncan Queen's university, Kingston, Ontario K7L 3N6 (Canada), 1995 The "ezs" algorithm
  2. "Change detection in hierarchically structured information" by S. Chawathe, A. Rajaraman, H. Garcia-Molina, J. Widom Stanford University, 1996 The Fast Match / Edit Script algorithm (fmes), used by default
  3. http://www.w3.org/TR/2000/REC-xml-20001006 XML 1.0 W3C recommendation
  4. http://www.w3.org/TR/xpath XML path language 1.0 W3C recommendation
6. Feedback
xmldiff discussion should take place on the xml-logilab mailing list. Please check http://lists.logilab.org/mailman/listinfo/xml-projects for information on subscribing and the mailing list archives.