Sunday, June 28, 2015

Badly formatted XML files that I need to correct

I have to parse many XML documents like this:

<doc id=lk-20130223040102_592>
<meta-info>
<tag name="date">2013-02-22</tag>
<tag name="source-encoding">ISO-8859-1</tag>
</meta-info>
<text><SE><E type="E:PERSON">Tom Taylor</E>, who runs <E type="E:ORGANIZATION:CORPORATION">MF&B Marine Warehouse</E> in <E type="E:LOCATION:OTHER">Hampton Roads</E>, is already watching contracts with the <E type="E:ORGANIZATION:GOVERNMENT">Navy</E> <E type="E:PER_DESC">dry</E> up at his small ship-repair <E type="E:ORG_DESC:CORPORATION">business</E>.</SE>
</text></doc>
<doc ...</doc>

I made a simple script to parse one of these:

<?php
$xml=simplexml_load_file('wp7-lk-20130223040102.xml');
foreach ($xml->doc as $doc){
    echo $doc['id'];
    echo "<br>";
}
?>

but it produces a series of warnings like this:

Warning: simplexml_load_file(): ^ in C:\wamp\www\parse_xml.php on line 6

I noticed some errors (for example, id=... instead of id="...", and a missing parent element around the doc elements) and corrected what I could, but there are many others.
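
The manual corrections described above could be scripted along these lines. This is only a sketch, not a general XML repair tool: `repair_docs` and the `docs` root element are hypothetical names, the inline sample stands in for the real file, and it assumes the file has no XML declaration at the top. It also escapes bare ampersands, since the sample contains one in "MF&B". Anything the repairs miss is reported with its line number through libxml's error list rather than as PHP warnings.

```php
<?php
// Hypothetical helper (not from the original post): applies the repairs
// visible in the sample before the text is handed to SimpleXML.
function repair_docs(string $raw): string
{
    // Quote bare id values: <doc id=abc> becomes <doc id="abc">.
    $fixed = preg_replace('/<doc\s+id=([^\s">]+)>/', '<doc id="$1">', $raw);
    // Escape ampersands that do not start an entity (e.g. "MF&B").
    $fixed = preg_replace(
        '/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)/',
        '&amp;',
        $fixed
    );
    // Wrap everything in a single root element; "docs" is an assumed name.
    return "<docs>\n" . $fixed . "\n</docs>";
}

// Collect parser errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Inline stand-in for the file contents (assumption, for illustration only).
$raw = '<doc id=lk-1><text>MF&B Marine</text></doc>' . "\n"
     . '<doc id=lk-2><text>Navy</text></doc>';
$xml = simplexml_load_string(repair_docs($raw));

if ($xml === false) {
    // Report whatever is still wrong, with line numbers, for manual fixing.
    foreach (libxml_get_errors() as $error) {
        echo "Line {$error->line}: " . trim($error->message) . "\n";
    }
    libxml_clear_errors();
} else {
    foreach ($xml->doc as $doc) {
        echo $doc['id'], "\n";
    }
}
```

For a real file, the inline sample would be replaced with the result of `file_get_contents()` on that file. The regular expressions only cover the error types visible in the sample, so other kinds of damage would still need their own rules.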

Is there a function that can help me correct XML errors automatically?
