I am trying to read data from a large number of XML files into a pandas DataFrame. There are more than 40k files, ranging in size from 1 MB to 10 MB. This is what the file structure looks like:
xml = '''
<ns2:GetProfile xmlns:ns2="http://ift.tt/1RKNso9">
<StatusMessage/>
<StatusCode>0</StatusCode>
<ns2:Profile>
<ns2:stObservation>
<Date>2014-11-25</Date>
<Sequence>18</Sequence>
<Value>226</Value>
<NullStatus>0</NullStatus>
<Peak>true</Peak>
</ns2:stObservation>
<ns2:stObservation>
<Date>2014-01-04</Date>
<Sequence>13</Sequence>
<Value>557</Value>
<NullStatus>0</NullStatus>
<Peak>false</Peak>
</ns2:stObservation>
</ns2:Profile>
</ns2:GetProfile>
'''
Each file represents time series data of a single object. I'm using the objectify function from lxml, like so:
from lxml import objectify, etree
root = objectify.fromstring(xml)
However, a namespace is declared that applies to some nodes but not all of them, so trying to access Date throws an error.
print(root.Profile.stObservation.Date)
#AttributeError: no such child: {http://ift.tt/1HrtxeD
Then, when I remove the namespace, it works.
xml_no_ns2 = '''
<GetProfile>
<StatusMessage/>
<StatusCode>0</StatusCode>
<Profile>
<stObservation>
<Date>2014-11-25</Date>
<Sequence>18</Sequence>
<Value>226</Value>
<NullStatus>0</NullStatus>
<Peak>true</Peak>
</stObservation>
<stObservation>
<Date>2014-01-04</Date>
<Sequence>13</Sequence>
<Value>557</Value>
<NullStatus>0</NullStatus>
<Peak>false</Peak>
</stObservation>
</Profile>
</GetProfile>
'''
root_no_ns2 = objectify.fromstring(xml_no_ns2)
print(root_no_ns2.Profile.stObservation.Date)
Because of the large number of files, I don't have much flexibility for workarounds, but I am sure there must be a proper solution.
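One possible direction, sketched here rather than offered as the definitive fix: attribute access in objectify looks up a child in its parent's namespace, which would explain why Profile and stObservation resolve while Date does not. The sketch below assumes that Date, Sequence, Value, NullStatus and Peak are always present and carry no namespace, as in the sample above. The function name observations_to_rows and the FIELDS tuple are made up for illustration. It uses find(), which matches the bare tag name, for the un-prefixed children.

```python
from lxml import objectify

# Child tags of each stObservation, taken from the sample document.
# Assumption: every observation contains all five, with no namespace.
FIELDS = ("Date", "Sequence", "Value", "NullStatus", "Peak")


def observations_to_rows(xml_text):
    """Return one dict per stObservation element (hypothetical helper)."""
    root = objectify.fromstring(xml_text)
    rows = []
    # Profile and stObservation share the root's namespace, so plain
    # attribute access resolves them. Iterating an objectify element
    # yields it and its same-tag siblings.
    for obs in root.Profile.stObservation:
        # find() with a bare tag name matches children that have no
        # namespace; .pyval gives the value objectify inferred
        # (int, bool, str, ...).
        rows.append({name: obs.find(name).pyval for name in FIELDS})
    return rows
```

Calling observations_to_rows(xml) on the first sample should give a list of two dicts, which can then be passed to pandas.DataFrame(rows). If a field can be missing in some files, find() returns None and the .pyval access would need a guard.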