Monday, 29 June 2015

Reading a large number of XML files with inconsistent namespaces

I am trying to read data from a large number of XML files into a pandas DataFrame. There are more than 40k files, ranging in size from 1 MB to 10 MB. This is what the file structure looks like:

xml = '''
<ns2:GetProfile xmlns:ns2="http://ift.tt/1RKNso9">
  <StatusMessage/>
  <StatusCode>0</StatusCode>
  <ns2:Profile>
    <ns2:stObservation>
      <Date>2014-11-25</Date>
      <Sequence>18</Sequence>
      <Value>226</Value>
      <NullStatus>0</NullStatus>
      <Peak>true</Peak>
    </ns2:stObservation>
    <ns2:stObservation>
      <Date>2014-01-04</Date>
      <Sequence>13</Sequence>
      <Value>557</Value>
      <NullStatus>0</NullStatus>
      <Peak>false</Peak>
    </ns2:stObservation>
  </ns2:Profile>
</ns2:GetProfile>
'''

Each file contains the time series data of a single object. I'm using lxml's objectify module, like so:

from lxml import objectify, etree
root = objectify.fromstring(xml)

However, the declared namespace is applied to some nodes but not to all of them. So when I try to access Date, it throws an error.

print(root.Profile.stObservation.Date)
#AttributeError: no such child: {http://ift.tt/1HrtxeD
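
The error follows from how objectify resolves attribute access: a child requested without an explicit namespace is looked up in the namespace of its parent, so Date is searched for in the ns2 namespace while the actual Date element has no namespace at all. A minimal sketch that makes the mix visible, assuming the standard library's xml.etree.ElementTree as a stand-in parser and a shortened copy of the sample above:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample document from the question.
xml = '''
<ns2:GetProfile xmlns:ns2="http://ift.tt/1RKNso9">
  <StatusCode>0</StatusCode>
  <ns2:Profile>
    <ns2:stObservation>
      <Date>2014-11-25</Date>
      <Value>226</Value>
    </ns2:stObservation>
  </ns2:Profile>
</ns2:GetProfile>
'''

root = ET.fromstring(xml)

# Namespaced elements appear as '{uri}name', the others as plain 'name'.
tags = [el.tag for el in root.iter()]
for tag in tags:
    print(tag)
```

Only GetProfile, Profile and stObservation carry the namespace URI; StatusCode, Date and Value are printed as bare names.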

Then, when I remove the namespace, it works.

xml_no_ns2 = '''
<GetProfile> 
    <StatusMessage/>
    <StatusCode>0</StatusCode>
    <Profile>
        <stObservation>
            <Date>2014-11-25</Date>
            <Sequence>18</Sequence>
            <Value>226</Value>
            <NullStatus>0</NullStatus>
            <Peak>true</Peak>
        </stObservation>
        <stObservation>
            <Date>2014-01-04</Date>
            <Sequence>13</Sequence>
            <Value>557</Value>
            <NullStatus>0</NullStatus>
            <Peak>false</Peak>
        </stObservation>
    </Profile>
</GetProfile>
'''
root_no_ns2 = objectify.fromstring(xml_no_ns2)
print(root_no_ns2.Profile.stObservation.Date)
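
Editing the namespace out of the text by hand, as above, could also be done on the parsed tree. A rough sketch under some assumptions: it uses xml.etree.ElementTree rather than objectify, the helper name strip_namespaces is made up for illustration, and it only rewrites element tags, not attribute names.

```python
import xml.etree.ElementTree as ET

def strip_namespaces(root):
    """Rewrite every tag in place so that '{uri}name' becomes 'name'."""
    for el in root.iter():
        if isinstance(el.tag, str) and el.tag.startswith('{'):
            el.tag = el.tag.split('}', 1)[1]
    return root

# Same structure as the sample in the question.
xml = '''
<ns2:GetProfile xmlns:ns2="http://ift.tt/1RKNso9">
  <StatusMessage/>
  <StatusCode>0</StatusCode>
  <ns2:Profile>
    <ns2:stObservation>
      <Date>2014-11-25</Date>
      <Sequence>18</Sequence>
      <Value>226</Value>
      <NullStatus>0</NullStatus>
      <Peak>true</Peak>
    </ns2:stObservation>
    <ns2:stObservation>
      <Date>2014-01-04</Date>
      <Sequence>13</Sequence>
      <Value>557</Value>
      <NullStatus>0</NullStatus>
      <Peak>false</Peak>
    </ns2:stObservation>
  </ns2:Profile>
</ns2:GetProfile>
'''

root = strip_namespaces(ET.fromstring(xml))

# After stripping, plain paths match every element, namespaced or not.
rows = [
    {child.tag: child.text for child in obs}
    for obs in root.findall('./Profile/stObservation')
]
print(rows[0]['Date'])
```

Each entry in rows is a dict of strings keyed by tag name, one per stObservation, so type conversion (dates, integers, booleans) would still have to happen afterwards.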

Because of the large number of files, I have little room for workarounds, but I am sure there must be a proper solution.
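
For a batch of this size, one possible direction is to stream each file and match elements by local name, so the namespace never has to be edited at all. This is only a sketch under assumptions: it uses xml.etree.ElementTree.iterparse instead of objectify, the names local_name, iter_observations, read_all and the source_file key are hypothetical, and the temporary directory merely stands in for the real folder of 40k files.

```python
import glob
import os
import tempfile
import xml.etree.ElementTree as ET

def local_name(tag):
    """Return 'name' for both '{uri}name' and plain 'name'."""
    return tag.rsplit('}', 1)[-1]

def iter_observations(path):
    """Yield one dict per stObservation element, whatever its namespace."""
    for _event, el in ET.iterparse(path, events=('end',)):
        if local_name(el.tag) == 'stObservation':
            row = {local_name(child.tag): child.text for child in el}
            row['source_file'] = os.path.basename(path)  # hypothetical extra key
            yield row
            el.clear()  # drop children that have already been read

def read_all(pattern):
    """Collect the observations of every file matching a glob pattern."""
    rows = []
    for path in sorted(glob.glob(pattern)):
        rows.extend(iter_observations(path))
    return rows

# Demo on a single temporary file shaped like the sample in the question.
sample = '''<ns2:GetProfile xmlns:ns2="http://ift.tt/1RKNso9">
  <StatusCode>0</StatusCode>
  <ns2:Profile>
    <ns2:stObservation><Date>2014-11-25</Date><Value>226</Value></ns2:stObservation>
    <ns2:stObservation><Date>2014-01-04</Date><Value>557</Value></ns2:stObservation>
  </ns2:Profile>
</ns2:GetProfile>'''

with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, 'object_1.xml'), 'w') as fh:
        fh.write(sample)
    rows = read_all(os.path.join(tmp, '*.xml'))

print(len(rows))
```

The resulting list of dicts could then be handed to pandas.DataFrame, which accepts a list of dicts directly. All values arrive as strings, so columns such as Date or Value would need converting afterwards.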
