Create a dataframe from an XML file with the paths and the values

Question

Here is the data from the XML file:

<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP-ENV:Header />
  <SOAP-ENV:Body>
    <ADD_LandIndex_001>
      <CNTROLAREA>
        <BSR>
          <status>ADD</status>
          <NOUN>LandIndex</NOUN>
          <REVISION>001</REVISION>
        </BSR>
      </CNTROLAREA>
      <DATAAREA>
        <LandIndex>
          <reportId>AMI100031</reportId>
          <requestKey>R3278458</requestKey>
          <SubmittedBy>EN4871</SubmittedBy>
          <submittedOn>2015/01/06 4:20:11 PM</submittedOn>
          <LandIndex>
            <agreementdetail>
              <agreementid>001       4860</agreementid>
              <agreementtype>NATURAL GAS</agreementtype>
              <currentstatus>
                <status>ACTIVE</status>
                <statuseffectivedate>1965/02/18</statuseffectivedate>
                <termdate>1965/02/18</termdate>
              </currentstatus>
              <designatedrepresentative></designatedrepresentative>
            </agreementdetail>
          </LandIndex>
        </LandIndex>
      </DATAAREA>
    </ADD_LandIndex_001>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>

I want to save in a dataframe (1) the path and (2) the text of the element at that path, and only for elements that contain a value. I would like to get something like this:

                                             Path      Value
0    Body/ADD_LandIndex_001/CNTROLAREA/BSR/status        ADD
1      Body/ADD_LandIndex_001/CNTROLAREA/BSR/NOUN  LandIndex
2  Body/ADD_LandIndex_001/CNTROLAREA/BSR/REVISION        001

I have this small piece of code, but it does not work. It returns an empty dataframe, even though the print(d) inside the function shows that each element is picked up correctly. I don't see what is wrong. Can anyone explain why the result is empty?

from lxml import etree as et
from collections import defaultdict
import pandas as pd
import os


filename = 'file_try.xml' 
namespace = '{http://schemas.xmlsoap.org/soap/envelope/}'

with open(filename, 'rb') as file: 
    root = et.parse(file).getroot()
    
tree = et.ElementTree(root) 

col_name = ['Path', 'Value']
dataF = pd.DataFrame([],columns = col_name)

def traverse(el,d):
    
    if len(list(el)) > 0:
        for child in el:
            traverse(child,d)

    else:

        if el.text is not None:
            d = d.append({'Path': tree.getelementpath(el).replace(namespace,''), 'Value' : el.text }, ignore_index = True)
            print(d)
            
    return d

df = traverse(root,dataF)
print(df)

df.to_excel("data_2.xlsx") 

Recommended answer

Try this.

from simplified_scrapy import SimplifiedDoc, utils
rows = []
rows.append(['Path', 'Value'])
xml = utils.getFileContent('file_try.xml')
doc = SimplifiedDoc(xml)
body = doc.select('SOAP-ENV:Body')

def getPathValue(node, path):
    path = path + '/' + node['tag']  # Append this node's tag to the path
    children = node.children
    if children:
        traverseNodes(children, path)
    else:
        rows.append([path, node.text])

def traverseNodes(nodes, path):
    for node in nodes:  # Traversing child nodes
        getPathValue(node, path)

traverseNodes(body.children, "Body")

# print(rows)
utils.save2csv('data_2.csv', rows)

Result:

[['Body/ADD_LandIndex_001/CNTROLAREA/BSR/status', 'ADD'], ['Body/ADD_LandIndex_001/CNTROLAREA/BSR/NOUN', 'LandIndex'], ['Body/ADD_LandIndex_001/CNTROLAREA/BSR/REVISION', '001'], ['Body/ADD_LandIndex_001/DATAAREA/LandIndex/reportId', 'AMI100031'], ['Body/ADD_LandIndex_001/DATAAREA/LandIndex/requestKey', 'R3278458'], 
...
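
As for why the code in the question returns an empty dataframe: `DataFrame.append` returns a new object, so `d = d.append(...)` only rebinds the local name `d`, and the recursive call `traverse(child, d)` throws its return value away. The top-level call therefore returns the empty frame it was given. (`DataFrame.append` was also removed in pandas 2.0.) One way around this, sketched below, is to collect the rows in a plain list that is mutated in place. This sketch swaps lxml for the standard library's `xml.etree.ElementTree` and builds the path by hand during the recursion; the helper names `local_name`, `collect_rows`, and `xml_to_rows` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the SOAP envelope used in the question's file.
SOAP_NS = '{http://schemas.xmlsoap.org/soap/envelope/}'


def local_name(tag):
    """Drop a '{namespace}' prefix from an element tag, if present."""
    return tag.split('}', 1)[1] if tag.startswith('{') else tag


def collect_rows(el, parent_path, rows):
    """Append (path, text) for every leaf under el that has non-blank text."""
    name = local_name(el.tag)
    path = f"{parent_path}/{name}" if parent_path else name
    children = list(el)
    if children:
        for child in children:
            # rows is mutated in place, so no return value is needed.
            collect_rows(child, path, rows)
    elif el.text is not None and el.text.strip():
        rows.append((path, el.text))


def xml_to_rows(xml_text):
    """Return (path, value) pairs for the leaves under the SOAP Body."""
    root = ET.fromstring(xml_text)
    body = root.find(f'{SOAP_NS}Body')
    rows = []
    collect_rows(body, '', rows)
    return rows
```

The resulting list can then be turned into a dataframe in one step, for example with `pd.DataFrame(xml_to_rows(xml_text), columns=['Path', 'Value'])`, which also avoids growing a dataframe row by row. Empty elements such as `designatedrepresentative` are skipped, which matches the "only elements that contain a value" requirement.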
