BeautifulSoup parsing XML to table


Question

I'm back with another issue. I'm really new to parsing XML with BeautifulSoup and have been stuck on this problem for two weeks. I would appreciate your help. I have this structure:

<detail>
<page number="01">
    <Bloc code="AF" A="000000000002550" B="000000000002550"/>
    <Bloc code="AH" A="000000000035826" C="000000000035826" D="000000000035826"/>
    <Bloc code="AR" A="000000000026935" B="000000000024503" C="000000000002431" D="000000000001669"/>
</page>
<page number="02">
    <Bloc code="DA" A="000000000038486" B="000000000038486"/>
    <Bloc code="DD" A="000000000003849" B="000000000003849"/>
    <Bloc code="EA" A="000000000001029"/>
    <Bloc code="EC" A="000000000063797" B="000000000082427"/>
</page>
    <page number="03">
    <Bloc code="FD" C="000000000574042" D="000000000610740"/>
    <Bloc code="GW" C="000000000052677" D="000000000075362"/>
</page>
</detail>

This is my code (I know it is poor and needs improving):

if soup.find_all('bloc') != None:
    for element in soup.find_all('bloc'):
        code_element = element['code']
        if element.find('m1'):
            m1_element  = element['m1']
        else:
            None
        if element.find('m2'):
            m2_element  = element['m2']
        else:
            None
        print(code_element,m1_element, m2_element)

I get an error because the 'm2' element does not exist in all the pages. I don't know how to handle this.

I would like to put the result in a DataFrame like this:

CODE  A           B             C           D           Page
AF    0000002550  00002550      NULL        NULL        01
AH    000035826   NULL          000035826   0000035826  01
AR    000026935   000000024503  0000002431  0000001669  01
... etc.

Thank you very much for your help.

Recommended answer

The core is a list comprehension over the bloc elements with an embedded dict comprehension over each bloc's attributes. The page is added to each dict of bloc attributes by navigating to the parent element and reading its number attribute.

Column order follows the order in which the attributes are first seen.

from bs4 import BeautifulSoup
import pandas as pd
xml = """<detail>
<page number="01">
    <Bloc code="AF" A="000000000002550" B="000000000002550"/>
    <Bloc code="AH" A="000000000035826" C="000000000035826" D="000000000035826"/>
    <Bloc code="AR" A="000000000026935" B="000000000024503" C="000000000002431" D="000000000001669"/>
</page>
<page number="02">
    <Bloc code="DA" A="000000000038486" B="000000000038486"/>
    <Bloc code="DD" A="000000000003849" B="000000000003849"/>
    <Bloc code="EA" A="000000000001029"/>
    <Bloc code="EC" A="000000000063797" B="000000000082427"/>
</page>
    <page number="03">
    <Bloc code="FD" C="000000000574042" D="000000000610740"/>
    <Bloc code="GW" C="000000000052677" D="000000000075362"/>
</page>
</detail>"""

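# html.parser lower-cases tag and attribute names, hence "bloc" and columns a, b, c, d
soup = BeautifulSoup(xml, "html.parser")
df = pd.DataFrame([{**{k: b[k] for k in b.attrs.keys()}, **{"page": b.parent["number"]}}
                   for b in soup.find_all("bloc")])
print(df.to_string(index=False))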


Output

code               a               b page               c               d
  AF 000000000002550 000000000002550   01             NaN             NaN
  AH 000000000035826             NaN   01 000000000035826 000000000035826
  AR 000000000026935 000000000024503   01 000000000002431 000000000001669
  DA 000000000038486 000000000038486   02             NaN             NaN
  DD 000000000003849 000000000003849   02             NaN             NaN
  EA 000000000001029             NaN   02             NaN             NaN
  EC 000000000063797 000000000082427   02             NaN             NaN
  FD             NaN             NaN   03 000000000574042 000000000610740
  GW             NaN             NaN   03 000000000052677 000000000075362
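
For readers who prefer the explicit loop from the question, the following is a minimal sketch of how the missing-attribute error could be avoided. It assumes the attribute names from the posted XML (A, B, C, D) rather than the m1/m2 names in the question's code, and it uses a shortened copy of the XML. The variable names rows and row are illustrative only. Tag.get returns None when an attribute is absent, whereas indexing with square brackets raises KeyError.

```python
from bs4 import BeautifulSoup

# Shortened copy of the XML from the question, for illustration only.
xml = """<detail>
<page number="01">
    <Bloc code="AF" A="000000000002550" B="000000000002550"/>
    <Bloc code="AH" A="000000000035826" C="000000000035826" D="000000000035826"/>
</page>
<page number="03">
    <Bloc code="FD" C="000000000574042" D="000000000610740"/>
</page>
</detail>"""

# html.parser lower-cases names, so "Bloc" becomes "bloc" and "A" becomes "a".
soup = BeautifulSoup(xml, "html.parser")

rows = []
for bloc in soup.find_all("bloc"):
    row = {"code": bloc.get("code")}
    for name in ("a", "b", "c", "d"):
        # Tag.get returns None for a missing attribute; bloc[name] would raise KeyError.
        row[name] = bloc.get(name)
    # The page number lives on the parent <page> element.
    row["page"] = bloc.parent.get("number")
    rows.append(row)

for row in rows:
    print(row)
```

Because every row carries the same keys, passing rows to pd.DataFrame should give a fixed column order regardless of which attributes the first element happens to have.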

ElementTree

Very similar to BeautifulSoup. ElementTree is case-sensitive and keeps names as written, so the tag is matched as Bloc and the columns come out as A, B, C, D.

import xml.etree.ElementTree as ET
root = ET.fromstring(xml)
df2 = pd.DataFrame([{**b.attrib, **{"page":p.attrib["number"]}} 
                    for p in root.iter("page") 
                    for b in p.iter("Bloc") ])
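
The question asked for the columns in the order CODE, A, B, C, D, Page, while the comprehension yields them in first-seen order. The sketch below shows one possible post-processing step using a shortened copy of the XML. The names xml_text, records, wanted and value_cols are illustrative, and the integer conversion assumes the leading zeros are padding rather than meaningful data.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Shortened copy of the XML from the question, for illustration only.
xml_text = """<detail>
<page number="01">
    <Bloc code="AF" A="000000000002550" B="000000000002550"/>
    <Bloc code="AH" A="000000000035826" C="000000000035826" D="000000000035826"/>
</page>
<page number="03">
    <Bloc code="FD" C="000000000574042" D="000000000610740"/>
</page>
</detail>"""

root = ET.fromstring(xml_text)
records = [{**bloc.attrib, "page": page.attrib["number"]}
           for page in root.iter("page")
           for bloc in page.iter("Bloc")]

# reindex fixes the column order and fills any absent column with NaN.
wanted = ["code", "A", "B", "C", "D", "page"]
df = pd.DataFrame(records).reindex(columns=wanted)

# Optional: turn the zero-padded strings into nullable integers.
# This drops the leading zeros, so skip it if they carry meaning.
value_cols = ["A", "B", "C", "D"]
df[value_cols] = df[value_cols].apply(pd.to_numeric).astype("Int64")

print(df)
```

The nullable Int64 dtype keeps missing values as <NA> instead of forcing the columns to floats; the page column is left as a string so that "01" keeps its leading zero.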
