BeautifulSoup parsing XML to table
Question
I'm back with another issue. I'm really new to parsing XML with BeautifulSoup and have been stuck on this problem for two weeks now, so I would appreciate your help. I have this structure:
<detail>
<page number="01">
<Bloc code="AF" A="000000000002550" B="000000000002550"/>
<Bloc code="AH" A="000000000035826" C="000000000035826" D="000000000035826"/>
<Bloc code="AR" A="000000000026935" B="000000000024503" C="000000000002431" D="000000000001669"/>
</page>
<page number="02">
<Bloc code="DA" A="000000000038486" B="000000000038486"/>
<Bloc code="DD" A="000000000003849" B="000000000003849"/>
<Bloc code="EA" A="000000000001029"/>
<Bloc code="EC" A="000000000063797" B="000000000082427"/>
</page>
<page number="03">
<Bloc code="FD" C="000000000574042" D="000000000610740"/>
<Bloc code="GW" C="000000000052677" D="000000000075362"/>
</page>
</detail>
This is my code (I know it is poor and needs to be improved):
if soup.find_all('bloc') != None:
    for element in soup.find_all('bloc'):
        code_element = element['code']
        if element.find('m1'):
            m1_element = element['m1']
        else:
            None
        if element.find('m2'):
            m2_element = element['m2']
        else:
            None
        print(code_element, m1_element, m2_element)
I get an error because the 'm2' element does not exist in all the pages. I don't know how to handle this.
I would like to put the result in a DataFrame like this:
CODE  A           B             C           D           Page
AF    0000002550  00002550      NULL        NULL        01
AH    000035826   NULL          000035826   0000035826  01
AR    000026935   000000024503  0000002431  0000001669  01
....etc.
Thank you very much for your help.
Recommended answer
The core is a list comprehension over the bloc elements with an embedded dict of each bloc's attributes. The page number is added by merging one more entry into that dict: navigate from the bloc to its parent element and read the required attribute (number).
Column order follows the order in which the attributes are first seen.
import pandas as pd
from bs4 import BeautifulSoup
xml = """<detail>
<page number="01">
<Bloc code="AF" A="000000000002550" B="000000000002550"/>
<Bloc code="AH" A="000000000035826" C="000000000035826" D="000000000035826"/>
<Bloc code="AR" A="000000000026935" B="000000000024503" C="000000000002431" D="000000000001669"/>
</page>
<page number="02">
<Bloc code="DA" A="000000000038486" B="000000000038486"/>
<Bloc code="DD" A="000000000003849" B="000000000003849"/>
<Bloc code="EA" A="000000000001029"/>
<Bloc code="EC" A="000000000063797" B="000000000082427"/>
</page>
<page number="03">
<Bloc code="FD" C="000000000574042" D="000000000610740"/>
<Bloc code="GW" C="000000000052677" D="000000000075362"/>
</page>
</detail>"""
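# The parser is named explicitly here; HTML parsers lowercase tag and
# attribute names, hence "bloc" and the a/b/c/d columns in the output.
soup = BeautifulSoup(xml, "html.parser")
# One dict per bloc: its own attributes plus the page number of its parent <page>.
df = pd.DataFrame([{**b.attrs, "page": b.parent["number"]}
                   for b in soup.find_all("bloc")])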
Output
code  a                b                page  c                d
AF    000000000002550  000000000002550  01    NaN              NaN
AH    000000000035826  NaN              01    000000000035826  000000000035826
AR    000000000026935  000000000024503  01    000000000002431  000000000001669
DA    000000000038486  000000000038486  02    NaN              NaN
DD    000000000003849  000000000003849  02    NaN              NaN
EA    000000000001029  NaN              02    NaN              NaN
EC    000000000063797  000000000082427  02    NaN              NaN
FD    NaN              NaN              03    000000000574042  000000000610740
GW    NaN              NaN              03    000000000052677  000000000075362
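If a fixed column order is preferred, or the explicit loop from the question is kept, one possible approach is to read each attribute with Tag.get, which returns None for a missing attribute instead of raising KeyError. The sketch below is only an illustration: the COLUMNS list is an assumption about which attributes matter, the XML is a shortened copy of the sample, and "html.parser" is chosen here so that names come out lowercased.

```python
from bs4 import BeautifulSoup

xml = """<detail>
<page number="01">
<Bloc code="AF" A="000000000002550" B="000000000002550"/>
<Bloc code="AH" A="000000000035826" C="000000000035826" D="000000000035826"/>
</page>
<page number="03">
<Bloc code="FD" C="000000000574042" D="000000000610740"/>
</page>
</detail>"""

# html.parser lowercases tag and attribute names, hence "bloc", "a", "b", ...
soup = BeautifulSoup(xml, "html.parser")

# Assumed fixed set of attributes, in the order the columns should appear.
COLUMNS = ["code", "a", "b", "c", "d"]

rows = []
for bloc in soup.find_all("bloc"):
    # Tag.get returns None when the attribute is absent, so no KeyError is raised.
    row = {name: bloc.get(name) for name in COLUMNS}
    # The page number lives on the parent <page> element.
    row["page"] = bloc.parent.get("number")
    rows.append(row)

for row in rows:
    print(row)
```

The resulting list of dicts can then be passed to pd.DataFrame(rows), where the None entries appear as missing values.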
ElementTree
Very similar to the BeautifulSoup version.
import xml.etree.ElementTree as ET
root = ET.fromstring(xml)
# ElementTree keeps the original case, so the tag is "Bloc" and the columns are A/B/C/D.
df2 = pd.DataFrame([{**b.attrib, "page": p.attrib["number"]}
                    for p in root.iter("page")
                    for b in p.iter("Bloc")])
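One difference worth noting is that ElementTree preserves the case of tag and attribute names, so its columns come out as A, B, C, D rather than a, b, c, d. If the two results need to line up, a possible sketch (an assumption about what is wanted, not part of the original answer) is to lowercase the keys while building the rows. It uses a shortened copy of the sample XML and only the standard library.

```python
import xml.etree.ElementTree as ET

xml = """<detail>
<page number="02">
<Bloc code="EA" A="000000000001029"/>
<Bloc code="EC" A="000000000063797" B="000000000082427"/>
</page>
</detail>"""

root = ET.fromstring(xml)

rows = [
    # Lowercase each attribute name, then add the page number of the enclosing <page>.
    {**{key.lower(): value for key, value in bloc.attrib.items()},
     "page": page.get("number")}
    for page in root.iter("page")
    for bloc in page.iter("Bloc")  # tag matching in ElementTree is case-sensitive
]

print(rows)
```

Passing rows to pd.DataFrame would then give lowercase column names, as in the BeautifulSoup output.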