Pandas 读取 xml 无法正常用于单标签 xml [英] Pandas read xml not working properly for single tag xml
问题描述
我正在使用 pandas_read_xml 包来读取 xml 文件并将其处理为 Pandas 数据帧.在绝大多数情况下,该软件包完全符合我的目的.但是,当读取只有一个标签的 url 时,数据帧输出有点关闭.让我用以下两个例子来说明这一点.
I am using the pandas_read_xml package for reading and processing xml files into a pandas dataframe. The package works absolutely fine for my purpose in the vast majority of cases. However, the dataframe output is kind of off when reading a url with just a single tag. Let me illustrate this with the following two examples.
# Import package
import pandas_read_xml as pdx
from pandas_read_xml import fully_flatten
# Example 1
url_1 = ‘https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml’
df_1 = pdx.read_xml(url_1,['edgarSubmission', 'formData','invstOrSecs', 'invstOrSec'])
df_1 = pdx.fully_flatten(df_1)
生成的 df_1 包含 163 行和 31 列,其中每一行对应一个唯一的证券.这符合我想要的结果.但是,当我尝试读取一个 xml 时,输出有点奇怪,其中只出现了一个标记invstOrSec".
The resulting df_1 contains 163 rows and 31 columns where each row corresponds to a unique security. This is in line with my desired result. However, the output is a little strange when I try to read a xml where there is just one occurrence of the tag 'invstOrSec'.
# Example 2
url_2 = ‘https://www.sec.gov/Archives/edgar/data/1279394/000114554921008162/primary_doc.xml’
df_2 = pdx.read_xml(url_2,['edgarSubmission', 'formData','invstOrSecs', 'invstOrSec'])
df_2 = pdx.fully_flatten(df_2)
生成的 df_2 包含 6 行和 19 列.我真的无法理解为什么它实际上应该是 1 行却包含 6 行.我观察到这种行为只发生在标签invstOrSec"只出现一次的情况下.对此的任何帮助将不胜感激.如果我的问题不清楚,请告诉我.
The resulting df_2 contains 6 rows and 19 columns. I can’t really make sense of why it contains 6 rows when it should really be one. I have observed that this behavior occurs in only those cases where there is just one occurrence of the tag 'invstOrSec'. Any help over this would be greatly appreciated. Please let me know if my question isn't clear.
推荐答案
确实,在即将发布的 Pandas 1.3 中,read_xml
将允许您将解析的节点迁移到数据帧中.但是,因为 XML 可以具有超出按列的 2D 行的许多维度,如下所述:
Indeed, in forthcoming Pandas 1.3, read_xml
will allow you to migrate parsed nodes into data frames. However, because XML can have many dimensions beyond the 2D of rows by columns, as noted:
此方法最适合导入浅层 XML 文档
This method is best designed to import shallow XML documents
因此,不会立即选取任何嵌套元素,如此处所示,大约有 20 列.请注意,由于文档中的默认命名空间,需要使用 命名空间
.
Therefore, any nested elements are not immediately picked up as shown here with about 20 columns. Notice the required use of namespaces
due to the default namespace in document.
熊猫 1.3+
url = "https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml"
df = pd.read_xml(url, xpath="//edgar:invstOrSec",
namespaces={"edgar": "http://www.sec.gov/edgar/nport"})
print(df)
# name lei title cusip ... fairValLevel securityLending assetCat debtSec
# 0 Tastemade Inc. NaN Tastemade Inc. 999999999 ... 3.0 NaN None NaN
# 1 Regatta XV Funding Ltd., Subordinated Note, Pr... NaN Regatta XV Funding Ltd., Subordinated Note, Pr... 75888PAC7 ... 2.0 NaN ABS-CBDO NaN
# 2 Hired, Inc., Series C Preferred Stock NaN Hired, Inc., Series C Preferred Stock NaN ... 3.0 NaN EP NaN
# 3 WESTVIEW CAPITAL PARTNERS II LP NaN WESTVIEW CAPITAL PARTNERS II LP 999999999 ... NaN NaN None NaN
# 4 VOYAGER CAPITAL FUND III, L.P. NaN VOYAGER CAPITAL FUND III, L.P. 999999999 ... NaN NaN None NaN
.. ... ... ... ... ... ... ... ... ...
# 158 ARCLIGHT ENERGY PARTNERS FUND V, L.P. NaN ARCLIGHT ENERGY PARTNERS FUND V, L.P. 999999999 ... NaN NaN None NaN
# 159 ALLOY MERCHANT PARTNERS L.P. NaN ALLOY MERCHANT PARTNERS L.P. 999999999 ... NaN NaN None NaN
# 160 ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ... NaN ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ... 999999999 ... NaN NaN None NaN
# 161 ABRY ADVANCED SECURITIES FUND LP NaN ABRY ADVANCED SECURITIES FUND LP 999999999 ... NaN NaN None NaN
# 162 ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F... NaN ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F... 999999999 ... NaN NaN None NaN
# [163 rows x 20 columns]
url = "https://www.sec.gov/Archives/edgar/data/1279394/000114554921008162/primary_doc.xml"
df = pd.read_xml(url, xpath="//edgar:invstOrSec",
namespaces={"edgar": "http://www.sec.gov/edgar/nport"})
print(df)
# name lei title cusip ... invCountry isRestrictedSec fairValLevel securityLending
# 0 Salient Private Access Master Fund, L.P. NaN Salient Private Access Master Fund, L.P. 999999999 ... US Y NaN NaN
# [1 rows x 18 columns]
幸运的是,read_xml
支持带有默认 lxml
解析器的 XSLT(旨在转换 XML 文档的专用语言).使用 XSLT,您可以将迁移所需的节点展平以检索 32 列.
Fortunately, read_xml
supports XSLT (special-purpose language designed to transform XML documents) with default lxml
parser. With XSLT, you can then flatten needed nodes for migration to retrieve the 32 columns.
xsl = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:edgar="http://www.sec.gov/edgar/nport">
<xsl:output method="xml" indent="yes" />
<xsl:strip-space elements="*"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="edgar:invstOrSec">
<xsl:copy>
<xsl:apply-templates select="*|*/*"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
"""
url = "https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml"
df = pd.read_xml(url, xpath="//edgar:invstOrSec", namespaces={"edgar": "http://www.sec.gov/edgar/nport"},
stylesheet=xsl)
print(df)
# name lei title cusip ... annualizedRt isDefault areIntrstPmntsInArrs isPaidKind
# 0 Tastemade Inc. NaN Tastemade Inc. 999999999 ... NaN None None None
# 1 Regatta XV Funding Ltd., Subordinated Note, Pr... NaN Regatta XV Funding Ltd., Subordinated Note, Pr... 75888PAC7 ... 0.0624 N N N
# 2 Hired, Inc., Series C Preferred Stock NaN Hired, Inc., Series C Preferred Stock NaN ... NaN None None None
# 3 WESTVIEW CAPITAL PARTNERS II LP NaN WESTVIEW CAPITAL PARTNERS II LP 999999999 ... NaN None None None
# 4 VOYAGER CAPITAL FUND III, L.P. NaN VOYAGER CAPITAL FUND III, L.P. 999999999 ... NaN None None None
.. ... ... ... ... ... ... ... ... ...
# 158 ARCLIGHT ENERGY PARTNERS FUND V, L.P. NaN ARCLIGHT ENERGY PARTNERS FUND V, L.P. 999999999 ... NaN None None None
# 159 ALLOY MERCHANT PARTNERS L.P. NaN ALLOY MERCHANT PARTNERS L.P. 999999999 ... NaN None None None
# 160 ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ... NaN ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ... 999999999 ... NaN None None None
# 161 ABRY ADVANCED SECURITIES FUND LP NaN ABRY ADVANCED SECURITIES FUND LP 999999999 ... NaN None None None
# 162 ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F... NaN ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F... 999999999 ... NaN None None None
# [163 rows x 32 columns]
熊猫<1.3
要通过 XPath 方法实现相同的结果需要更多步骤,您必须在这些步骤中处理 URL 请求和 XML 解析以构建数据框.具体来说,从转换、解析的 XML 创建一个字典列表,并传递到 DataFrame
构造函数.下面使用与上面相同的 XSLT 和 XPath 命名空间.
To achieve same result via XPath approach requires more steps where you will have to handle URL request and XML parsing to data frame build. Specifically, create a list of dictionaries from transformed, parsed XML and pass into DataFrame
constructor. Below uses same XSLT and XPath with namespace as above.
import lxml.etree as lx
import pandas as pd
import urllib.request as rq
url = "https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml"
xsl = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:edgar="http://www.sec.gov/edgar/nport">
<xsl:output method="xml" indent="yes" />
<xsl:strip-space elements="*"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="edgar:invstOrSec">
<xsl:copy>
<xsl:apply-templates select="*|*/*"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
"""
content = rq.urlopen(url)
# LOAD XML AND XSL
doc = lx.fromstring(content.read())
style = lx.fromstring(xsl)
# INITIALIZE AND TRANSFORM ORIGINAL DOC
transformer = lx.XSLT(style)
result = transformer(doc)
# RUN XPATH PARSING ON FLATTER XML
data = [{node.tag.split('}')[1]:node.text for node in inv.xpath("*")
} for inv in result.xpath("//edgar:invstOrSec",
namespaces={"edgar": "http://www.sec.gov/edgar/nport"})]
# BIND DATA FOR DATA FRAME
df = pd.DataFrame(data)
print(df)
# name lei title ... isDefault areIntrstPmntsInArrs isPaidKind
# 0 Tastemade Inc. N/A Tastemade Inc. ... NaN NaN NaN
# 1 Regatta XV Funding Ltd., Subordinated Note, Pr... N/A Regatta XV Funding Ltd., Subordinated Note, Pr... ... N N N
# 2 Hired, Inc., Series C Preferred Stock N/A Hired, Inc., Series C Preferred Stock ... NaN NaN NaN
# 3 WESTVIEW CAPITAL PARTNERS II LP N/A WESTVIEW CAPITAL PARTNERS II LP ... NaN NaN NaN
# 4 VOYAGER CAPITAL FUND III, L.P. N/A VOYAGER CAPITAL FUND III, L.P. ... NaN NaN NaN
# .. ... ... ... ... ... ... ...
# 158 ARCLIGHT ENERGY PARTNERS FUND V, L.P. N/A ARCLIGHT ENERGY PARTNERS FUND V, L.P. ... NaN NaN NaN
# 159 ALLOY MERCHANT PARTNERS L.P. N/A ALLOY MERCHANT PARTNERS L.P. ... NaN NaN NaN
# 160 ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ... N/A ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ... ... NaN NaN NaN
# 161 ABRY ADVANCED SECURITIES FUND LP N/A ABRY ADVANCED SECURITIES FUND LP ... NaN NaN NaN
# 162 ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F... N/A ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F... ... NaN NaN NaN
# [163 rows x 32 columns]
这篇关于Pandas 读取 xml 无法正常用于单标签 xml的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!