Pandas 读取 xml 无法正常用于单标签 xml [英] Pandas read xml not working properly for single tag xml

查看:31
本文介绍了Pandas 读取 xml 无法正常用于单标签 xml的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用 pandas_read_xml 包来读取 xml 文件并将其处理为 Pandas 数据帧.在绝大多数情况下,该软件包完全符合我的目的.但是,当读取只有一个标签的 url 时,数据帧输出有点关闭.让我用以下两个例子来说明这一点.

I am using the pandas_read_xml package for reading and processing xml files into a pandas dataframe. The package works absolutely fine for my purpose in the vast majority of cases. However, the dataframe output is kind of off when reading a url with just a single tag. Let me illustrate this with the following two examples.

# Import package
import pandas_read_xml as pdx
from pandas_read_xml import fully_flatten

# Example 1
url_1 = ‘https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml’
df_1 =  pdx.read_xml(url_1,['edgarSubmission', 'formData','invstOrSecs', 'invstOrSec'])
df_1 = pdx.fully_flatten(df_1)

生成的 df_1 包含 163 行和 31 列,其中每一行对应一个唯一的证券.这符合我想要的结果.但是,当我尝试读取一个 xml 时,输出有点奇怪,其中只出现了一个标记invstOrSec".

The resulting df_1 contains 163 rows and 31 columns where each row corresponds to a unique security. This is in line with my desired result. However, the output is a little strange when I try to read a xml where there is just one occurrence of the tag 'invstOrSec'.

# Example 2
url_2 = ‘https://www.sec.gov/Archives/edgar/data/1279394/000114554921008162/primary_doc.xml’
df_2  = pdx.read_xml(url_2,['edgarSubmission', 'formData','invstOrSecs', 'invstOrSec'])
df_2 = pdx.fully_flatten(df_2)

生成的 df_2 包含 6 行和 19 列.我真的无法理解为什么它实际上应该是 1 行却包含 6 行.我观察到这种行为只发生在标签invstOrSec"只出现一次的情况下.对此的任何帮助将不胜感激.如果我的问题不清楚,请告诉我.

The resulting df_2 contains 6 rows and 19 columns. I can’t really make sense of why it contains 6 rows when it should really be one. I have observed that this behavior occurs in only those cases where there is just one occurrence of the tag 'invstOrSec'. Any help over this would be greatly appreciated. Please let me know if my question isn't clear.

推荐答案

确实,在即将发布的 Pandas 1.3 中,read_xml 将允许您将解析的节点迁移到数据帧中.但是,因为 XML 可以具有超出按列的 2D 行的许多维度,如下所述:

Indeed, in forthcoming Pandas 1.3, read_xml will allow you to migrate parsed nodes into data frames. However, because XML can have many dimensions beyond the 2D of rows by columns, as noted:

此方法最适合导入浅层 XML 文档

This method is best designed to import shallow XML documents

因此,不会立即选取任何嵌套元素,如此处所示,大约有 20 列.请注意,由于文档中的默认命名空间,需要使用 命名空间.

Therefore, any nested elements are not immediately picked up as shown here with about 20 columns. Notice the required use of namespaces due to the default namespace in document.

熊猫 1.3+

url = "https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml"
df = pd.read_xml(url, xpath="//edgar:invstOrSec", 
                 namespaces={"edgar": "http://www.sec.gov/edgar/nport"})

print(df)
#                                                   name  lei                                              title      cusip  ...  fairValLevel  securityLending  assetCat debtSec
# 0                                       Tastemade Inc.  NaN                                     Tastemade Inc.  999999999  ...           3.0              NaN      None     NaN
# 1    Regatta XV Funding Ltd., Subordinated Note, Pr...  NaN  Regatta XV Funding Ltd., Subordinated Note, Pr...  75888PAC7  ...           2.0              NaN  ABS-CBDO     NaN
# 2                Hired, Inc., Series C Preferred Stock  NaN              Hired, Inc., Series C Preferred Stock        NaN  ...           3.0              NaN        EP     NaN
# 3                      WESTVIEW CAPITAL PARTNERS II LP  NaN                    WESTVIEW CAPITAL PARTNERS II LP  999999999  ...           NaN              NaN      None     NaN
# 4                       VOYAGER CAPITAL FUND III, L.P.  NaN                     VOYAGER CAPITAL FUND III, L.P.  999999999  ...           NaN              NaN      None     NaN
..                                                 ...  ...                                                ...        ...  ...           ...              ...       ...     ...
# 158              ARCLIGHT ENERGY PARTNERS FUND V, L.P.  NaN              ARCLIGHT ENERGY PARTNERS FUND V, L.P.  999999999  ...           NaN              NaN      None     NaN
# 159                       ALLOY MERCHANT PARTNERS L.P.  NaN                       ALLOY MERCHANT PARTNERS L.P.  999999999  ...           NaN              NaN      None     NaN
# 160  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ...  NaN  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ...  999999999  ...           NaN              NaN      None     NaN
# 161                   ABRY ADVANCED SECURITIES FUND LP  NaN                   ABRY ADVANCED SECURITIES FUND LP  999999999  ...           NaN              NaN      None     NaN
# 162  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F...  NaN  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F...  999999999  ...           NaN              NaN      None     NaN

# [163 rows x 20 columns]


url = "https://www.sec.gov/Archives/edgar/data/1279394/000114554921008162/primary_doc.xml"
df = pd.read_xml(url, xpath="//edgar:invstOrSec", 
                 namespaces={"edgar": "http://www.sec.gov/edgar/nport"})

print(df)
#                                        name  lei                                     title      cusip  ...  invCountry  isRestrictedSec fairValLevel securityLending
# 0  Salient Private Access Master Fund, L.P.  NaN  Salient Private Access Master Fund, L.P.  999999999  ...          US                Y          NaN             NaN

# [1 rows x 18 columns]

幸运的是,read_xml 支持带有默认 lxml 解析器的 XSLT(旨在转换 XML 文档的专用语言).使用 XSLT,您可以将迁移所需的节点展平以检索 32 列.

Fortunately, read_xml supports XSLT (special-purpose language designed to transform XML documents) with default lxml parser. With XSLT, you can then flatten needed nodes for migration to retrieve the 32 columns.

xsl = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                                       xmlns:edgar="http://www.sec.gov/edgar/nport">
    <xsl:output method="xml" indent="yes" />
    <xsl:strip-space elements="*"/>

    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="edgar:invstOrSec">
        <xsl:copy>
            <xsl:apply-templates select="*|*/*"/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>
"""

url = "https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml"
df = pd.read_xml(url, xpath="//edgar:invstOrSec", namespaces={"edgar": "http://www.sec.gov/edgar/nport"},
                 stylesheet=xsl)
print(df)
#                                                   name  lei                                              title      cusip  ...  annualizedRt  isDefault  areIntrstPmntsInArrs  isPaidKind
# 0                                       Tastemade Inc.  NaN                                     Tastemade Inc.  999999999  ...           NaN       None                  None        None
# 1    Regatta XV Funding Ltd., Subordinated Note, Pr...  NaN  Regatta XV Funding Ltd., Subordinated Note, Pr...  75888PAC7  ...        0.0624          N                     N           N
# 2                Hired, Inc., Series C Preferred Stock  NaN              Hired, Inc., Series C Preferred Stock        NaN  ...           NaN       None                  None        None
# 3                      WESTVIEW CAPITAL PARTNERS II LP  NaN                    WESTVIEW CAPITAL PARTNERS II LP  999999999  ...           NaN       None                  None        None
# 4                       VOYAGER CAPITAL FUND III, L.P.  NaN                     VOYAGER CAPITAL FUND III, L.P.  999999999  ...           NaN       None                  None        None
..                                                 ...  ...                                                ...        ...  ...           ...        ...                   ...         ...
# 158              ARCLIGHT ENERGY PARTNERS FUND V, L.P.  NaN              ARCLIGHT ENERGY PARTNERS FUND V, L.P.  999999999  ...           NaN       None                  None        None
# 159                       ALLOY MERCHANT PARTNERS L.P.  NaN                       ALLOY MERCHANT PARTNERS L.P.  999999999  ...           NaN       None                  None        None
# 160  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ...  NaN  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ...  999999999  ...           NaN       None                  None        None
# 161                   ABRY ADVANCED SECURITIES FUND LP  NaN                   ABRY ADVANCED SECURITIES FUND LP  999999999  ...           NaN       None                  None        None
# 162  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F...  NaN  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F...  999999999  ...           NaN       None                  None        None

# [163 rows x 32 columns]

熊猫<1.3

要通过 XPath 方法实现相同的结果需要更多步骤,您必须在这些步骤中处理 URL 请求和 XML 解析以构建数据框.具体来说,从转换、解析的 XML 创建一个字典列表,并传递到 DataFrame 构造函数.下面使用与上面相同的 XSLT 和 XPath 命名空间.

To achieve same result via XPath approach requires more steps where you will have to handle URL request and XML parsing to data frame build. Specifically, create a list of dictionaries from transformed, parsed XML and pass into DataFrame constructor. Below uses same XSLT and XPath with namespace as above.

import lxml.etree as lx
import pandas as pd
import urllib.request as rq

url = "https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml"

xsl = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                                       xmlns:edgar="http://www.sec.gov/edgar/nport">
    <xsl:output method="xml" indent="yes" />
    <xsl:strip-space elements="*"/>

    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="edgar:invstOrSec">
        <xsl:copy>
            <xsl:apply-templates select="*|*/*"/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>
"""

content = rq.urlopen(url)

# LOAD XML AND XSL
doc = lx.fromstring(content.read())
style = lx.fromstring(xsl)

# INITIALIZE AND TRANSFORM ORIGINAL DOC
transformer = lx.XSLT(style)
result = transformer(doc)

# RUN XPATH PARSING ON FLATTER XML
data = [{node.tag.split('}')[1]:node.text for node in inv.xpath("*")
        } for inv in result.xpath("//edgar:invstOrSec", 
                                 namespaces={"edgar": "http://www.sec.gov/edgar/nport"})]

# BIND DATA FOR DATA FRAME
df = pd.DataFrame(data)

print(df)
#                                                   name  lei                                              title  ... isDefault areIntrstPmntsInArrs  isPaidKind
# 0                                       Tastemade Inc.  N/A                                     Tastemade Inc.  ...       NaN                  NaN         NaN
# 1    Regatta XV Funding Ltd., Subordinated Note, Pr...  N/A  Regatta XV Funding Ltd., Subordinated Note, Pr...  ...         N                    N           N
# 2                Hired, Inc., Series C Preferred Stock  N/A              Hired, Inc., Series C Preferred Stock  ...       NaN                  NaN         NaN
# 3                      WESTVIEW CAPITAL PARTNERS II LP  N/A                    WESTVIEW CAPITAL PARTNERS II LP  ...       NaN                  NaN         NaN
# 4                       VOYAGER CAPITAL FUND III, L.P.  N/A                     VOYAGER CAPITAL FUND III, L.P.  ...       NaN                  NaN         NaN
# ..                                                 ...  ...                                                ...  ...       ...                  ...         ...
# 158              ARCLIGHT ENERGY PARTNERS FUND V, L.P.  N/A              ARCLIGHT ENERGY PARTNERS FUND V, L.P.  ...       NaN                  NaN         NaN
# 159                       ALLOY MERCHANT PARTNERS L.P.  N/A                       ALLOY MERCHANT PARTNERS L.P.  ...       NaN                  NaN         NaN
# 160  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ...  N/A  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ...  ...       NaN                  NaN         NaN
# 161                   ABRY ADVANCED SECURITIES FUND LP  N/A                   ABRY ADVANCED SECURITIES FUND LP  ...       NaN                  NaN         NaN
# 162  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F...  N/A  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F...  ...       NaN                  NaN         NaN

# [163 rows x 32 columns]

这篇关于Pandas 读取 xml 无法正常用于单标签 xml的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆