Read multiple XML files in R and combine the data

Problem description

I have a folder containing more than 1000 files with the .xml extension (they are not real XML files, though).

I want to extract certain contents from these files automatically, so that the end result is a matrix or table (which I can analyze further in R, export to a single CSV file, etc.).

I have adapted some code that works for a single file, but I can't get it to run automatically over the rest. Should I use a loop?

My code for a single file is as follows:

library(xml2)

temp <- read_xml("test.xml")
# get all the <ns2:opendataField>s
recs <- xml_find_all(temp, "//ns2:opendataField")
# extract and clean all the columns
vals <- trimws(xml_text(recs))
#create columns
cols <- xml_attr(xml_find_all(temp, "//ns2:opendataField"), "key")
#create rows
rows <- xml_attr(xml_find_all(temp, "//ns2:opendataField"), "value")
datakvk <- data.frame(cols,rows)

The result is:

 > head(datakvk)
                                              cols       rows
1                                  SbiBusinessCode      18129
2                             DocumentAdoptionDate 2017-08-22
3                                    FinancialYear       2016
4                                     BalanceSheet       <NA>
5 BalanceSheetBeforeAfterAppropriationResultsTitle       <NA>
6      BalanceSheetBeforeAfterAppropriationResults         Na
> 

In the end, with all these 1000s of files, I hope to get something like:

                                              cols       file 1   file 2
1                                  SbiBusinessCode      18129     34234
2                             DocumentAdoptionDate 2017-08-22     452454
3                                    FinancialYear       2016     2016
4                                     BalanceSheet       <NA>     2016
5 BalanceSheetBeforeAfterAppropriationResultsTitle       <NA>     <NA>
6      BalanceSheetBeforeAfterAppropriationResults         Na
> 

I tried the following code, but it didn't work:

list.files(pattern=".xml$") #

# create a list from these files
list.filenames<-list.files(pattern=".xml$")

# create an empty list that will serve as a container to receive the incoming files
list.data<-list()

# create a loop to read in your data
for (i in 1:length(list.filenames))
{
  list.data[[i]]<-read_xml(list.filenames[i])
  recs <- xml_find_all(list.data[[i]], "//ns2:opendataField")
  vals <- trimws(xml_text(recs))
  cols <- xml_attr(xml_find_all(list.data[[i]], "//ns2:opendataField"), "value")
  rows <- xml_attr(xml_find_all(list.data[[i]], "//ns2:opendataField"), "key")
}

# add the names of  data to the list
names(list.data)<-list.filenames

What am I missing? Where am I going wrong?

Thanks in advance for helping me.

For completeness, a single source file (one of the 1000s) looks like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<opendata xmlns:ns2="http://schemas.kvk.nl/xb/query/service/2016/1/0/0">
  <ns2:opendataField key="SbiBusinessCode" value="18129"/>
  <ns2:opendataField key="DocumentAdoptionDate" value="2017-08-22"/>
  <ns2:opendataField key="FinancialYear" value="2016"/>
  <ns2:opendataField key="BalanceSheet">
    <ns2:opendataField key="BalanceSheetBeforeAfterAppropriationResultsTitle">
      <ns2:opendataField key="BalanceSheetBeforeAfterAppropriationResults" value="Na"/>
    </ns2:opendataField>
    <ns2:opendataField key="BalanceSheetTitle">
      <ns2:opendataField key="Assets" value="61296">
        <ns2:opendataField key="AssetsNoncurrent" value="8978">
          <ns2:opendataField key="IntangibleAssets" value="8978"/>
        </ns2:opendataField>
        <ns2:opendataField key="AssetsCurrent" value="52318">
          <ns2:opendataField key="Inventories" value="2239"/>
          <ns2:opendataField key="Receivables" value="40560"/>
          <ns2:opendataField key="CashAndCashEquivalents" value="9519"/>
        </ns2:opendataField>
      </ns2:opendataField>
      <ns2:opendataField key="EquityAndLiabilities" value="61296">
        <ns2:opendataField key="Equity" value="201">
          <ns2:opendataField key="ShareCapital" value="1"/>
          <ns2:opendataField key="ReservesOther" value="200"/>
        </ns2:opendataField>
        <ns2:opendataField key="LiabilitiesCurrent" value="61095"/>
      </ns2:opendataField>
    </ns2:opendataField>
  </ns2:opendataField>
</opendata>

Recommended answer

Consider converting your for loop into an lapply call that builds one data.frame() per file, giving you a list of data frames. Because your XML files can potentially have different keys and values, a simple cbind of that list will not work, so chain-merge the data frames with Reduce(), keeping all rows (i.e., a full outer join).

...
# BUILD DATAFRAME LIST
df_list <- lapply(list.filenames, function(f) {
  doc <- read_xml(f)
  # select the <ns2:opendataField> nodes once and reuse them
  recs <- xml_find_all(doc, "//ns2:opendataField")

  # one data frame per file: the keys, plus a value column named after the file
  setNames(data.frame(
    xml_attr(recs, "key"),
    xml_attr(recs, "value")
  ), c("key", f))

})

# CHAIN MERGE INTO MASTER DATAFRAME
final_df <- Reduce(function(x,y) merge(x, y, by="key", all=TRUE), df_list)
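
To try the approach without the real data, here is a minimal self-contained sketch. It assumes two small made-up files written to a temporary folder (the file names a.xml and b.xml, the helper read_fields, and the trimmed-down contents are illustrative, not from the original data). The second file deliberately lacks FinancialYear and adds Assets, so the full outer join has something to fill with NA.

```r
library(xml2)

# Two hypothetical files modelled on the sample above (assumption: real files
# share the same ns2 namespace declaration on the root element).
ns <- 'xmlns:ns2="http://schemas.kvk.nl/xb/query/service/2016/1/0/0"'
xml_a <- sprintf('<opendata %s>
  <ns2:opendataField key="SbiBusinessCode" value="18129"/>
  <ns2:opendataField key="FinancialYear" value="2016"/>
</opendata>', ns)
xml_b <- sprintf('<opendata %s>
  <ns2:opendataField key="SbiBusinessCode" value="34234"/>
  <ns2:opendataField key="Assets" value="61296"/>
</opendata>', ns)

# Write them to a temporary folder so the example touches no real data.
demo_dir <- tempfile("xml_demo")
dir.create(demo_dir)
writeLines(xml_a, file.path(demo_dir, "a.xml"))
writeLines(xml_b, file.path(demo_dir, "b.xml"))

# Escape the dot so the pattern matches a literal ".xml" suffix.
files <- list.files(demo_dir, pattern = "\\.xml$", full.names = TRUE)

# Hypothetical helper: one data frame per file, with the value column
# named after the file it came from.
read_fields <- function(f) {
  doc <- read_xml(f)
  recs <- xml_find_all(doc, "//ns2:opendataField")
  df <- data.frame(
    key = xml_attr(recs, "key"),
    value = xml_attr(recs, "value"),
    stringsAsFactors = FALSE
  )
  names(df)[2] <- basename(f)
  df
}

df_list <- lapply(files, read_fields)

# Full outer join on "key": keys missing from a file become NA in its column.
final_df <- Reduce(function(x, y) merge(x, y, by = "key", all = TRUE), df_list)
print(final_df)
```

The original loop overwrote cols and rows on every iteration and never stored a data frame, which is why nothing usable came out of it; returning a data frame from each lapply call avoids that. One caveat: if a key occurs more than once within a file, merge() will produce one row per matching combination, so duplicated keys may need to be made unique first.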
