XML data in R with differing file structures


Question

I need to parse 2000 XML files. I have managed to set things up so that the data is extracted from the files automatically. Since I am a complete beginner the code may look messy; here is an example:

library(XML)

# pattern is a regular expression, so escape the dot and anchor the extension
filenames <- list.files("C:/...", recursive=TRUE, full.names=TRUE, pattern="\\.xml$")

name <- unlist(lapply(filenames, function(f) {  
  xml <- xmlParse(f)  
  xpathSApply(xml, "//...", xmlValue)
}))
data <- data.frame(name)

This works for most of the data I need, but some files lack certain nodes, so the extracted vectors have different numbers of rows and I cannot combine them. The files look like this.

File 1:

<Kontaktdaten>
   <Name> Name </Name>
   <ID>12345678</ID>
   <Kontakt_Zugang>
       <Strasse>ABC-Strasse</Strasse>
       <Hausnummer>1</Hausnummer>
       <Postleitzahl>12345</Postleitzahl>
       <Ort>ABC</Ort>
   </Kontakt_Zugang> 
</Kontaktdaten>

File 2 (where "Hausnummer" is missing, for example):

<Kontaktdaten>
   <Name> Name2 </Name>
   <ID>8765321</ID>
   <Kontakt_Zugang>
       <Strasse>CBA-Strasse</Strasse>
       <Postleitzahl>54321</Postleitzahl>
       <Ort>CBA</Ort>
   </Kontakt_Zugang> 
</Kontaktdaten>

Is there a way to combine them in one data.frame anyway, or to create a second data.frame containing only "Hausnummer" and the ID?

This is only an example to illustrate the problem. The original files contain up to 500 nodes, and some nodes occur twice.

Recommended answer

Consider XSLT, a special-purpose language designed to transform XML files. Here it can flatten the nested Kontakt_Zugang node so that its children sit directly under Kontaktdaten, which makes each file easy to import into R and convert to a data frame.

XSLT (save as an .xsl file; it is parsed in R like any other .xml file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:strip-space elements="*"/>

<!-- identity transform -->
<xsl:template match="@*|node()">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
</xsl:template>

<xsl:template match="Kontakt_Zugang">
    <xsl:apply-templates select="@*|node()"/>
</xsl:template>

</xsl:stylesheet>
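
To see what the stylesheet does before running it over all files, here is a minimal sketch that applies it to the File 2 example. It assumes the xml2 and xslt packages are installed; the stylesheet and the document are passed as literal strings purely for illustration, whereas real use would read them from disk.

```r
library(xml2)
library(xslt)

# The stylesheet from above, given as a literal string for illustration
style <- read_xml('<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <xsl:template match="Kontakt_Zugang">
    <xsl:apply-templates select="@*|node()"/>
  </xsl:template>
</xsl:stylesheet>')

# File 2 from the question (no Hausnummer node)
doc <- read_xml("<Kontaktdaten>
  <Name> Name2 </Name>
  <ID>8765321</ID>
  <Kontakt_Zugang>
    <Strasse>CBA-Strasse</Strasse>
    <Postleitzahl>54321</Postleitzahl>
    <Ort>CBA</Ort>
  </Kontakt_Zugang>
</Kontaktdaten>")

# Apply the transformation and list the elements directly under the root
flat <- xml_xslt(doc, style)
fields <- xml_find_all(flat, "/Kontaktdaten/*")
field_names <- xml_name(fields)
print(field_names)
```

After the transformation the address fields are siblings of Name and ID, so a single pass over the children of Kontaktdaten collects every value.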


R

library(xml2)
library(xslt)

# RETRIEVE XML FILE NAMES
# (pattern is a regular expression: escape the dot and anchor the extension)
filenames <- list.files("C:/...", recursive=TRUE, full.names=TRUE, pattern="\\.xml$")
all_cols <- c("Name", "ID", "Strasse", "Hausnummer", "Postleitzahl", "Ort")

# PARSE XSLT
style <- read_xml("/path/to/xslt_script.xsl")

df_list <- lapply(filenames, function(f) {  
  # PARSE XML
  xml <- xml2::read_xml(f)    
  # TRANSFORM INPUT INTO OUTPUT
  new_xml <- xslt::xml_xslt(xml, style)

  # BUILD DATA FRAME
  vals <- xml_children(xml_find_all(new_xml, "//Kontaktdaten"))
  df <- setNames(data.frame(t(trimws(xml_text(vals)))), xml_name(vals))

  # FILL IN MISSING COLUMNS
  df[all_cols[!(all_cols %in% colnames(df))]] <- NA

  return(df[all_cols])
})

final_df <- do.call(rbind, df_list)
final_df
#    Name       ID     Strasse Hausnummer Postleitzahl Ort
# 1  Name 12345678 ABC-Strasse          1        12345 ABC
# 2 Name2  8765321 CBA-Strasse       <NA>        54321 CBA
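
The column-filling step can also be tried in isolation, without any XML involved. The sketch below uses base R only; the two records, the helper name `record_to_row`, and the choice to join repeated nodes with "; " are assumptions made for illustration and are not part of the original answer. It also shows one possible way to cope with the doubled nodes mentioned in the question.

```r
# Hypothetical records, as they might come out of two parsed files.
# Names are node names, values are node text.
rec1 <- c(Name = "Name", ID = "12345678", Strasse = "ABC-Strasse",
          Hausnummer = "1", Postleitzahl = "12345", Ort = "ABC")
# No Hausnummer, and "Ort" occurs twice
rec2 <- c(Name = "Name2", ID = "8765321", Strasse = "CBA-Strasse",
          Postleitzahl = "54321", Ort = "CBA", Ort = "CBA-Nord")

all_cols <- c("Name", "ID", "Strasse", "Hausnummer", "Postleitzahl", "Ort")

# Turn one record into a one-row data frame with exactly the columns in `cols`
record_to_row <- function(rec, cols) {
  # Collapse repeated node names into a single "; "-separated value
  collapsed <- vapply(split(unname(rec), names(rec)), paste,
                      character(1), collapse = "; ")
  # Indexing by name yields NA for any column the record lacks
  row <- unname(collapsed[cols])
  as.data.frame(as.list(setNames(row, cols)), stringsAsFactors = FALSE)
}

rows <- lapply(list(rec1, rec2), record_to_row, cols = all_cols)
combined <- do.call(rbind, rows)
print(combined)
```

Because every row is forced into the same column set before `rbind`, files with missing or repeated nodes no longer cause a mismatch in the number of columns.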

By the way, because XSLT is a special-purpose language, it is not tied to R. Any language with an XSLT processor, such as Java, PHP, or Python, can run it, and so can external processors that R invokes from the command line. For example, the following uses xsltproc, which is available on Unix-like systems (i.e., macOS and Linux):

# COMMAND LINE CALL TO UNIX'S XSLTPROC (ALTERNATIVE TO xslt PACKAGE)
# ARGUMENT ORDER: -o <output> <stylesheet> <input>
system("xsltproc -o /path/to/output.xml /path/to/xslt_script.xsl /path/to/input.xml")
doc <- xml2::read_xml("/path/to/output.xml")
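
With 2000 files the command-line route needs a loop. The following is a rough sketch using base R's `system2`, assuming xsltproc is on the PATH. The helper names `xsltproc_args` and `transform_all`, the `_flat.xml` suffix, and the `dry_run` switch are all hypothetical choices for illustration. xsltproc takes the stylesheet before the input file, and `-o` names the output file.

```r
# Build the argument vector for one xsltproc run:
# -o <output> <stylesheet> <input>
xsltproc_args <- function(input, stylesheet, output) {
  c("-o", shQuote(output), shQuote(stylesheet), shQuote(input))
}

# Transform every input file, writing "<name>_flat.xml" into out_dir.
# With dry_run = TRUE nothing is executed; only the planned output paths are returned.
transform_all <- function(inputs, stylesheet, out_dir, dry_run = FALSE) {
  outputs <- file.path(out_dir, sub("\\.xml$", "_flat.xml", basename(inputs)))
  if (!dry_run) {
    for (i in seq_along(inputs)) {
      status <- system2("xsltproc",
                        xsltproc_args(inputs[i], stylesheet, outputs[i]))
      if (status != 0) warning("xsltproc failed for ", inputs[i])
    }
  }
  outputs
}

# Dry run with made-up file names, to check the planned output paths
planned <- transform_all(c("a/file1.xml", "b/file2.xml"),
                         "flatten.xsl", "out", dry_run = TRUE)
print(planned)
```

Since the output name is derived from `basename`, two input files with the same name in different subfolders would overwrite each other; in that case the output path would need to encode the folder as well.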
