Transforming data from xml into R dataframe


Problem description


I'm trying to convert an XML file to a data frame, but the format seems to be off. I've looked at different tutorials and, while I've been moderately successful at getting the information I need using a for loop and navigating the parsed file, I've been told that this solution is not very efficient.

I tried this code then:

require(XML)
parsed<-xmlParse("SEWL.xml")
xmlToDataFrame(parsed)

But it gives an error: Error in [<-.data.frame(*tmp*, i, names(nodes[[i]]), value = c("\"LL18179\"\"2016/08\"0.32485.43896.59801.2131\"OK\"", : duplicate subscripts for columns

This other code works, but the formatting is not what I need:

require(XML)
require(plyr)
pldf<-ldply(xmlToList("SEWL.xml"),data.frame)

The resulting dataframe is as follows:

          .id              X..i.. text  .attrs test.code test.validuntil test.meas.text test.meas..attrs test.meas.text.1
1  technician              "John" <NA>    <NA>      <NA>            <NA>           <NA>             <NA>             <NA>
2    location                "CO" <NA>    <NA>      <NA>            <NA>           <NA>             <NA>             <NA>
3        temp                <NA> 21.3 celsius      <NA>            <NA>           <NA>             <NA>             <NA>
4     runtype           "routine" <NA>    <NA>      <NA>            <NA>           <NA>             <NA>             <NA>
5      sample                <NA> <NA>    2323 "LL18179"       "2016/08"         0.3248         baseline           5.4389
6      sample                <NA> <NA>    2323 "LL18179"       "2016/08"         0.3248         baseline           5.4389
7      sample                <NA> <NA> 8979237 "AA09453"       "2016/03"         0.0117         baseline           5.6012
8      sample                <NA> <NA> 8979237 "AA09453"       "2016/03"         0.0117         baseline           5.6012
9      .attrs 2015_07_31_11_33_22 <NA>    <NA>      <NA>            <NA>           <NA>             <NA>             <NA>
10     .attrs            20150731 <NA>    <NA>      <NA>            <NA>           <NA>             <NA>             <NA>
11     .attrs              113322 <NA>    <NA>      <NA>            <NA>           <NA>             <NA>             <NA>
   test.meas..attrs.1 test.meas.text.2 test.meas..attrs.2 test.calc test.result test..attrs test.code.1 test.validuntil.1
1                <NA>             <NA>               <NA>      <NA>        <NA>        <NA>        <NA>              <NA>
2                <NA>             <NA>               <NA>      <NA>        <NA>        <NA>        <NA>              <NA>
3                <NA>             <NA>               <NA>      <NA>        <NA>        <NA>        <NA>              <NA>
4                <NA>             <NA>               <NA>      <NA>        <NA>        <NA>        <NA>              <NA>
5                 std           6.5980               data    1.2131        "OK"      laslum "ATR150607"         "2017/05"
6                 std           6.5980               data    1.2131        "OK"           3 "ATR150607"         "2017/05"
7                 std           1.1431               data    0.2041      "FAIL"       absat        <NA>              <NA>
8                 std           1.1431               data    0.2041      "FAIL"           2        <NA>              <NA>
9                <NA>             <NA>               <NA>      <NA>        <NA>        <NA>        <NA>              <NA>
10               <NA>             <NA>               <NA>      <NA>        <NA>        <NA>        <NA>              <NA>
11               <NA>             <NA>               <NA>      <NA>        <NA>        <NA>        <NA>              <NA>
   test.meas.text.3 test.meas..attrs.3 test.meas.text.4 test.meas..attrs.4 test.meas.text.5 test.meas..attrs.5
1              <NA>               <NA>             <NA>               <NA>             <NA>               <NA>
2              <NA>               <NA>             <NA>               <NA>             <NA>               <NA>
3              <NA>               <NA>             <NA>               <NA>             <NA>               <NA>
4              <NA>               <NA>             <NA>               <NA>             <NA>               <NA>
5            0.0673           baseline           4.9721                std          10.3851               data
6            0.0673           baseline           4.9721                std          10.3851               data
7              <NA>               <NA>             <NA>               <NA>             <NA>               <NA>
8              <NA>               <NA>             <NA>               <NA>             <NA>               <NA>
9              <NA>               <NA>             <NA>               <NA>             <NA>               <NA>
10             <NA>               <NA>             <NA>               <NA>             <NA>               <NA>
11             <NA>               <NA>             <NA>               <NA>             <NA>               <NA>
   test.calc.1 test.result.1 test..attrs.1
1         <NA>          <NA>          <NA>
2         <NA>          <NA>          <NA>
3         <NA>          <NA>          <NA>
4         <NA>          <NA>          <NA>
5       2.0886     "Warning"           atr
6       2.0886     "Warning"             1
7         <NA>          <NA>          <NA>
8         <NA>          <NA>          <NA>
9         <NA>          <NA>          <NA>
10        <NA>          <NA>          <NA>
11        <NA>          <NA>          <NA>

This is the example XML file that I'm using:

<?xml version="1.0" encoding="UTF-8"?>
<experiment name="abc123" date="20150731" time="113322">
    <technician>"John"</technician>
    <location>"CO"</location>
    <temp scale="celsius">21.3</temp>
    <runtype>"routine"</runtype>
    <sample id="2323">
        <test name="laslum" order="3">
            <code>"LL18179"</code>
            <validuntil>"2016/08"</validuntil>
            <meas name="baseline">0.3248</meas>
            <meas name="std">5.4389</meas>
            <meas name="data">6.5980</meas>
            <calc>1.2131</calc>
            <result>"OK"</result>
        </test>
        <test name="atr" order="1">
            <code>"ATR150607"</code>
            <validuntil>"2017/05"</validuntil>
            <meas name="baseline">0.0673</meas>
            <meas name="std">4.9721</meas>
            <meas name="data">10.3851</meas>
            <calc>2.0886</calc>
            <result>"Warning"</result>
        </test>
    </sample>
    <sample id="8979237">
        <test name="absat" order="2">
            <code>"AA09453"</code>
            <validuntil>"2016/03"</validuntil>
            <meas name="baseline">0.0117</meas>
            <meas name="std">5.6012</meas>
            <meas name="data">1.1431</meas>
            <calc>0.2041</calc>
            <result>"FAIL"</result>
        </test>
    </sample>
</experiment>

And the dataframe that I'm hoping to get:

  experiment technician location temp runtype  sample   test order      code validuntil baseline    std    data   calc  result     date   time
1     abc123       John       CO 21.3 routine    2323 laslum     3   LL18179    2016/08   0.3248 5.4389  6.5980 1.2131      OK 20150731 113322
2     abc123       John       CO 21.3 routine    2323    atr     1 ATR150607    2017/05   0.0673 4.9721 10.3851 2.0886 Warning 20150731 113322
3     abc123       John       CO 21.3 routine 8979237  absat     2   AA09453    2016/03   0.0117 5.6012  1.1431 0.2041    FAIL 20150731 113322

I don't need the exact same format, just something close enough so I can transform it into the example.

Solution

We provide two approaches to parsing the XML. The first (performing a triple iteration over experiment/sample/test) would likely run faster but the second (using a single loop over the test nodes and at each test node reaching back up through the tree to grab its ancestors) has simpler code.

1) Using Lines from Note 2 at the end, we implement a triple xpathApply/xpathSApply iteration over the experiment, sample and test nodes. e, s and t denote the current experiment, sample and test node, respectively.

library(XML)
doc <- xmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE)

do.call("rbind", xpathApply(doc, "//experiment", function(e) {
  data.frame(
    # experiment-level fields; each is recycled across that experiment's test rows
    experiment = xmlAttrs(e)[["name"]],
    technician = xmlValue(e[["technician"]]),
    location = xmlValue(e[["location"]]),
    temp = xmlValue(e[["temp"]]),
    runtype = xmlValue(e[["runtype"]]),
    # each test yields one column; cbind the samples, then transpose to one row per test
    t(do.call(cbind, xpathApply(e, "sample", function(s) {
      sample <- xmlAttrs(s)[["id"]]
      xpathSApply(s, "test", function(t) {
        c(sample = sample,
          test = xmlAttrs(t)[["name"]],
          order = xmlAttrs(t)[["order"]],
          code = xmlValue(t[["code"]]),
          validuntil = xmlValue(t[["validuntil"]]),
          baseline = xmlValue(t["meas"][[1]]),  # t["meas"] lists all <meas> children
          std = xmlValue(t["meas"][[2]]),
          data = xmlValue(t["meas"][[3]]),
          calc = xmlValue(t[["calc"]]),
          result = xmlValue(t[["result"]]))
      })
    }))),
    date = xmlAttrs(e)[["date"]],
    time = xmlAttrs(e)[["time"]]
  )
}))

giving:

  experiment technician location temp   runtype  sample   test order
1     abc123     "John"     "CO" 21.3 "routine"    2323 laslum     3
2     abc123     "John"     "CO" 21.3 "routine"    2323    atr     1
3     abc123     "John"     "CO" 21.3 "routine" 8979237  absat     2
         code validuntil baseline    std    data   calc    result     date
1   "LL18179"  "2016/08"   0.3248 5.4389  6.5980 1.2131      "OK" 20150731
2 "ATR150607"  "2017/05"   0.0673 4.9721 10.3851 2.0886 "Warning" 20150731
3   "AA09453"  "2016/03"   0.0117 5.6012  1.1431 0.2041    "FAIL" 20150731
    time
1 113322
2 113322
3 113322
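
As a hedged aside on the positional lookups used here: t["meas"][[2]] relies on the meas elements always appearing in the order baseline, std, data. A minimal sketch, assuming each meas carries a unique name attribute as in the sample file, selects by name with a relative XPath instead, and reads attributes with xmlGetAttr so that a missing attribute gives a default rather than an error. The XML fragment and the variable names below are illustrative only, not part of the original answer.

```r
library(XML)

# Illustrative fragment: one test node, deliberately without an "order" attribute
xml_text <- '<sample id="2323">
  <test name="laslum">
    <meas name="baseline">0.3248</meas>
    <meas name="std">5.4389</meas>
    <meas name="data">6.5980</meas>
  </test>
</sample>'

doc <- xmlParse(xml_text, asText = TRUE)
test_node <- getNodeSet(doc, "//test")[[1]]

# Positional lookup, as in the answer: the second <meas> child
std_by_position <- xmlValue(test_node["meas"][[2]])

# Lookup by the name attribute, independent of element order
std_by_name <- xpathSApply(test_node, "meas[@name='std']", xmlValue)

# xmlGetAttr returns the supplied default when the attribute is absent,
# whereas xmlAttrs(node)[["order"]] would stop with a subscript error
order_attr <- xmlGetAttr(test_node, "order", default = NA_character_)
sample_id <- xmlGetAttr(xmlParent(test_node), "id")
```

Whether the extra XPath evaluation per field is worth it depends on how regular the real files are; for files shaped exactly like the sample, the positional version is equivalent.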

2) This is an alternate approach in which we loop only over the test nodes and then reach upward into the parent and grandparent to get the corresponding sample and experiment info.

library(XML)
doc <- xmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE)

do.call("rbind", xpathApply(doc, "//test", function(t) { # t is test node
        s <- xmlParent(t) # s is sample node
        e <- xmlParent(s) # e is experiment node
        data.frame(experiment = xmlAttrs(e)[["name"]],
          technician = xmlValue(e[["technician"]]),
          location = xmlValue(e[["location"]]),
          temp = xmlValue(e[["temp"]]),
          runtype = xmlValue(e[["runtype"]]),
          sample = xmlAttrs(s)[["id"]],
          test = xmlAttrs(t)[["name"]],
          order = xmlAttrs(t)[["order"]],
          code = xmlValue(t[["code"]]),
          validuntil = xmlValue(t[["validuntil"]]),
          baseline = xmlValue(t["meas"][[1]]),
          std = xmlValue(t["meas"][[2]]),
          data = xmlValue(t["meas"][[3]]),
          calc = xmlValue(t[["calc"]]),
          result = xmlValue(t[["result"]]),
          date = xmlAttrs(e)[["date"]],
          time = xmlAttrs(e)[["time"]]
       )
}))

giving:

  experiment technician location temp   runtype  sample   test order
1     abc123     "John"     "CO" 21.3 "routine"    2323 laslum     3
2     abc123     "John"     "CO" 21.3 "routine"    2323    atr     1
3     abc123     "John"     "CO" 21.3 "routine" 8979237  absat     2
         code validuntil baseline    std    data   calc    result     date
1   "LL18179"  "2016/08"   0.3248 5.4389  6.5980 1.2131      "OK" 20150731
2 "ATR150607"  "2017/05"   0.0673 4.9721 10.3851 2.0886 "Warning" 20150731
3   "AA09453"  "2016/03"   0.0117 5.6012  1.1431 0.2041    "FAIL" 20150731
    time
1 113322
2 113322
3 113322
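
Both results still carry the literal double quotes that are part of the text in the XML (for example "John"), and every column is text. A minimal post-processing sketch, assuming the goal is the unquoted, typed table shown in the question: strip the quotes with gsub and let type.convert guess column types. The small data frame res is a hypothetical stand-in with a few values copied from the output above; in practice it would be the result of either approach.

```r
# Hypothetical stand-in for the result of approach 1 or 2 (a few columns only)
res <- data.frame(
  technician = c("\"John\"", "\"John\"", "\"John\""),
  code       = c("\"LL18179\"", "\"ATR150607\"", "\"AA09453\""),
  baseline   = c("0.3248", "0.0673", "0.0117"),
  order      = c("3", "1", "2"),
  stringsAsFactors = FALSE
)

# Remove the literal double quotes from every column
res[] <- lapply(res, function(col) gsub('"', "", as.character(col), fixed = TRUE))

# Convert columns that look numeric; as.is = TRUE keeps text as character
res[] <- lapply(res, type.convert, as.is = TRUE)

str(res)
```

Note that type.convert would also turn columns such as date and time into integers; if those should stay as text, or if codes may have leading zeros, it is probably safer to convert only selected columns.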

Note 1:

As an aside, if you read the input XML file, SEWL.xml, into Excel, it will do a reasonable job of putting it into a tabular format, although some further processing would be needed to get it into precisely the form in the question.

Note 2:

The input Lines as an R object is:

Lines <- '<?xml version="1.0" encoding="UTF-8"?>
<experiment name="abc123" date="20150731" time="113322">
    <technician>"John"</technician>
    <location>"CO"</location>
    <temp scale="celsius">21.3</temp>
    <runtype>"routine"</runtype>
    <sample id="2323">
        <test name="laslum" order="3">
            <code>"LL18179"</code>
            <validuntil>"2016/08"</validuntil>
            <meas name="baseline">0.3248</meas>
            <meas name="std">5.4389</meas>
            <meas name="data">6.5980</meas>
            <calc>1.2131</calc>
            <result>"OK"</result>
        </test>
        <test name="atr" order="1">
            <code>"ATR150607"</code>
            <validuntil>"2017/05"</validuntil>
            <meas name="baseline">0.0673</meas>
            <meas name="std">4.9721</meas>
            <meas name="data">10.3851</meas>
            <calc>2.0886</calc>
            <result>"Warning"</result>
        </test>
    </sample>
    <sample id="8979237">
        <test name="absat" order="2">
            <code>"AA09453"</code>
            <validuntil>"2016/03"</validuntil>
            <meas name="baseline">0.0117</meas>
            <meas name="std">5.6012</meas>
            <meas name="data">1.1431</meas>
            <calc>0.2041</calc>
            <result>"FAIL"</result>
        </test>
    </sample>
</experiment>'
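
The question reads from a file (SEWL.xml) while the code here parses the string Lines. A minimal sketch of how the two relate, assuming a file with the same content: xmlParse is xmlTreeParse with useInternalNodes = TRUE as its default, takes a file path directly, and takes XML text when asText = TRUE. The one-element XML string and the temporary file below are illustrative only.

```r
library(XML)

# Illustrative one-element document; the real input would be Lines or SEWL.xml
xml_text <- '<experiment name="abc123"><technician>"John"</technician></experiment>'

# Write the same content to a temporary file to mimic reading from disk
tmp <- tempfile(fileext = ".xml")
writeLines(xml_text, tmp)

doc_from_text <- xmlParse(xml_text, asText = TRUE)  # parse a string
doc_from_file <- xmlParse(tmp)                      # parse a file path

name_text <- xmlGetAttr(xmlRoot(doc_from_text), "name")
name_file <- xmlGetAttr(xmlRoot(doc_from_file), "name")

unlink(tmp)
```

With the real file, replacing the xmlTreeParse(Lines, ...) line in either approach by doc <- xmlParse("SEWL.xml") should leave the rest of the code unchanged.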
