Import all fields (and subfields) of XML as dataframe

Question

To do some analysis, I want to import an XML file into a data frame using R and the XML package. Example XML file:

<watchers shop_name="TEST" created_at="September 14, 2012 05:44">
<watcher channel="Site Name">
    <code>123456</code>
    <search_key>TestKey</search_key>
    <date>September 14, 2012 04:15</date>
    <result>Found</result>
    <link>http://www.test.com/fakeurl</link>
    <price>100.0</price>
    <shipping>0.0</shipping>
    <origposition>0</origposition>
    <name>Name Test</name>
    <results>
        <result position="1">
            <c_name>CTest1</c_name>
            <c_price>599.49</c_price>
            <c_shipping>0.0</c_shipping>
            <c_total_price>599.49</c_total_price>
            <c_rating>8.3</c_rating>
            <c_delivery/>
        </result><result position="2">
            <c_name>CTest2</c_name>
            <c_price>654.0</c_price>
            <c_shipping>0.0</c_shipping>
            <c_total_price>654.0</c_total_price>
            <c_rating>9.8</c_rating>
            <c_delivery/>
        </result>
        <result position="3">
            <c_name>CTest3</c_name>
            <c_price>654.0</c_price>
            <c_shipping>0.0</c_shipping>
            <c_total_price>654.0</c_total_price>
            <c_rating>8.8</c_rating>
            <c_delivery/>
        </result>
    </results>
</watcher>
</watchers>

I want the rows of the data frame to contain the following fields:

shop_name   created_at  code    search_key  date    result
link    price   shipping    origposition    name    
position    c_name  c_price c_shipping  c_total_price   
c_rating    c_delivery

This means that the child nodes must be taken into account as well, which would result in a data frame of three rows in this example (since the results show 3 positions). The fields

shop_name   created_at  code    search_key
date    result  link    price   shipping    
origposition    name

are the same for each of these rows.

I am able to go through the XML file, but I am unable to get a data frame with the fields I want. When I convert the XML to a data frame, I get the following fields:

"code"       "search_key"      "date"     "result"  
"link" "price"      "shipping"   "origposition"  
"name"    "results"     

Here the fields

shop_name   created_at

are missing at the beginning, and the 'results' are put together in a single string under the column "results".

It must be possible to get the desired data frame, but I do not know exactly how to do this.

Update

The solution provided by @MvG works brilliantly on the test XML file shown above. However, the column 'result' can also have the value "Not found". Entries with this value lack certain fields (always the same fields) and therefore yield a "number of columns of arguments do not match" error when running the solution. I would like these entries to be included in the data frame as well, with the missing fields left empty. I do not understand how to incorporate this scenario.

test.xml

<watchers shop_name="TEST" created_at="September 14, 2012 05:44">
<watcher channel="Site Name">
    <code>123456</code>
    <search_key>TestKey</search_key>
    <date>September 14, 2012 04:15</date>
    <result>Found</result>
    <link>http://www.test.com/fakeurl</link>
    <price>100.0</price>
    <shipping>0.0</shipping>
    <origposition>0</origposition>
    <name>Name Test</name>
    <results>
        <result position="1">
            <c_name>CTest1</c_name>
            <c_price>599.49</c_price>
            <c_shipping>0.0</c_shipping>
            <c_total_price>599.49</c_total_price>
            <c_rating>8.3</c_rating>
            <c_delivery/>
        </result><result position="2">
            <c_name>CTest2</c_name>
            <c_price>654.0</c_price>
            <c_shipping>0.0</c_shipping>
            <c_total_price>654.0</c_total_price>
            <c_rating>9.8</c_rating>
            <c_delivery/>
        </result>
        <result position="3">
            <c_name>CTest3</c_name>
            <c_price>654.0</c_price>
            <c_shipping>0.0</c_shipping>
            <c_total_price>654.0</c_total_price>
            <c_rating>8.8</c_rating>
            <c_delivery/>
        </result>
    </results>
</watcher>
<watcher channel="Shopping">
    <code>12804</code>
    <search_key></search_key>
    <date></date>
    <result>Not found</result>
    <link>https://www.test.com/testing1323p</link>
    <price>0.0</price>
    <shipping>0.0</shipping>
    <origposition>0</origposition>
    <name>MOOVM6002020</name>
    <results>
    </results>
</watcher>
</watchers>


Recommended answer

Here is a more generic approach. Every node is classified as one of three cases:


  • If the node name is of kind rows, the data frames from its child nodes become different rows of the result.
  • If the node name is of kind cols, the data frames from its child nodes become different columns of the result.
  • If the node name is of kind value, a data frame with a single value is constructed, using the node name as the column name and the node value as the column value.
  • In all three cases, the attributes of the node are added to the data frame.
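
The two combining steps can be tried on their own with base R. The following is only a minimal sketch with hand-made one-row data frames; the helper name pad and the sample values are assumptions made for illustration and are not part of the answer's code. It shows how rows with differing columns can be stacked after padding the missing columns with NA, and how merge with by=NULL produces the cross join that repeats parent fields on every row.

```r
# "rows" case: two child data frames whose columns differ
a <- data.frame(c_name = "CTest1", c_price = "599.49", stringsAsFactors = FALSE)
b <- data.frame(c_name = "CTest2", stringsAsFactors = FALSE)

# Hypothetical helper: add any missing columns as NA, then fix the column order
pad <- function(df, cols) {
  missing <- setdiff(cols, colnames(df))
  if (length(missing) > 0) {
    df[missing] <- NA
  }
  df[cols]
}

all_cols <- union(colnames(a), colnames(b))
stacked <- rbind(pad(a, all_cols), pad(b, all_cols))

# "cols" case: merge with by = NULL is a cross join, so the single
# parent row is repeated next to every stacked child row
parent <- data.frame(code = "123456", stringsAsFactors = FALSE)
combined <- merge(parent, stacked, by = NULL)
print(combined)
```

Padding before rbind is what lets an entry without certain fields still end up in the result, with those fields left as NA.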

The call for your application is given towards the bottom.

library(XML)

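# A data frame with one row and no columns. Cross-joining it with another
# data frame (merge with by=NULL) leaves that data frame unchanged.
zeroColSingleRow <- function() {
  res <- data.frame(dummy=NA)
  res$dummy <- NULL
  stopifnot(nrow(res) == 1, ncol(res) == 0)
  return (res)
}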

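# Recursively convert an XML node to a data frame. classifier(node) must
# return one of "rows", "cols" or "value".
xml2df <- function(node, classifier) {
  # Non-element nodes (text, comments, ...) contribute no columns
  if (! inherits(node, c("XMLInternalElementNode", "XMLElementNode"))) {
    return (zeroColSingleRow())
  }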
  kind <- classifier(node)
  if (kind == "rows") {
    cdf <- lapply(xmlChildren(node), xml2df, classifier)
    if (length(cdf) == 0) {
      res <- zeroColSingleRow()
    }
    else {
      names <- unique(unlist(lapply(cdf, colnames)))
      cdf <- lapply(cdf, function(i) {
        missing <- setdiff(names, colnames(i))
        if (length(missing) > 0) {
          i[missing] <- NA
        }
        return (i)
      })
      res <- do.call(rbind, cdf)
    }
  }
  else if (kind == "cols") {
    cdf <- lapply(xmlChildren(node), xml2df, classifier)
    if (length(cdf) == 0) {
      res <- zeroColSingleRow()
    }
    else {
      res <- cdf[[1]]
      if (length(cdf) > 1) {
        for (i in 2:length(cdf)) {
          res <- merge(res, cdf[[i]], by=NULL)
        }
      }
    }
  }
  else {
    stopifnot(kind == "value")
    res <- data.frame(xmlValue(node))
    names(res) <- xmlName(node)
  }
  if (ncol(res) == 0) {
    res <- zeroColSingleRow()
  }
  attr <- xmlAttrs(node)
  if (length(attr) > 0) {
    attr <- do.call(data.frame, as.list(attr))
    res <- merge(attr, res, by=NULL)
  }
  rownames(res) <- NULL
  return(res)
}

doc <- xmlParse("test.xml")

xml2df(xmlRoot(doc), function(node) {
  name <- xmlName(node)
  if (name %in% c("watchers", "results"))
    return("rows")
  # make sure to treat results/result different from watcher/result
  if (name %in% c("watcher", "result") &&
      xmlName(xmlParent(node)) == paste0(name, "s"))
    return("cols")
  return("value")
})
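
The classifier and the attribute step rely on a handful of accessors from the XML package: xmlName, xmlParent and xmlAttrs. As a rough sketch, they can be inspected on a small inline snippet; the snippet below is a trimmed-down, hypothetical version of test.xml, and the variable names are chosen only for this illustration.

```r
library(XML)

# Hypothetical, trimmed-down version of test.xml, parsed from a string
snippet <- '<watchers shop_name="TEST"><watcher channel="Site Name"><code>123456</code></watcher></watchers>'
doc <- xmlParse(snippet, asText = TRUE)
root <- xmlRoot(doc)
watcher <- xmlChildren(root)[[1]]

# What the classifier looks at: the node name and the parent's name
node_name <- xmlName(watcher)
parent_name <- xmlName(xmlParent(watcher))

# What the attribute step does: turn the named attribute vector
# into a one-row data frame
attr_df <- do.call(data.frame,
                   c(as.list(xmlAttrs(root)), stringsAsFactors = FALSE))
print(attr_df)
```

This is also why shop_name and created_at only appear when attributes are handled explicitly: they are attributes of the root element, not child elements.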
