R: How to get parent attributes and node values at the same time?
I have some HTML and R code like the following, and I need to relate each node value to its parent id in a data.frame. Different information is available for each person.
example <- "<div class='person' id='1'>
<div class='phone'>555-5555</div>
<div class='email'>jhon@123.com</div>
</div>
<div class='person' id='2'>
<div class='phone'>123-4567</div>
<div class='email'>maria@gmail.com</div>
</div>
<div class='person' id='3'>
<div class='phone'>987-6543</div>
<div class='age'>32</div>
<div class='city'>New York</div>
</div>"
library(XML)

doc <- htmlTreeParse(example, useInternalNodes = TRUE)
values <- xpathSApply(doc, "//*[@class='person']/div", xmlValue)
variables <- xpathSApply(doc, "//*[@class='person']/div", xmlGetAttr, 'class')
id <- xpathSApply(doc, "//*[@class='person']", xmlGetAttr, 'id')
# The problem: create a data.frame(id, variables, values)
With xpathSApply(), I can get the phone, email, and age values as well as the person attribute (id). However, these pieces of information come back isolated from each other, and I need to assign each one to the right data.frame variable and the right person. My real data contains many different kinds of information, so naming each variable has to be automatic.
My goal is to create a data.frame like this, relating each id to its proper data:
  id variables          values
1  1     phone        555-5555
2  1     email    jhon@123.com
3  2     phone        123-4567
4  2     email maria@gmail.com
5  3     phone        987-6543
6  3       age              32
7  3      city        New York
I believe I would have to write a function to use inside xpathSApply that gets the person's phone and the person's id at the same time, so that they stay related, but I haven't had any success with that so far.
Can anyone help me?
In general, it's not going to be easy:
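# One node per person: every div that carries an id attribute
idNodes <- getNodeSet(doc, "//div[@id]")
ids <- lapply(idNodes, function(x) xmlAttrs(x)['id'])
# For each person node, collect the text and the attributes of its child divs
values <- lapply(idNodes, xpathApply, path = './div[@class]', xmlValue)
attributes <- lapply(idNodes, xpathApply, path = './div[@class]', xmlAttrs)
# Bind id, value, and class per person, then stack the persons.
# SIMPLIFY = FALSE keeps one matrix per person even if all persons have the same number of fields.
do.call(rbind.data.frame, mapply(cbind, ids, values, attributes, SIMPLIFY = FALSE))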
  V1              V2    V3
1  1        555-5555 phone
2  1    jhon@123.com email
3  2        123-4567 phone
4  2 maria@gmail.com email
5  3        987-6543 phone
6  3              32   age
7  3        New York  city
The above will give you attribute and value pairs, assuming they are nested in a div with an associated id.
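Under the same nesting assumption, another way to keep each value tied to its person is to start from the child divs and look up the id on each one's parent. The following is only a minimal sketch, not the answer's method: it relies on xmlParent() from the XML package, and the names fieldNodes and long are made up for illustration.

```r
library(XML)

example <- "<div class='person' id='1'>
<div class='phone'>555-5555</div>
<div class='email'>jhon@123.com</div>
</div>
<div class='person' id='2'>
<div class='phone'>123-4567</div>
<div class='email'>maria@gmail.com</div>
</div>
<div class='person' id='3'>
<div class='phone'>987-6543</div>
<div class='age'>32</div>
<div class='city'>New York</div>
</div>"

doc <- htmlTreeParse(example, useInternalNodes = TRUE, asText = TRUE)

# Every div that sits directly inside a person div (hypothetical name: fieldNodes)
fieldNodes <- getNodeSet(doc, "//div[@class='person']/div")

# For each field node, read the parent's id, the node's own class, and its text
long <- data.frame(
  id        = vapply(fieldNodes, function(n) xmlGetAttr(xmlParent(n), "id"), character(1)),
  variables = vapply(fieldNodes, xmlGetAttr, character(1), "class"),
  values    = vapply(fieldNodes, xmlValue, character(1)),
  stringsAsFactors = FALSE
)
print(long)
```

Because the id is read from the parent of each field node, the three columns cannot drift out of step, and no variable names have to be listed by hand.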
UPDATE: if you want to wrap the getNodeSet-based approach in an xpathApply-type call:
utilFun <- function(x){
  id <- xmlGetAttr(x, 'id')
  # Skip text nodes so that only the element children (the inner divs) are visited
  values <- sapply(xmlChildren(x, omitNodeTypes = "XMLInternalTextNode"), xmlValue)
  attributes <- sapply(xmlChildren(x, omitNodeTypes = "XMLInternalTextNode"), xmlAttrs)
  data.frame(id = id, attributes = attributes, values = values, stringsAsFactors = FALSE)
}
# Build one small data.frame per person, then stack them
res <- xpathApply(doc, '//div[@id]', utilFun)
do.call(rbind, res)
  id attributes          values
1  1      phone        555-5555
2  1      email    jhon@123.com
3  2      phone        123-4567
4  2      email maria@gmail.com
5  3      phone        987-6543
6  3        age              32
7  3       city        New York
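The three isolated vectors from the question can also be lined up directly, by counting how many child divs each person has and repeating the id that many times. This is a rough sketch under the assumption that both XPath queries return nodes in document order; the names persons, counts, and long are hypothetical.

```r
library(XML)

example <- "<div class='person' id='1'>
<div class='phone'>555-5555</div>
<div class='email'>jhon@123.com</div>
</div>
<div class='person' id='2'>
<div class='phone'>123-4567</div>
<div class='email'>maria@gmail.com</div>
</div>
<div class='person' id='3'>
<div class='phone'>987-6543</div>
<div class='age'>32</div>
<div class='city'>New York</div>
</div>"

doc <- htmlTreeParse(example, useInternalNodes = TRUE, asText = TRUE)

# One node per person, plus the number of child divs each person holds
persons <- getNodeSet(doc, "//*[@class='person']")
id      <- vapply(persons, xmlGetAttr, character(1), "id")
counts  <- vapply(persons, function(p) length(getNodeSet(p, "./div")), integer(1))

# The same flat vectors as in the question
values    <- xpathSApply(doc, "//*[@class='person']/div", xmlValue)
variables <- xpathSApply(doc, "//*[@class='person']/div", xmlGetAttr, "class")

# Repeat each id once per child so that all three columns have equal length
long <- data.frame(id = rep(id, times = counts),
                   variables = variables,
                   values = values,
                   stringsAsFactors = FALSE)
print(long)
```

This keeps the question's original queries intact, at the cost of depending on the ordering assumption stated above.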