Equivalent of which in scraping?

Problem description

I'm trying to run some scraping where the action I take on a node is conditional on the contents of the node.

This should be a minimal example:

library(rvest)

XML =
'<td class="id-tag">
    <span title="Really Long Text">Really L...</span>
</td>
<td class="id-tag">Short</td>'

page = read_html(XML)

Basically, I want to extract html_attr(x, "title") if a <span> exists, otherwise just get html_text(x).

Code to do the first is:

page %>% html_nodes(xpath = '//td[@class="id-tag"]/span') %>% html_attr("title")
# [1] "Really Long Text"

Code to do the second is:

page %>% html_nodes(xpath = '//td[@class="id-tag"]') %>% html_text
# [1] "\n    Really L...\n" "Short"  

The real problem is that the html_attr approach doesn't give me an NA or anything similar for the nodes that don't match (even if I first set the xpath to just '//td[@class="id-tag"]' to be sure I've narrowed down to only the relevant nodes). This destroys the order: I can't tell automatically whether the original structure had "Really Long Text" at the first or the second node.

(I thought of doing a join, but the mapping between the abbreviated text and the full text is not one-to-one/invertible.)

This (an if/else construction within the xpath) seems to be on the right path, but doesn't work.

Ideally I'd get the output:

# [1] "Really Long Text" "Short" 

Recommended answer

Based on "R Conditional evaluation when using the pipe operator %>%", you can do something like:

page %>% 
   html_nodes(xpath='//td[@class="id-tag"]') %>% 
   {ifelse(is.na(html_node(.,xpath="span")), 
           html_text(.),
           {html_node(.,xpath="span") %>% html_attr("title")}
   )}

I think it is possibly simpler to discard the pipe and save some of the objects created along the way:

nodes <- html_nodes(page, xpath = '//td[@class="id-tag"]')
text  <- html_text(nodes)
title <- html_attr(html_node(nodes, xpath = "span"), "title")
value <- ifelse(is.na(html_node(nodes, xpath = "span")), text, title)
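
As a variation on the step-by-step version above, here is a minimal self-contained sketch that tests the extracted title for NA instead of testing the node itself. It relies on html_node() returning one result per input node, so the title vector stays aligned with the cells. The sample markup (wrapped in a table) and the variable names are assumptions for illustration only, not part of the original question.

```r
library(rvest)

# Hypothetical sample markup, wrapped in a table so the <td> cells parse cleanly.
html <- '<table><tr>
<td class="id-tag"><span title="Really Long Text">Really L...</span></td>
<td class="id-tag">Short</td>
</tr></table>'

page  <- read_html(html)
nodes <- html_nodes(page, xpath = '//td[@class="id-tag"]')

# html_node() returns one result per input node, so `title` lines up with `nodes`;
# cells without a <span> (or whose <span> has no title attribute) yield NA here.
title <- html_attr(html_node(nodes, xpath = "span"), "title")

# Fall back to the cell text wherever no title was found.
value <- ifelse(is.na(title), html_text(nodes), title)
value
```

One difference from the code above: a <span> that exists but has no title attribute also falls back to the cell text here, which may or may not be what you want.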

An xpath-only approach might be:

page %>% 
 html_nodes(xpath='//td[@class="id-tag"]/span/@title|//td[@class="id-tag"][not(.//span)]') %>%
 html_text()
