R and xpathApply -- removing duplicates from nested html tags

Problem description

I have edited the question for brevity and clarity.

My goal is to find an XPath expression that results in "test1" ... "test8" being listed separately.

I am working with xpathApply to extract text from web pages. Due to the layout of various different pages that information will be pulled from, I need to extract the XML values from all <font> and <p> html tags. The problem I run into is when one type is nested within the other, resulting in partial duplicates when I use the following xpathApply expression with an or condition.

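library(XML)

# Sample page covering every nesting combination of <p> and <font>
html <-
  '<!DOCTYPE html>
  <html lang="en">
    <body>
      <p>test1</p>
      <font>test2</font>
      <p><font>test3</font></p>
      <font><p>test4</p></font>
      <p>test5<font>test6</font></p>
      <font>test7<p>test8</p></font>
    </body>
  </html>'

# Parse the string into an internal document so XPath queries can run on it
work <- htmlTreeParse(html, useInternalNodes = TRUE, asText = TRUE, encoding = "UTF-8")

# Select every <p> and <font> element and take its full text value
table <- xpathApply(work, "//p|//font", xmlValue)
table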

It should be easy to see the type of issue that comes with the nesting--because sometimes <font> and <p> tags are nested, and sometimes they aren't, I can't ignore them but searching for both gives me partial dupes. For other reasons, I prefer the text pieces to be broken up rather than aggregated (that is, taken from the lowest level/furthest nested tag).
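
To make the source of the partial duplicates concrete, here is a minimal sketch on a hypothetical one-element snippet (the `snippet`, `doc`, and `vals` names are illustrative assumptions, not part of the original question). It relies on the fact that `xmlValue` returns the concatenated text of all descendants of the matched element, so an outer tag repeats the text of any tag nested inside it.

```r
library(XML)

# Hypothetical one-element document, used only for this illustration
snippet <- "<html><body><p>test5<font>test6</font></p></body></html>"
doc <- htmlParse(snippet, asText = TRUE)

# xmlValue() concatenates every descendant text node of the matched element,
# so the outer <p> reports its own text plus the text of the nested <font>,
# while the <font> match reports the nested text a second time.
vals <- xpathSApply(doc, "//p|//font", xmlValue)
print(vals)
```

Selecting text nodes rather than elements avoids this overlap, because each text node belongs to exactly one parent.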

The reason I am not just doing two separate searches and then appending them after removing duplicate strings is that I need to preserve the ordering of text as it appears in the html.

Thanks for reading!

Recommended answer

It looks like this might work:

xpathSApply(work, "//body//node()[//p|//font]//text()", xmlValue)
# [1] "test1" "test2" "test3" "test4" "test5" "test6" "test7" "test8"

Just switch to xpathApply for the list result. We could also use getNodeSet

getNodeSet(work, "//body//node()[//p|//font]//text()", fun = xmlValue)
# [[1]]
# [1] "test1"
# 
# [[2]]
# [1] "test2"
# 
# [[3]]
# [1] "test3"
# 
# [[4]]
# [1] "test4"
# 
# [[5]]
# [1] "test5"
# 
# [[6]]
# [1] "test6"
# 
# [[7]]
# [1] "test7"
# 
# [[8]]
# [1] "test8"
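
One caveat worth noting: in XPath, the predicate `[//p|//font]` is an absolute path, so it is true for every node as long as the document contains any `<p>` or `<font>` at all. The expression above therefore returns text from any element under `<body>`, not only from those two tags. The following is a minimal sketch of a stricter variant, offered as an alternative rather than as the answer's method; the extra `<div>` line and the names `doc`, `loose`, and `strict` are assumptions added for illustration.

```r
library(XML)

# Same sample as in the question, plus a hypothetical <div> that should be skipped
html <- '<html><body>
  <p>test1</p>
  <font>test2</font>
  <p><font>test3</font></p>
  <font><p>test4</p></font>
  <p>test5<font>test6</font></p>
  <font>test7<p>test8</p></font>
  <div>not wanted</div>
</body></html>'
doc <- htmlParse(html, asText = TRUE)

# Expression from the answer: the predicate does not filter by tag name,
# so the text inside <div> is returned as well.
loose <- xpathSApply(doc, "//body//node()[//p|//font]//text()", xmlValue)

# Stricter variant: keep only text nodes whose direct parent is <p> or <font>.
# Each text node has exactly one parent, so nothing is reported twice,
# and a single path expression is returned in document order.
strict <- xpathSApply(doc, "//body//text()[parent::p or parent::font]", xmlValue)

print(loose)
print(strict)
```

On pages where only `<p>` and `<font>` carry text, both expressions give the same result; the stricter form only matters when other elements also contain text.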
