R Web Scraping: Getting Extra Text When Using rvest


Problem Description

I'm trying to scrape sold dates from eBay using R and rvest.

The URL, stored in the variable url, is:

https://www.ebay.com/sch/Star%20Wars%20%20BARC%20Speeder%20Bike%20Trooper%20Buzz%20-2009%20-Red%20-Obi-wan%20-Kenobi%20-Jesse%20-halmark%20-Funko%20-Pop%20-Black%20-snaptite%20-model%20-30th%20-Saga%20-Lego%20-McDonalds%20-McDonald%27s%20-Topps%20-Heroes%20-Playskool%20-Transformers%20-Titanium%20-Die-Cast%20-2003%20-2004%20-2005%20-2006%20-2007%20-2008%20-2012%20-2013%20%28Clone%20Wars%29&LH_Sold=1&LH_ItemCondition=3&_dmd=7&_ipg=200&LH_Complete=1&LH_PrefLoc=1

The full xpath to the first item sold date is: //*[@id="srp-river-results"]/ul/li[1]/div/div[2]/div[2]/div/span/span[1]

If I select that path and apply html_text(), I get nothing: character(0).

When I drop the trailing spans from the path and select the node with class POSITIVE instead, I get the date, but with a lot of extra characters mixed in.

R code:

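    library(rvest)

    # url holds the eBay search URL shown above
    readHTML <- url %>%
        read_html()

    # Select the first listing's container, then the node whose class is POSITIVE
    SoldDate <- readHTML %>%
        html_nodes(xpath = '//*[@id="srp-river-results"]/ul/li[1]/div/div[2]/div[2]/div') %>%
        html_nodes("[class='POSITIVE']") %>%
        html_text(trim = TRUE)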

Result:

"SoYlPd N Feb 316,Z RM9USI2021"

I should get:

"Feb 16, 2021"

Solution

Two answers covering this issue in more detail are available under the related question: Rvest Split Data by Class Name where the class names change
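As a rough illustration of the idea suggested by that question's title, here is a minimal sketch that splits the text of the POSITIVE node by the class of its child spans. It is not taken from the linked answers. The markup is invented for the example (the class names s-a1 and s-b2 are hypothetical, and the real page may be structured differently), and picking the group that starts with "Sold" is only an assumed heuristic.

```r
library(rvest)

# Invented markup: wanted and unwanted pieces are interleaved and differ
# only by class name. "s-a1" and "s-b2" are hypothetical names.
page <- read_html(paste0(
  '<div><span class="POSITIVE">',
  '<span class="s-a1">So</span><span class="s-b2">Y</span>',
  '<span class="s-a1">ld </span><span class="s-b2">N</span>',
  '<span class="s-a1">Feb </span><span class="s-b2">3</span>',
  '<span class="s-a1">16, </span><span class="s-b2">Z</span>',
  '<span class="s-a1">2021</span>',
  '</span></div>'
))

# Direct child spans of the POSITIVE node, in document order
pieces <- page %>%
  html_nodes("span.POSITIVE > span")

# Concatenate the text of the pieces separately for each class name
by_class <- tapply(html_text(pieces), html_attr(pieces, "class"),
                   paste, collapse = "")

# Assumed heuristic: the wanted text is the group that starts with "Sold"
visible <- by_class[grepl("^Sold", by_class)]
sold_date <- trimws(sub("^Sold", "", visible[[1]]))
sold_date
```

Grouping by class avoids hard-coding a class name, which matters if the names change between page loads. Calling html_text() directly on the POSITIVE node would instead return all the pieces run together.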
