Page item not scrape-able with rvest


Problem Description


I am getting into web scraping with R and have recently been doing some exercises. I am currently playing around with local eBay listings, where I was able to scrape the text info about an individual listing. However, I have tried different options to also scrape the number of views of the listing, and nothing gives me the number shown on the page.

The page link is this:

https://www.ebay-kleinanzeigen.de/s-anzeige/zahnpflege-fuer-hunde-und-katzen-extra-stark-gegen-mundgeruch/1281544930-313-3170

The page-view number is shown at the bottom right of the image (currently 00044 views).

I was able to retrieve the text with this code:

library(rvest)     # read_html(), html_nodes(), html_text()
library(magrittr)  # the %>% pipe

# Parse the listing page and extract the description text
pageURL <- read_html("https://www.ebay-kleinanzeigen.de/s-anzeige/zahnpflege-fuer-hunde-und-katzen-extra-stark-gegen-mundgeruch/1281544930-313-3170")
input <- pageURL %>%
  html_nodes(xpath="/html/body/div[1]/div[2]/div/section[1]/section/section/article/section[1]/section/dl") %>%
  html_text() 
write.csv2(input, "example_listing.csv")

Any help is much appreciated, as I don't see anything different about the views node. I tried both the XPath and the full XPath with no results.

Solution

The problem is that the text in the element you are trying to scrape does not exist in the html you are parsing. You can check this by doing the following:

library(magrittr)
library(httr)

# Build the listing url (sep = "" joins the pieces without spaces)
url <- paste("https://www.ebay-kleinanzeigen.de/s-anzeige/",
             "zahnpflege-fuer-hunde-und-katzen-extra-stark",
             "-gegen-mundgeruch/1281544930-313-3170", sep = "")

# Download the raw html as a single string
page <- url %>% GET %>% content("text")

# The character offsets below are specific to this particular download
substr(page, 72144, 72177)
#>[1] "<span id=\"viewad-cntr-num\"></span>"

Yet if you look at this item in the developer tools in Chrome or Firefox, you can see there should be a number in here:

<span id="viewad-cntr-num">00047</span>

What happens is that when you are using a web browser, the page that you request contains javascript, which the browser automatically runs. In this case, it sends further requests to the server to download extra information and this is inserted on the page.

However, when you are using rvest or similar tools, the original html page is downloaded but the javascript is not run. Therefore, the subsequent requests are not made, and the empty field is not available to be scraped.
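You can see the same thing from rvest's side: the span is present in the downloaded html, but its text is empty. A quick check, assuming the span id seen in the developer tools:

library(rvest)

# The node exists, but the count that javascript would fill in is missing
read_html(url) %>%
  html_node("#viewad-cntr-num") %>%
  html_text()
#>[1] ""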

In this case, it is quite easy to find the link that downloads the number of page views, since that link is actually on the html page you downloaded:

# The page's javascript fetches the counter from a separate endpoint;
# the endpoint url is embedded in the page source, so we can cut it out
url2 <- strsplit(strsplit(page, "viewAdCounterUrl: '")[[1]][2], "'")[[1]][1]
url2
#> [1] "https://www.ebay-kleinanzeigen.de/s-vac-inc-get.json?adId=1281544930&userId=50592093"

# Request that endpoint ourselves
page_views <- url2 %>% GET %>% content("text")
page_views
#> [1] "{\"numVisits\":52,\"numVisitsStr\":\"00052\"}"

You can see that the server has returned a short JSON that contains the content you were looking for. You can manually do what javascript does and reinsert the information back into the page like this:

# Pull the zero-padded count out of the JSON string...
page_views <- strsplit(strsplit(page_views, "\":\"")[[1]][2], "\"")[[1]][1]
# ...and insert it into the empty span, just as the javascript would
tag <- "<span id=\"viewad-cntr-num\">"
page <- sub(tag, paste0(tag, page_views), page)
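As an aside, instead of splitting strings you could parse the JSON properly, for example with jsonlite. This is only a sketch and assumes the jsonlite package is installed; it yields the same padded string:

library(jsonlite)

# Parse the counter endpoint's JSON response and take the padded string
fromJSON(url2 %>% GET %>% content("text"))$numVisitsStr
#> [1] "00052"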

Now you can do this:

library(rvest)   # read_html(), html_nodes(), html_text()

input <- page %>% 
  read_html %>%
  html_nodes(xpath="//section[@class=\"l-container\"]") %>%
  html_text() %>% extract(1)   # extract() (from magrittr) keeps the first match

And you will have the text you were looking for, including the number of page views.
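Putting the steps together, the whole flow could be wrapped in a small helper. This is only a sketch: the function name is made up, and it assumes the page still embeds the viewAdCounterUrl marker and the empty viewad-cntr-num span, both of which may change:

library(httr)
library(rvest)
library(magrittr)

scrape_listing_with_views <- function(listing_url) {
  # Download the raw html (no javascript is run)
  page <- listing_url %>% GET %>% content("text")

  # Cut the counter endpoint out of the embedded javascript
  counter_url <- strsplit(strsplit(page, "viewAdCounterUrl: '")[[1]][2], "'")[[1]][1]

  # Fetch the JSON with the view count and extract the zero-padded string
  views_json <- counter_url %>% GET %>% content("text")
  views <- strsplit(strsplit(views_json, "\":\"")[[1]][2], "\"")[[1]][1]

  # Re-insert the count where the browser's javascript would have put it
  tag <- "<span id=\"viewad-cntr-num\">"
  page <- sub(tag, paste0(tag, views), page)

  # Parse the patched html and return the listing text, views included
  page %>%
    read_html %>%
    html_nodes(xpath = "//section[@class=\"l-container\"]") %>%
    html_text() %>%
    extract(1)
}

scrape_listing_with_views("https://www.ebay-kleinanzeigen.de/s-anzeige/zahnpflege-fuer-hunde-und-katzen-extra-stark-gegen-mundgeruch/1281544930-313-3170")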
