R: rvest - got hidden text I don't want


Question

I'm scraping this web page:

http://www.falabella.com.pe/falabella-pe/category/cat40536/Climatizacion?navAction=push

I only need three pieces of information about each product: "brand", "name of product" and "price".

I can get that, but I also get information from a banner that shows similar products viewed by other users. I don't need that.

But when I look at the source code of the page, I can't see those products. I think they are being pulled in through JavaScript or something similar.

QUESTION 1: How can I block this information when scraping? It adds products that I don't need, but I can't see this part in the page source.

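QUESTION 2: When extracting the price with ".precio1", I get this as the first element: "\n\t\t\t\tSubtotal InternetS/.0", which I also can't see in the page source. How can I avoid scraping it?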

library(RSelenium)
library(rvest)

# Start RSelenium
checkForServer()
startServer()
remDr <- remoteDriver()
remDr$open()

# Navigate to the page
remDr$navigate("http://www.falabella.com.pe/falabella-pe/category/cat40536/Climatizacion?navAction=push")

# Get the rendered page source and parse it once
# (read_html() replaces the deprecated html())
page_source <- remDr$getPageSource()
rendered <- read_html(page_source[[1]])

# Brands
Climatizacion_marcas1 <- rendered %>%
        html_nodes(".marca") %>%
        html_nodes("a") %>%
        html_attr("title")

# Product names
Climatizacion_producto1 <- rendered %>%
        html_nodes(".detalle") %>%
        html_nodes("a") %>%
        html_attr("title")

# Prices
Climatizacion_precio1 <- rendered %>%
        html_nodes(".precio1") %>%
        html_text()

Recommended answer

Staying close to your approach, this will do it:

library(rvest)
u <- "http://www.falabella.com.pe/falabella-pe/category/cat40536/Climatizacion?navAction=push"

# Read the static HTML directly; read_html() replaces the deprecated html()
doc <- read_html(u)

# Brands
Climatizacion_marcas1 <- doc %>% 
  html_nodes(".marca") %>%
  html_nodes("a") %>%
  html_attr("title")

# Product names
Climatizacion_producto1 <- doc %>% 
  html_nodes(".detalle") %>%
  html_nodes("a") %>%
  html_attr("title")
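If the page is rendered through RSelenium and the banner ends up in the DOM, one option is to scope the selector to the container that holds the main listing. The sketch below runs on an inline HTML fragment, not the real page: the ids `listado` and `recomendados` are made up for illustration, so the actual container would need to be looked up with the browser's inspector.

```r
library(rvest)

# Hypothetical markup: a main product list plus a recommendations banner.
html_txt <- '<div id="listado">
  <div class="marca"><a title="Imaco">Imaco</a></div>
  <div class="marca"><a title="Taurus">Taurus</a></div>
</div>
<div id="recomendados">
  <div class="marca"><a title="OtraMarca">OtraMarca</a></div>
</div>'

page <- read_html(html_txt)

# Unscoped selector: also picks up the banner entry.
all_brands <- page %>%
  html_nodes(".marca a") %>%
  html_attr("title")

# Scoped selector: only links inside the (assumed) main listing container.
main_brands <- page %>%
  html_nodes("#listado .marca a") %>%
  html_attr("title")
```

Here `all_brands` holds three entries while `main_brands` keeps only the two from the main listing.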

The "\n\t\t" etc. comes from parsing the HTML: the text evidently contains newlines and tabs. A simple solution is:

Climatizacion_precio1 <- doc %>% 
  html_node(".precio1") %>%
  html_text() %>% 
  stringr::str_extract_all("[:number:]{1,4}[.][:number:]{1,2}", simplify = TRUE) %>% 
  as.numeric

Climatizacion_precio1
[1] 44.9

This picks the number out of the string (and so also removes the "S/."). If you want to keep the "S/.", you can do the following:

Climatizacion_precio1 <- doc %>% 
  html_node(".precio1") %>%
  html_text() %>% 
  gsub('[\r\n\t]', '', .)

Climatizacion_precio1
[1] "S/. 44.90"
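When every `.precio1` node is selected with `html_nodes()` instead of only the first one, the same cleanup can be applied to the whole character vector, and the stray "Subtotal" entry from the question can be filtered out by its text. A minimal sketch in base R: the first sample string is the one quoted in the question, the other two are assumed shapes rather than values scraped live.

```r
# Sample input: first element quoted in the question, the rest are assumptions.
raw_prices <- c("\n\t\t\t\tSubtotal InternetS/.0",
                "\n\t\tS/. 44.90\n\t",
                "\n\t\tS/. 69\n\t")

# Strip carriage returns, newlines and tabs, then trim surrounding spaces.
clean_prices <- trimws(gsub("[\r\n\t]", "", raw_prices))

# Drop entries that are not product prices (here: the "Subtotal" line).
clean_prices <- clean_prices[!grepl("Subtotal", clean_prices, fixed = TRUE)]
```

Filtering on the literal word "Subtotal" is a pragmatic choice for this one page; a more specific CSS selector would be more robust if the markup allows it.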

EDIT: Here is an alternative approach, using XML and selectr. It gets the info for all of the items on the page in one go.

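library(XML)

# xmlApply() needs a document parsed by the XML package; with rvest >= 0.3
# `doc` is an xml2 object, so parse the same URL again with htmlParse().
doc_xml <- htmlParse(u)

# Remove carriage returns, tabs and newlines from a string
clean_up <- function(x) {
  stringr::str_replace_all(x, "[\r\t\n]", "")
}

product <- selectr::querySelectorAll(doc_xml, ".marca") %>% 
  xmlApply(xmlValue) %>% lapply(clean_up) %>% unlist

details <- selectr::querySelectorAll(doc_xml, ".detalle a") %>% 
  xmlApply(xmlValue) %>% 
  unlist

price <- selectr::querySelectorAll(doc_xml, ".precio1") %>% 
  xmlApply(xmlValue) %>% lapply(clean_up) %>% unlist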

as.data.frame(cbind(product, details, price))
      product                  details      price
1       Imaco  Termoventilador NF15...  S/. 44.90
2       Imaco  Ventilador de 10"  I...     S/. 69
3       Imaco  Ventilador Imaco de ...     S/. 89
4      Taurus  Recirculador TRA-30 ...     S/. 89
5       Imaco  Termoventilador ITC-...    S/. 109
6        Sole Termo Ventilador Elé...     S/. 99
7      Taurus  Ventilador TVP-40 3 ...     S/. 99
8       Imaco  Estufa OFR7AO 1.500 ...    S/. 129
9      Alfano  Ventilador Recircula...    S/. 139
10     Taurus  Ventilador TVC-40RC ...    S/. 139
11      Imaco  Ventilador Pedestal ...    S/. 149
12     Alfano  Ventilador Orbital 1...    S/. 149
13 Electrolux  Ventilador  de Mesa ... S/. 149.90
14     Alfano  Estufa Termoradiador...    S/. 159
15     Alfano  Ventilador Pared 18"...    S/. 169
16      Imaco     Termoradiador OFR9AO    S/. 179

You will probably want to do some initial cleaning of the results.
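As a rough illustration of what that cleaning could look like, the sketch below takes three rows copied from the output above, trims the stray whitespace in the text columns and turns the price into a number. It uses base R only; the data frame name `items` is made up for the example.

```r
# Three rows copied from the printed result above.
items <- data.frame(
  product = c("Imaco", "Taurus", "Electrolux"),
  details = c(" Termoventilador NF15...",
              " Recirculador TRA-30 ...",
              " Ventilador  de Mesa ..."),
  price   = c("S/. 44.90", "S/. 89", "S/. 149.90"),
  stringsAsFactors = FALSE
)

# Trim leading/trailing whitespace and collapse repeated spaces.
items$details <- gsub(" +", " ", trimws(items$details))

# Remove the "S/." currency prefix and convert the rest to numeric.
items$price <- as.numeric(sub("^S/\\.[[:space:]]*", "", items$price))
```

With a numeric `price` column the table can be sorted or filtered directly, for example `items[items$price < 100, ]`.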
