Web scraping the make/model/year of VIN numbers in RStudio

Question

I am currently working on a project where I need to find the manufacturer, model, and year for a set of VINs. I have a list of 300 different VINs. Going through each VIN and manually entering the manufacturer, model, and year into Excel is inefficient and tedious.

I tried using the rvest package with SelectorGadget to write a few lines of R code that scrape this site for the information, but I was not successful: http://www.vindecoder.net/?vin=1G2HX54K724118697&submit=Decode

Here is my code:

library(rvest)

Vnum <- "1G2HX54K724118697"
site <- paste0("http://www.vindecoder.net/?vin=", Vnum, "&submit=Decode")
htmlpage <- read_html(site)  # html() in older rvest releases
VINhtml <- html_nodes(htmlpage, ".odd:nth-child(6) , .even:nth-child(5) , .even:nth-child(7)")
VIN <- html_text(VINhtml)    # originally forecasthtml, which is never defined
paste(VIN, collapse = " ")   # originally forecast, which is never defined

When I print VINhtml, I get an empty node set instead of the fields I expected: list() attr(,"class") [1] "XMLNodeSet"

I do not know what I am doing wrong. I think it is not working because the site is a dynamic web page, but I could be wrong. Does anyone have suggestions on the best way to approach this problem?

I am also open to using other websites or alternative approaches. I just want to find the model, manufacturer, and model year of these VINs. Can anyone help me find an efficient way of doing this?

Here are some sample VINs: YV4SZ592561226129 YV4SZ592371288470 YV4SZ592371257784 YV4CZ982871331598 YV4CZ982581428985 YV4CZ982481423003 YV4CZ982381423543 YV4CZ982171380593 YV4CZ982081460887 YV4CZ852361288222 YV4CZ852281454409 YV4CZ852281454409 YV4CZ852281454409 YV4CZ592861304665 YV4CZ592861267682 YV4CZ592561266859

Recommended answer

Here is a solution using RSelenium and rvest.

To run RSelenium, you first have to download the Selenium server (mine is version 2.45). Let's say the downloaded file is in the My Documents directory. Then run the following two steps in cmd before running RSelenium in your IDE:
a) cd My Documents   # the folder where I saved the Selenium server
b) java -jar selenium-server-standalone-2.45.0.jar

library(RSelenium)
library(rvest)

startServer()  # helper from older RSelenium releases; may be unnecessary if the server was already started from cmd
remDr <- remoteDriver(browserName = "firefox")
remDr$open()

Vnum <- c("YV4SZ592371288470", "1G2HX54K724118697", "YV4SZ592371288470")

kk <- lapply(Vnum, function(j) {
  remDr$navigate(paste0("http://www.vindecoder.net/?vin=", j, "&submit=Decode"))
  Sys.sleep(30)  # this is critical (presumably so the dynamic content has time to load)
  # getPageSource() is RSelenium; after this we can use rvest functions until we close the session
  test.html <- read_html(remDr$getPageSource()[[1]])  # html() in older rvest releases
  test.html %>%
    html_nodes(".odd:nth-child(6) , .even:nth-child(5) , .even:nth-child(7)") %>%
    html_text()
})
kk
[[1]]
[1] "Model: XC70"                          "Type: Multipurpose Passenger Vehicle" "Make: Volvo"

[[2]]
[1] "Model: Bonneville"            "Make (Manufacturer): Pontiac" "Model year: 2002"

[[3]]
[1] "Model: XC70"                          "Type: Multipurpose Passenger Vehicle" "Make: Volvo"

remDr$close()
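
Since the goal in the question is to get these values into a spreadsheet, the list of "Label: value" strings can be reshaped into a table. The following is only a minimal sketch in base R: parse_fields and to_long are hypothetical helper names, and the example data simply repeats two of the results printed above. It uses a long layout (one row per VIN and field) because the labels differ between VINs.

```r
# Hypothetical helper: split "Label: value" strings into a named character vector.
parse_fields <- function(x) {
  x <- x[grepl(": ", x, fixed = TRUE)]   # keep only strings that contain a label
  labels <- sub(": .*$", "", x)          # text before the first ": "
  values <- sub("^[^:]*: ", "", x)       # text after the first ": "
  setNames(values, labels)
}

# Hypothetical helper: one row per (VIN, field) pair.
to_long <- function(vins, results) {
  rows <- Map(function(vin, res) {
    f <- parse_fields(res)
    data.frame(vin = rep(vin, length(f)),
               field = names(f),
               value = unname(f),
               stringsAsFactors = FALSE)
  }, vins, results)
  do.call(rbind, c(unname(rows), make.row.names = FALSE))
}

# Example input copied from the output shown above.
vins <- c("YV4SZ592371288470", "1G2HX54K724118697")
kk_example <- list(
  c("Model: XC70", "Type: Multipurpose Passenger Vehicle", "Make: Volvo"),
  c("Model: Bonneville", "Make (Manufacturer): Pontiac", "Model year: 2002")
)

vin_table <- to_long(vins, kk_example)
print(vin_table)

# To open the result in Excel, it could be written out as CSV, for example:
# write.csv(vin_table, "vin_fields.csv", row.names = FALSE)
```
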

P.S. As you can see, the same CSS path does not work for every VIN: the Volvo results contain no model year, while the Pontiac result does. You have to work out the right selector in advance (I just used the path you provided in the question). You could wrap the lookup in some sort of tryCatch.
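
One way the tryCatch idea might look is sketched below. This is an illustration under assumptions, not the answer's actual code: safe_scrape is a hypothetical wrapper, scrape_one stands for whatever function performs the per-VIN lookup with a given CSS selector, and fake_scrape is a stand-in so the sketch runs without a browser session. The wrapper tries each selector in turn, treats both errors and empty results as failures, and returns NA when nothing works.

```r
# Hypothetical wrapper: try several selectors, survive errors and empty results.
safe_scrape <- function(vin, scrape_one, selectors) {
  for (sel in selectors) {
    res <- tryCatch(
      scrape_one(vin, sel),
      error = function(e) character(0)   # treat an error like "nothing found"
    )
    if (length(res) > 0) return(res)     # first selector that yields text wins
  }
  NA_character_                          # no selector worked for this VIN
}

# Stand-in for the real lookup (assumption: the real one would call
# remDr$navigate(), read_html(), html_nodes(), and html_text()).
fake_scrape <- function(vin, selector) {
  if (selector == ".bad") stop("navigation failed")
  if (selector == ".empty") return(character(0))
  "Make: Volvo"
}

found <- safe_scrape("YV4SZ592371288470", fake_scrape, c(".bad", ".empty", ".ok"))
missing <- safe_scrape("YV4SZ592371288470", fake_scrape, c(".bad", ".empty"))
print(found)
print(missing)
```

Returning NA instead of stopping keeps a long lapply run over hundreds of VINs from aborting on a single failure, and the NA entries can be retried later with a different selector.
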
