Web scraping with R over real estate ads


Question


As an intern in an economic research team, I was given the task to find a way to automatically collect specific data on a real estate ad website, using R.

I assume that the relevant packages are XML and RCurl, but my understanding of how they work is very limited.


Here is the main page of the website: http://www.leboncoin.fr/ventes_immobilieres/offres/nord_pas_de_calais/?f=a&th=1&zz=59000 Ideally, I'd like to construct my database so that each row corresponds to an ad.

Here is the detail of an ad: http://www.leboncoin.fr/ventes_immobilieres/197284216.htm?ca=17_s My variables are: the price ("Prix"), the city ("Ville"), the surface ("Surface"), the "GES", the "Classe énergie" and the number of rooms ("Pièces"), as well as the number of pictures shown in the ad. I would also like to export the text as a character vector, over which I will perform a text-mining analysis later on.

I'm looking for any help, link to a tutorial, or how-to that would give me a lead on the path to follow.

Answer


You can use the XML package in R to scrape this data. Here is a piece of code that should help.

# DEFINE UTILITY FUNCTIONS

# Function to Get Links to Ads by Page
get_ad_links = function(page){
  require(XML)
  # construct url to page
  url_base = "http://www.leboncoin.fr/ventes_immobilieres/offres/nord_pas_de_calais/"
  url      = paste(url_base, "?o=", page, "&zz=", 59000, sep = "")
  doc      = htmlTreeParse(url, useInternalNodes = TRUE)

  # extract links to ads on page
  xp_exp   = "//td/a[contains(@href, 'ventes_immobilieres')]"
  ad_links = xpathSApply(doc, xp_exp, xmlGetAttr, "href")
  return(ad_links)  
}

# Function to Get Ad Details by Ad URL
get_ad_details = function(ad_url){
   require(XML)
   # parse ad url to html tree
   doc = htmlTreeParse(ad_url, useInternalNodes = TRUE)

   # extract labels and values using xpath expression
   labels  = xpathSApply(doc, "//span[contains(@class, 'ad')]/label", xmlValue)
   values1 = xpathSApply(doc, "//span[contains(@class, 'ad')]/strong", xmlValue)
   values2 = xpathSApply(doc, "//span[contains(@class, 'ad')]//a", xmlValue)
   values  = c(values1, values2)

   # convert to data frame and add labels
   mydf        = as.data.frame(t(values))
   names(mydf) = labels
   return(mydf)
}
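The question also asks for the ad's free text and the number of pictures, which `get_ad_details` does not capture. Here is a sketch of how both could be pulled from a parsed page. The XPath selectors (the `content` class and the `img` filter) are assumptions about the page's markup, illustrated on an inline HTML fragment rather than the live site; inspect the real page source to find the actual node names.

```r
require(XML)

# Toy HTML standing in for an ad page; the real class names on
# leboncoin.fr may differ -- inspect the page source to confirm.
html = '<html><body>
  <div class="content">Bel appartement lumineux, proche du centre.</div>
  <img src="//img.leboncoin.fr/images/1.jpg"/>
  <img src="//img.leboncoin.fr/images/2.jpg"/>
</body></html>'
doc = htmlParse(html, asText = TRUE)

# free text of the ad, as a character vector for later text mining
ad_text = xpathSApply(doc, "//div[@class='content']", xmlValue)

# number of pictures: count <img> nodes whose src points at image files
n_pics = length(xpathSApply(doc, "//img[contains(@src, 'images')]",
                            xmlGetAttr, "src"))
```

On a live ad you would replace `htmlParse(html, asText = TRUE)` with the `htmlTreeParse(ad_url, useInternalNodes = TRUE)` call already used above, and adjust the selectors to the site's actual markup.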


Here is how you would use these functions to extract information into a data frame.

# grab ad links from page 1
ad_links = get_ad_links(page = 1)

# grab ad details for first 5 links from page 1
require(plyr)
ad_details = ldply(ad_links[1:5], get_ad_details, .progress = 'text')

This returns the following output:

Prix :     Ville :  Frais d'agence inclus :  Type de bien :  Pièces :  Surface :  Classe énergie :          GES : 
469 000 € 59000 Lille                      Oui          Maison         8     250 m2  F (de 331 à 450)           <NA>
469 000 € 59000 Lille                      Oui          Maison         8     250 m2  F (de 331 à 450)           <NA>
140 000 € 59000 Lille                     <NA>     Appartement         2      50 m2  D (de 151 à 230) E (de 36 à 55)
140 000 € 59000 Lille                     <NA>     Appartement         2      50 m2  D (de 151 à 230) E (de 36 à 55)
170 000 € 59000 Lille                     <NA>     Appartement      <NA>      50 m2  D (de 151 à 230) D (de 21 à 35)

You can easily use the apply family of functions to loop over multiple pages and get details of all ads. Two things to be mindful of: one is the legality of scraping the website; two is to use Sys.sleep in your looping function so that the servers are not bombarded with requests.
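The paging loop with `Sys.sleep` pacing might look like the skeleton below. To keep it runnable without hitting the live site, `fetch_page` here is a stub standing in for `get_ad_links` above; the page range and delay are arbitrary placeholders.

```r
# Stub fetcher standing in for get_ad_links, returning fake ad links,
# so the pacing pattern can be shown offline; swap in the real function.
fetch_page = function(page) paste0("ad_", page, "_", 1:3, ".htm")

# Loop over result pages, pausing between requests so the
# server is not hammered.
scrape_pages = function(pages, fetcher, delay = 2){
  do.call(rbind, lapply(pages, function(p){
    links = fetcher(p)
    Sys.sleep(delay)   # be polite to the server between requests
    data.frame(page = p, link = links, stringsAsFactors = FALSE)
  }))
}

result = scrape_pages(1:2, fetch_page, delay = 0.1)
```

In real use you would pass `get_ad_links` as the fetcher and feed each batch of links through `get_ad_details` (e.g. with `ldply`), increasing the delay to something the site can tolerate.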

Let me know how this works out.
