Web scraping in R

Question

I'm trying to get the values of 'Date Posted' and 'Date Updated' as pictured here. The website URL is: http://sulit.com.ph/3991016

I have a feeling I should be using xpathSApply, as suggested in this thread, Web Scraping (in R?), but I just can't get it to work.

library(XML)

url = "http://sulit.com.ph/3991016"
doc = htmlTreeParse(url, useInternalNodes = TRUE)

date_posted = xpathSApply(doc, "??????????", xmlValue)

Also, does anyone know a quick way to get the phrase 'P27M', which is also listed on the website? Help would be appreciated.

Recommended answer

Here's another approach.

> require(XML)
> 
> url = "http://www.sulit.com.ph/index.php/view+classifieds/id/3991016/BEAUTIFUL+AYALA+HEIGHTS+QC+HOUSE+FOR+SALE"
> doc = htmlParse(url)
> 
> dates = getNodeSet(doc, "//span[contains(string(.), 'Date Posted') or contains(string(.), 'Date Updated')]")
> dates = lapply(dates, function(x){
+         temp = xmlValue(xmlParent(x)["span"][[2]])
+         strptime(gsub("^[[:space:]]+|[[:space:]]+$", "", temp), format = "%B %d, %Y")
+ 
+ })
> dates
[[1]]
[1] "2012-07-05"

[[2]]
[1] "2011-08-11"

There's no need to use RCurl, since htmlParse can fetch and parse a URL directly. getNodeSet returns a list of the span nodes whose text contains "Date Posted" or "Date Updated". The lapply call loops over those nodes; for each one it finds the parent node and then takes the value of that parent's second "span" child. This part may not be very robust if the website formats different pages differently (which, after looking at the HTML for that site, seems very possible). SlowLearner's gsub strips leading and trailing whitespace from both dates. I added strptime to return the dates as date-time objects (class POSIXlt) rather than strings, but that step is optional and depends on how you plan to use the info in the future. HTH
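For anyone who wants to try the idea without hitting the live site, here is a minimal sketch of the same label-then-sibling lookup run against an inline HTML string. The markup, the two date values, and the variable names (html, raw, clean) are made up for illustration and are not taken from the real page, whose structure may well differ. It also uses following-sibling in the XPath instead of xmlParent, and as.Date instead of strptime, which are substitutions on my part rather than what the code above does.

```r
library(XML)

# Hypothetical markup standing in for the listing page (an assumption;
# the real page is not reproduced here and may be structured differently).
html <- '
<html><body>
  <div><span>Date Posted</span><span>  July 05, 2012 </span></div>
  <div><span>Date Updated</span><span> August 11, 2011 </span></div>
</body></html>'

# asText = TRUE tells htmlParse to treat the argument as HTML content,
# not as a file name or URL.
doc <- htmlParse(html, asText = TRUE)

# Find each label span, then take the first span that follows it
# under the same parent, and extract its text.
raw <- xpathSApply(
  doc,
  "//span[contains(., 'Date Posted') or contains(., 'Date Updated')]/following-sibling::span[1]",
  xmlValue
)

# Strip leading and trailing whitespace (same regex as in the answer).
clean <- gsub("^[[:space:]]+|[[:space:]]+$", "", raw)

# Convert to Date. Note that %B matches month names in the current
# locale, so this assumes an English (or C) locale.
dates <- as.Date(clean, format = "%B %d, %Y")
print(dates)
```

Doing the sibling step inside the XPath means a single xpathSApply call returns a plain character vector, so no lapply is needed. It is just as fragile as the original if the page layout changes.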
