Extract Links from Webpage using R

Question

The two posts below are great examples of different approaches to extracting data from websites and parsing it into R:

Scraping html tables into R data frames using the XML package

How can I use R (Rcurl/XML packages?!) to scrape this webpage?

I am very new to programming and am just starting out with R, so I am hoping this question is pretty basic; given the posts above, I imagine it is.

All I am looking to do is extract links that match a given pattern. I feel like I could probably use RCurl to read in the web pages and extract the links brute-force with string expressions. That said, if the webpage is fairly well formed, how would I go about doing so with the XML package?

As I learn more, I like to "look" at the data as I work through the problem. The issue is that some of these approaches generate lists of lists of lists, etc., so it is hard for someone who is new (like me) to walk through to where I need to go.

Again, I am very new to all things programming, so any help or code snippets will be greatly appreciated.

Answer

The documentation for htmlTreeParse shows one method. Here's another:

> library(XML)
> url <- "http://stackoverflow.com/questions/3746256/extract-links-from-webpage-using-r"
> doc <- htmlParse(url)                   # parse the page into an HTML tree
> links <- xpathSApply(doc, "//a/@href")  # pull the href attribute of every <a> tag
> free(doc)                               # release the document's C-level memory

(You can drop the "href" attribute names from the returned links by passing links through as.vector.)
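
Since the stated goal was links that match a given pattern, here is a minimal follow-on sketch using base R's grep; the pattern "^/questions/" is only an illustrative assumption, not something given in the original question:

> links <- as.vector(links)                                   # drop the "href" names
> matched_links <- grep("^/questions/", links, value = TRUE)  # keep only links matching the pattern
> head(matched_links)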

My previous answer:

One approach is to use Hadley Wickham's stringr package, which you can install with install.packages("stringr", dep=TRUE).

> url <- "http://stackoverflow.com/questions/3746256/extract-links-from-webpage-using-r"
> html <- paste(readLines(url), collapse="\n")         # read the page source into one string
> library(stringr)
> matched <- str_match_all(html, "<a href=\"(.*?)\"")  # capture each href value

(I guess some people might not approve of using regexps here.)

matched is a list of matrices, one per input string in the vector html; since html has length one here, matched has just one element. The matches for the first capture group are in column 2 of that matrix (in general, the ith group appears in column i + 1).

> links <- matched[[1]][, 2]
> head(links)
[1] "/users/login?returnurl=%2fquestions%2f3746256%2fextract-links-from-webpage-using-r"
[2] "http://careers.stackoverflow.com"                                                  
[3] "http://meta.stackoverflow.com"                                                     
[4] "/about"                                                                            
[5] "/faq"                                                                              
[6] "/"

