Yahoo Finance Headlines webpage scraping with R
I would like to use R to download the HTML code of any Yahoo Finance Headlines webpage, select the "headlines" and collect them in Excel. Unfortunately I cannot find and select the HTML nodes corresponding to the headlines once I download the source file to R.
Let me show the problem with an example. I started with
source <- "http://finance.yahoo.com/q/h?s=AAPL+Headlines"
file <- "destination/finance_file.cvs"
download.file(url = source, destfile = file)
x = scan(file, what = "", sep = "\n")
producing the Excel file finance_file.cvs and, most importantly, the character vector x. Using x, I would like to collect the headlines and write them into a column in a second Excel file, called headlines.cvs.
My problem now is the following: if I select any headline I can find it in the HTML code of the webpage itself, but I lose track of it in x. Therefore, I do not know how to extract it.
For the extraction I was thinking of
x = x[grep("some string of characters to do the job", x)]
but I am no expert in web scraping. Any ideas/suggestions?
I thank you very much!
You can use the XML package and write the XPath query needed to extract the headlines.
Since the web page looks like:
...
<ul class="newsheadlines"/>
<ul>
<li><a href="...">First headline</a></li>
...
you get the following query.
library(XML)
source <- "http://finance.yahoo.com/q/h?s=AAPL+Headlines"
d <- htmlParse(source)
xpathSApply(d, "//ul[contains(@class,'newsheadlines')]/following::ul/li/a", xmlValue)
free(d)
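To cover the last step of the question (getting the headlines into a file that Excel can open), here is a minimal sketch of the same approach run offline. It assumes the XML package is installed. The inline HTML string, the second headline, and the output path under tempdir() are made up for illustration; they stand in for the downloaded page, which may not have this layout today.

```r
library(XML)

# Hypothetical HTML standing in for the downloaded page (assumed layout,
# modeled on the snippet above; the real page may differ)
html <- '<html><body>
<ul class="newsheadlines"></ul>
<ul>
<li><a href="#1">First headline</a></li>
<li><a href="#2">Second headline</a></li>
</ul>
</body></html>'

# Parse the string as HTML instead of fetching a URL
d <- htmlParse(html, asText = TRUE)

# Same XPath query as above: link texts in the list that follows
# the "newsheadlines" marker
headlines <- xpathSApply(
  d,
  "//ul[contains(@class,'newsheadlines')]/following::ul/li/a",
  xmlValue
)
free(d)

# Write one headline per row; the output location is an arbitrary choice
out_file <- file.path(tempdir(), "headlines.csv")
write.csv(data.frame(headline = headlines), out_file, row.names = FALSE)
```

Working on the parsed document rather than on the lines returned by scan() avoids guessing which line of x holds a given headline, since the query depends on the tag structure and not on how the source happens to be broken into lines.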