Yahoo Finance Headlines webpage scraping with R


Problem description

I would like to use R to download the HTML code of any Yahoo Finance Headlines webpage, select the "headlines" and collect them in Excel. Unfortunately I cannot find and select the HTML nodes corresponding to the headlines once I download the source file to R.

Let me show the problem with an example. I started with

source <- "http://finance.yahoo.com/q/h?s=AAPL+Headlines"
file <- "destination/finance_file.csv"
download.file(url = source, destfile = file)
x <- scan(file, what = "", sep = "\n")

producing the Excel file finance_file.csv and, most importantly, the character vector x.

Using x, I would like to collect the headlines and write them into a column in a second Excel file, called headlines.csv.

My problem now is the following: if I select any headline, I can find it in the HTML code of the webpage itself, but I lose track of it in x. Therefore, I do not know how to extract it.

For the extraction I was thinking of

x = x[grep("some string of characters to do the job", x)]

but I am no expert in web scraping. Any ideas/suggestions?

I thank you very much!

Solution

You can use the XML package and write the XPath query needed to extract the headlines.

Since the web page looks like:

...
<ul class="newsheadlines"/>
<ul>
  <li><a href="...">First headline</a></li>
  ...

you get the following query.

library(XML)
source <- "http://finance.yahoo.com/q/h?s=AAPL+Headlines"
d <- htmlParse(source)
xpathSApply(d, "//ul[contains(@class,'newsheadlines')]/following::ul/li/a", xmlValue)
free(d)
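
To try the query without depending on the live page, here is a minimal sketch that runs it against an inline HTML string and then writes the result to a CSV file, which Excel can open. The sample markup, the example.com links, the column names and the temporary output path are all assumptions made for illustration; only the XPath expression and the XML package calls come from the answer above.

```r
library(XML)

# Hypothetical stand-in for the downloaded page, mimicking the structure
# sketched above (an empty "newsheadlines" list followed by the real list).
html <- '<html><body>
<ul class="newsheadlines"></ul>
<ul>
  <li><a href="http://example.com/1">First headline</a></li>
  <li><a href="http://example.com/2">Second headline</a></li>
</ul>
</body></html>'

# Parse the string instead of a URL.
d <- htmlParse(html, asText = TRUE)
query <- "//ul[contains(@class,'newsheadlines')]/following::ul/li/a"

# Link text and link target of every matching <a> node.
headlines <- xpathSApply(d, query, xmlValue)
links <- xpathSApply(d, query, xmlGetAttr, "href")
free(d)

# One row per headline; the output location is an assumption.
out <- data.frame(headline = headlines, link = links,
                  stringsAsFactors = FALSE)
out_file <- file.path(tempdir(), "headlines.csv")
write.csv(out, file = out_file, row.names = FALSE)
```

For the real page, htmlParse(source) from the answer would replace the inline string; whether the query still matches depends on the page keeping the structure shown above.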
