Yahoo Finance Headlines webpage scraping with R

Views: 126 Published: 2018/6/21 17:28:04 Tags: html r web web-scraping


Question


I would like to use R to download the HTML code of any Yahoo Finance Headlines webpage, select the "headlines" and collect them in Excel. Unfortunately I cannot find and select the HTML nodes corresponding to the headlines once I download the source file to R.

Let me show the problem with an example. I started with

    source <- "http://finance.yahoo.com/q/h?s=AAPL+Headlines"
    file <- "destination/finance_file.csv"
    download.file(url = source, destfile = file)
    x = scan(file, what = "", sep = "\n")

producing the Excel file finance_file.csv and, most importantly, the character vector x.

Using x I would like to collect the headlines and write them into a column in a second Excel file, called headlines.csv.

My problem now is the following: if I pick any headline, I can find it in the HTML code of the webpage itself, but I lose track of it in x. Therefore, I do not know how to extract it.

For the extraction I was thinking of

    x = x[grep("some string of characters to do the job", x)]

but I am no expert in web scraping. Any ideas or suggestions?

I thank you very much!
Solution
You can use the XML package and write the XPath query needed to extract the headlines.

Since the web page looks like:

    ... <ul class="newsheadlines"/> <ul> <li><a href="...">First headline</a></li> ...

you get the following query:

    library(XML)
    source <- "http://finance.yahoo.com/q/h?s=AAPL+Headlines"
    d <- htmlParse(source)
    xpathSApply(d, "//ul[contains(@class,'newsheadlines')]/following::ul/li/a", xmlValue)
    free(d)
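
To try the query without depending on the live page, here is a minimal sketch that runs the same XPath against an inline HTML string and then writes the result to a CSV file, which covers the "collect them in Excel" step from the question. The sample markup and its two headlines are made up for illustration and only mirror the structure described above; the variable names `html`, `doc`, `headlines`, and `out_file`, as well as the temporary output path, are assumptions rather than part of the original answer.

```r
library(XML)

# Made-up sample page that mirrors the structure described above;
# the real page content is not reproduced here.
html <- '<html><body>
  <ul class="newsheadlines"></ul>
  <ul>
    <li><a href="#1">First headline</a></li>
    <li><a href="#2">Second headline</a></li>
  </ul>
</body></html>'

# Parse the string as HTML (asText = TRUE: the input is content, not a URL or file).
doc <- htmlParse(html, asText = TRUE)

# Same XPath as in the answer: links inside list items of any <ul>
# that follows the <ul> whose class contains "newsheadlines".
headlines <- xpathSApply(
  doc,
  "//ul[contains(@class,'newsheadlines')]/following::ul/li/a",
  xmlValue
)
free(doc)

# Write one headline per row; the temporary output path is an assumption.
out_file <- tempfile(fileext = ".csv")
write.csv(data.frame(headline = headlines), out_file, row.names = FALSE)

print(headlines)
```

Replacing `tempfile(fileext = ".csv")` with a path such as `"headlines.csv"` would produce the file the question asks for, and the resulting CSV opens directly in Excel.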
