Using rvest to scrape a website - Selecting html node?
Question
I have a question about my latest rvest scrape.
I want to scrape this page (and the pages of some other stocks as well): http://www.finviz.com/quote.ashx?t=AA&ty=c&p=d&b=1
I need a list of Market Cap values; Market Cap is the first box in the second row. The list should contain approx. 50-100 stocks.
I am using rvest for that.
library(rvest)
html = read_html("http://www.finviz.com/quote.ashx?t=A")
cast = html_nodes(html, "table-dark-row")
The problem is that I cannot get html_nodes to work. Any idea how to find the correct selector to pass to html_nodes?
I am using Firebug/FireFinder to inspect the webpage.
Recommended answer
Not sure if this is what you want, because I cannot find a list with approx. 50-100 stocks.
But for what it's worth, using SelectorGadget I came up with this selector, .table-dark-row:nth-child(2) .snapshot-td2:nth-child(2), to select the Market Cap (the first box in the second row of this page: http://www.finviz.com/quote.ashx?t=AA&ty=c&p=d&b=1).
> library(rvest)
>
> html = read_html("http://www.finviz.com/quote.ashx?t=AA&ty=c&p=d&b=1")
>
> cast = html_nodes(html, ".table-dark-row:nth-child(2) .snapshot-td2:nth-child(2)")
> cast
{xml_nodeset (1)}
[1] <td width="8%" class="snapshot-td2" align="left">\n <b>11.58B</b>\n</td>
>
If this is not exactly what you want, just use SelectorGadget to locate the element you need.
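As a side note on why the call in the question matched nothing: in a CSS selector a bare name such as table-dark-row is read as a tag name, while a leading dot selects by class. Here is a minimal sketch of the difference on an inline HTML fragment. The markup is made up to loosely imitate the snapshot table and is not copied from the real page, so treat the row contents as assumptions.

```r
library(rvest)

# Hypothetical markup imitating the snapshot table (not the real finviz page).
page <- read_html('
<table>
  <tr class="table-dark-row">
    <td class="snapshot-td2">Index</td>
    <td class="snapshot-td2"><b>S&amp;P 500</b></td>
  </tr>
  <tr class="table-dark-row">
    <td class="snapshot-td2">Market Cap</td>
    <td class="snapshot-td2"><b>11.58B</b></td>
  </tr>
</table>')

# Without the dot the selector looks for a <table-dark-row> element.
no_dot <- html_nodes(page, "table-dark-row")
length(no_dot)

# With the dot it matches elements whose class is "table-dark-row".
with_dot <- html_nodes(page, ".table-dark-row")
length(with_dot)

# Second row, second cell: the same pattern as the selector in this answer.
cell <- html_nodes(page, ".table-dark-row:nth-child(2) .snapshot-td2:nth-child(2)")
html_text(cell, trim = TRUE)
```

Running this offline first is a cheap way to check a selector before pointing it at the live site.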
Hope this helps.
Here is the complete solution:
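library(rvest)
# Download and parse the quote page
html = read_html("http://www.finviz.com/quote.ashx?t=AA&ty=c&p=d&b=1")
# Second cell of the second row holds the Market Cap
cast = html_nodes(html, ".table-dark-row:nth-child(2) .snapshot-td2:nth-child(2)")
# Strip the "B" (billions) suffix and convert the text to a number
html_text(cast) %>%
  gsub(pattern = "B", replacement = "") %>%
  as.numeric()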
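Since the question asks for 50-100 stocks, the same steps could be wrapped in a function and applied to a vector of tickers. The sketch below is one possible way to do that, not a tested scraper: parse_market_cap and get_market_cap are hypothetical helper names, the K/M/T suffixes are an assumption about how other companies' values might be displayed (the thread only shows "B"), and the selector is the one from this answer, which may stop matching if the page layout changes.

```r
library(rvest)

# Hypothetical helper: turn strings such as "11.58B" into plain numbers.
# The suffix table is an assumption; only "B" appears in this thread.
parse_market_cap <- function(x) {
  x <- trimws(x)
  multipliers <- c(K = 1e3, M = 1e6, B = 1e9, T = 1e12)
  suffix <- toupper(substring(x, nchar(x)))      # last character of each string
  has_suffix <- suffix %in% names(multipliers)
  digits <- ifelse(has_suffix, substring(x, 1, nchar(x) - 1), x)
  number <- suppressWarnings(as.numeric(digits)) # non-numeric text becomes NA
  scale <- ifelse(has_suffix, multipliers[suffix], 1)
  unname(number * scale)
}

# Hypothetical helper: fetch one quote page and return its Market Cap.
get_market_cap <- function(ticker) {
  url <- paste0("http://www.finviz.com/quote.ashx?t=", ticker)
  page <- read_html(url)
  cell <- html_nodes(page, ".table-dark-row:nth-child(2) .snapshot-td2:nth-child(2)")
  parse_market_cap(html_text(cell, trim = TRUE))
}

# Usage (commented out because it needs network access):
# tickers <- c("AA", "A")   # the two symbols that appear in this thread
# sapply(tickers, get_market_cap)
```

Keeping the parsing separate from the download makes it possible to check the number conversion without hitting the site.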