Web Scraping tables in R


Question

Complete noob trying to scrape the table on this page, the furthest I've got is loading the rvest package. My problems are:

  1. I couldn't find the right element. The element I tried via the inspector is "table.w782.comm.lsjz", but it returns a list of length 0, and adding %>% .[[1]] after html_table(), i.e. fund_page %>% html_nodes("table.w782.comm.lsjz") %>% html_table() %>% .[[1]], doesn't work either:

(Error in .[[1]] : subscript out of bounds)

library(rvest)

fund_link <- "https://fundf10.eastmoney.com/jjjz_510300.html"
fund_page <- read_html(fund_link)
fund_table <- fund_page %>% html_nodes("table.w782.comm.lsjz") %>% html_table()

  2. The table spans multiple pages (113), but clicking page 2 doesn't reload the HTML, so I don't see how to scrape all 113 pages of data in one go...

Really appreciate any pointers as to what I could do...

Answer

In your original question, the problem is that the table shows up as a script rather than a valid XML/HTML table. Using the API link you found is definitely the way to go.

library(rvest)

# You gave an API link and this is the best option for getting the data.
fund_link <- "https://fundf10.eastmoney.com/F10DataApi.aspx?type=lsjz&code=510300&page=1&sdate=2019-01-01&edate=2021-02-13&per=40"
fund_page <- read_html(fund_link)

# Any of these seem to work
fund_table <- fund_page %>% html_nodes(css = "table") %>% html_table() %>% .[[1]]
fund_table <- fund_page %>% html_nodes(css = "table.w782") %>% html_table() %>% .[[1]]
fund_table <- fund_page %>% html_nodes(css = "table.comm") %>% html_table() %>% .[[1]]
fund_table <- fund_page %>% html_nodes(css = "table.lsjz") %>% html_table() %>% .[[1]]
fund_table <- fund_page %>% html_nodes(css = "table.w782.comm.lsjz") %>% html_table() %>% .[[1]]
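The API link also answers the pagination question: its `page` query parameter selects which of the 113 pages is returned, so the per-page tables can be fetched in a loop and bound together. This is only a sketch, assuming every page parses to the same columns:

```r
library(rvest)

# URL template for the API; only the page number changes. The other query
# parameters are kept exactly as in the link above (an assumption -- adjust
# sdate/edate/per to taste).
base_url <- paste0("https://fundf10.eastmoney.com/F10DataApi.aspx",
                   "?type=lsjz&code=510300&page=%d",
                   "&sdate=2019-01-01&edate=2021-02-13&per=40")

# Widen 1:3 to 1:113 to fetch the full range.
pages <- lapply(1:3, function(i) {
  sprintf(base_url, i) %>%
    read_html() %>%
    html_nodes("table") %>%
    html_table() %>%
    .[[1]]
})

fund_table <- do.call(rbind, pages)
```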


# Original Question:
fund_link <- "https://fundf10.eastmoney.com/jjjz_510300.html"
fund_page <- read_html(fund_link)

# The following doesn't work because the table you want is actually a script, not a table.
# <script id="lsjzTable" type="text/html">
#   {{if Data && Data.LSJZList}}
# <table class="w782 comm lsjz">
#   <thead>
#   <tr>
#   <th class="first">净值日期</th>
#   {{if ((Data.FundType!="004" && Data.FundType!="005") || "510300"=="511880")}}
# <th>单位净值</th>
#   <th>累计净值</th>
#   {{if Data.FundType=="100"}}
# <th>周增长率</th>
#   {{else}}
# <th>日增长率<img id="jjjzTip" style="position: relative; top: 3px; left: 3px;" data-html="true" data-placement="bottom" title="日增长率为空原因如下:<br>1、非交易日净值不参与日增长率计算(灰色数据行)。<br>2、上一交易日净值未披露,日增长率无法计算。" src="//j5.dfcfw.com/image/201307/20130708102440.gif"></th>
#   {{/if}}
fund_table <- fund_page %>% html_nodes(css = "table") %>% html_table() %>% .[[1]]

# The following is a partial solution but isn't fully working.
fund_table <- fund_page %>% 
  html_nodes("script#lsjzTable") %>%
  as.character(.) %>%
  stringr::str_remove_all("\\{\\{.+?\\}\\}") %>%
  stringr::str_remove_all("\\<\\/?script.*?\\>") %>%
  read_html() %>%
  html_nodes("table") %>%
  html_table()
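The `html_table()` call at the end of that pipe returns a list of data frames, since the template may parse into more than one fragment. If the fragments turn out to share the same columns (an assumption worth checking with `str()` first), they can be collapsed into one frame. A sketch of that final step, using a stand-in list in place of the real scrape result:

```r
# Stand-in for the list html_table() returns; the real fragments come from
# the pipe above. Binding assumes the fragments share the same columns.
fund_table <- list(
  data.frame(date = "2021-02-10", nav = 1.23),
  data.frame(date = "2021-02-09", nav = 1.21)
)

# Inspect before binding, in case column names differ per fragment.
str(fund_table, max.level = 1)

fund_df <- do.call(rbind, fund_table)
nrow(fund_df)   # 2
```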
