html_table doesn't work with a long row
Problem description
I am trying to extract the table on the page below using html_table from rvest. However, the first text, in the first row, is part of the table and is apparently causing conflicts with html_table. Here is the code:
# Libraries
library(rvest)
library(XML)
url <- "http://www.svs.cl/institucional/mercados/consulta.php?mercado=V&Estado=VI&entidad=RVEMI" # page
url <- read_html(url)
table <- html_nodes(url, "table") # read nodes
table <- html_table(table, fill = TRUE) # write as a table
And the error is:
Error in if (length(p) > 1 & maxp * n != sum(unlist(nrows)) & maxp * n != : missing value where TRUE/FALSE needed
In addition: Warning message:
In lapply(ncols, as.integer) : NAs introduced by coercion
Maybe it could be extracted with html_text, but I need it in table format.
Any help is appreciated.
Recommended answer
It's not the size of the table but the extremely gnarly nodes in the first two rows.
So, just edit out the problem nodes.
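One way to see which rows are the troublemakers is to count the cells in each row and look at their colspan attributes before calling html_table(). The sketch below does this on a small inline table. The HTML string and the `colspan="100%"` value are made up for illustration and are not taken from the page in the question; a colspan that is not a plain integer would be consistent with the "NAs introduced by coercion" warning, but that has not been checked against the real page.

```r
library(xml2)

# Hypothetical stand-in for the real page: two messy leading rows, then data.
html <- '<table>
  <tr><td colspan="100%">Long banner text that spans the table</td></tr>
  <tr><td colspan="100%">Another banner row</td></tr>
  <tr><td>76675290-K</td><td>AD RETAIL S.A.</td><td>VI</td></tr>
  <tr><td>94272000-9</td><td>AES GENER S.A.</td><td>VI</td></tr>
</table>'

pg <- read_html(html)
rows <- xml_find_all(pg, ".//table/tr")

# Number of cells (td or th) in each row
n_cells <- vapply(
  rows,
  function(r) length(xml_find_all(r, "./td|./th")),
  integer(1)
)

# colspan of the first cell in each row (NA when the attribute is absent)
first_colspan <- vapply(
  rows,
  function(r) xml_attr(xml_find_first(r, "./td|./th"), "colspan"),
  character(1)
)

# Rows whose cell count or colspan differs from the rest are the suspects
data.frame(row = seq_along(rows), n_cells, first_colspan)
```

Rows that stand out in this summary (a single cell, or an odd colspan) are the candidates for removal.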
xml2 supports a much wider array of libxml2 operations now:
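library(rvest)
library(xml2) # attach explicitly for xml_remove(); rvest may not attach it for you
library(tidyverse)
pg <- read_html("http://www.svs.cl/institucional/mercados/consulta.php?mercado=V&Estado=VI&entidad=RVEMI")
# Remove the first row twice: after the first removal, the old second row becomes tr[1]
xml_remove(html_nodes(pg, xpath=".//table/tr[1]"))
xml_remove(html_nodes(pg, xpath=".//table/tr[1]"))
# Parse what is left and keep the first table as a tibble
html_nodes(pg, xpath=".//table") %>%
  html_table() %>%
  .[[1]] %>%
  as_tibble()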
## # A tibble: 368 × 3
## X1 X2 X3
## <chr> <chr> <chr>
## 1 76675290-K AD RETAIL S.A. VI
## 2 98000000-1 ADMINISTRADORA DE FONDOS DE PENSIONES CAPITAL S.A. VI
## 3 98000100-8 ADMINISTRADORA DE FONDOS DE PENSIONES HABITAT S.A. VI
## 4 76240079-0 ADMINISTRADORA DE FONDOS DE PENSIONES CUPRUM S.A. VI
## 5 76762250-3 ADMINISTRADORA DE FONDOS DE PENSIONES MODELO S.A. VI
## 6 98001200-K ADMINISTRADORA DE FONDOS DE PENSIONES PLANVITAL S.A. VI
## 7 76265736-8 ADMINISTRADORA DE FONDOS DE PENSIONES PROVIDA S.A. VI
## 8 94272000-9 AES GENER S.A. VI
## 9 96566940-K AGENCIAS UNIVERSALES S.A. VI
## 10 91253000-0 AGRICOLA NACIONAL S.A.C. E I. VI
## # ... with 358 more rows
Note that you could do:
xml_remove(html_nodes(pg, xpath=".//table/tr[position() >= 1 and position() <=2]"))
instead of the two remove ops, but it's almost as verbose and there's no real performance gain here.
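To check that the single-XPath form removes the same rows as the two sequential calls, both can be run on a small inline table. This is only a sketch: the HTML below is a made-up stand-in (no network access needed), and `strip_twice` and `strip_once` are hypothetical helper names used just for this comparison.

```r
library(xml2)
library(rvest)

# Made-up table: two junk rows followed by two data rows.
html <- '<table>
  <tr><td colspan="3">Banner</td></tr>
  <tr><td colspan="3">Sub-banner</td></tr>
  <tr><td>76675290-K</td><td>AD RETAIL S.A.</td><td>VI</td></tr>
  <tr><td>94272000-9</td><td>AES GENER S.A.</td><td>VI</td></tr>
</table>'

# Variant 1: remove tr[1] twice. After the first removal,
# the former second row is the new tr[1].
strip_twice <- function(doc) {
  xml_remove(xml_find_all(doc, ".//table/tr[1]"))
  xml_remove(xml_find_all(doc, ".//table/tr[1]"))
  html_table(xml_find_first(doc, ".//table"))
}

# Variant 2: remove both rows with one positional predicate.
strip_once <- function(doc) {
  xml_remove(xml_find_all(doc, ".//table/tr[position() <= 2]"))
  html_table(xml_find_first(doc, ".//table"))
}

# xml_remove() modifies the document in place, so parse a fresh copy for each variant.
a <- strip_twice(read_html(html))
b <- strip_once(read_html(html))

# Compare the two results
all.equal(as.data.frame(a), as.data.frame(b))
```

Because xml_remove() changes the parsed document in place, each variant gets its own freshly parsed copy; reusing one document would leave nothing for the second variant to remove.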