Extracting <tr> values from multiple html files

Views: 769 | Posted: 2018/6/25 18:33:41 | Tags: html, r, web-scraping, rvest


Question

I am new to web scraping. I have 3000+ html/htm files, and I need to extract the "tr" values from them and transform them into a data frame for further analysis.

The code I have used is:

    html <- list.files(pattern="\\.(htm|html)$")
    mydata <- lapply(html,read_html)%>%
      html_nodes("tr")%>%
      html_text()

It fails with:

    Error in UseMethod("xml_find_all") :
      no applicable method for 'xml_find_all' applied to an object of class "character"

What am I doing wrong?

To convert the result into a data frame, I have this code:

    u <- as.data.frame(matrix(mydata,byrow = TRUE),stringsAsFactors = FALSE)

Thank you in advance.

Solution

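lapply returns a list of documents, and html_nodes cannot be applied to that list directly. Instead, include all rvest operations inside the lapply call so that each file is processed on its own:

    library(rvest)  # also provides read_html() and the %>% pipe

    # All .htm/.html files in the working directory
    html <- list.files(pattern="\\.(htm|html)$")

    # Parse each file and extract the text of its <tr> nodes
    mydata <- lapply(html, function(file) {
      read_html(file) %>% html_nodes('tr') %>% html_text()
    })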
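To make the cause of the error more concrete, here is a minimal sketch that splits the work into two passes. It assumes two throwaway test files written to a temporary directory (the file names `a.html` and `b.htm` are made up for illustration). The point is that the first lapply already returns a plain list, so the node extraction has to be applied to each element of that list rather than to the list itself.

```r
library(rvest)  # also provides read_html() and the %>% pipe

# Assumption: two throwaway test files in a temporary directory
dir <- tempfile("trdemo")
dir.create(dir)
writeLines("<html><body><table><tr><td>Martin</td></tr></table></body></html>",
           file.path(dir, "a.html"))
writeLines("<html><body><table><tr><td>Carolin</td></tr></table></body></html>",
           file.path(dir, "b.htm"))

files <- list.files(dir, pattern = "\\.(htm|html)$", full.names = TRUE)

# Pass 1: parse every file; the result is a plain list of documents
docs <- lapply(files, read_html)

# Pass 2: html_nodes() works on one document at a time, so iterate again
mydata <- lapply(docs, function(doc) {
  doc %>% html_nodes("tr") %>% html_text()
})
```

With several thousand files, the single-pass version is probably the simpler choice, since it does not keep every parsed document in memory at once.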

Example

Having two test files in my working directory with the content

    <html> <head></head> <body> <table> <tr><td>Martin</td></tr> </table> </body> </html>

and

    <html> <head></head> <body> <table> <tr><td>Carolin</td></tr> </table> </body> </html>

would output

    > mydata
    [[1]]
    [1] "Martin"

    [[2]]
    [1] "Carolin"

In my case I could then format it using

    data.frame(Content = unlist(mydata))
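Note that unlist() flattens everything into one vector, so you can no longer tell which file a row came from. If that matters for your analysis, one option is to build a small data frame per file and bind them together. This is only a sketch: `rows_from_file` is a hypothetical helper name (not part of rvest), and the test file is a made-up temporary one.

```r
library(rvest)  # also provides read_html() and the %>% pipe

# Hypothetical helper: one data-frame row per <tr>, tagged with its source file
rows_from_file <- function(file) {
  rows <- read_html(file) %>% html_nodes("tr") %>% html_text()
  data.frame(File = rep(basename(file), length(rows)),
             Content = rows,
             stringsAsFactors = FALSE)
}

# Assumption: a throwaway test file containing two table rows
tmp <- tempfile(fileext = ".html")
writeLines(paste0("<html><body><table>",
                  "<tr><td>Martin</td></tr>",
                  "<tr><td>Carolin</td></tr>",
                  "</table></body></html>"),
           tmp)

# In real use, pass the vector returned by list.files() instead of tmp
result <- do.call(rbind, lapply(tmp, rows_from_file))
```

Files without any `<tr>` nodes simply contribute zero rows, so they do not break the rbind step.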
