Rvest - using a dataframe of html rather than a webpage - and extracting formatting tags


Problem description

I am trying to extract formatting tags from a column of HTML, and then record for each row whether it is bold or italic, what colour it is, and so on. I was trying to decide whether to use regex or an HTML parser, and was pointed in the direction of rvest. However, I can't figure out how to get it to parse a column of a data frame rather than fetch a URL. Also, can anyone provide some basic code for extracting any formatting tags present in the HTML, or even a list of all tags and attributes, which I can then filter against a manually compiled list of the relevant ones?

An example of the kind of HTML involved, from which I need the font size, font family, font colour, background, and the fact that part of it is italic:

<div align="left" style="margin-left: 0%; margin-right: 0%; text-indent: 0%; font-size: 10pt; font-family: 'Times New Roman', Times; color: #000000; background: #FFFFFF"> These forward-looking statements are also affected by the risk factors described below in Part I, Item 1A ("Risk Factors") and those set forth from time to time in our filings with the Securities and Exchange Commission ("SEC"), which are available through our website at <i>www.exterran.com </i>and through the SEC's Electronic Data Gathering and Retrieval System ("EDGAR") at <i><u>www.sec.gov</u></i>. Important factors that could cause our actual results to differ materially from the expectations reflected in these forward-looking statements include, among other things: </div>

Solution

One possible solution, using the XML package rather than rvest, is the following:

htmlstring <- '<div align="left" style="margin-left: 0%; margin-right: 0%; text-indent: 0%; font-size: 10pt; font-family: \'Times New Roman\', Times; color: #000000; background: #FFFFFF"> These forward-looking statements are also affected by the risk factors described below in Part I, Item 1A ("Risk Factors") and those set forth from time to time in our filings with the Securities and Exchange Commission ("SEC"), which are available through our website at <i>www.exterran.com </i>and through the SEC\'s Electronic Data Gathering and Retrieval System ("EDGAR") at <i><u>www.sec.gov</u></i>. Important factors that could cause our actual results to differ materially from the expectations reflected in these forward-looking statements include, among other things: </div>'

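# asText = TRUE tells htmlParse to treat the input as HTML content, not as a file name or URL
htmlstring <- XML::htmlParse(htmlstring, asText = TRUE)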

You can then use XPath to extract what you need, for example the italicized parts:

XML::getNodeSet(htmlstring, '//i')
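
The question also asks how to do this for a whole data frame column and how to list every tag and style property. The following is a minimal sketch of one way to do that with the same XML package, not a definitive implementation. The data frame `df`, its column `html`, and the helpers `parse_style` and `describe_html` are hypothetical names introduced for illustration. The sketch assumes each cell holds one HTML fragment and only reads the `style` attribute of the first styled element. (rvest's `read_html`, which comes from xml2, should also accept a literal HTML string, so a similar loop is possible there.)

```r
library(XML)

# Hypothetical data frame: one HTML fragment per row (column name is an assumption).
df <- data.frame(
  html = c(
    '<div style="font-size: 10pt; color: #000000; background: #FFFFFF">See <i>www.sec.gov</i></div>',
    '<p style="font-family: Arial">Plain <b>bold</b> text</p>'
  ),
  stringsAsFactors = FALSE
)

# Split a CSS style string such as "font-size: 10pt; color: #000000"
# into a named character vector (property name -> value).
parse_style <- function(style) {
  decl <- trimws(strsplit(style, ";", fixed = TRUE)[[1]])
  decl <- decl[grepl(":", decl, fixed = TRUE)]   # keep only "name: value" pairs
  keys <- trimws(sub(":.*$", "", decl))
  vals <- trimws(sub("^[^:]*:", "", decl))
  setNames(vals, keys)
}

# Summarise one HTML fragment as a one-row data frame.
describe_html <- function(fragment) {
  doc <- htmlParse(fragment, asText = TRUE)
  # htmlParse wraps fragments in <html><body>, so look only below <body>.
  tags <- unlist(xpathSApply(doc, "//body//*", xmlName))
  style_attrs <- unlist(xpathSApply(doc, "//body//*[@style]", xmlGetAttr, "style"))
  style <- if (length(style_attrs) > 0) parse_style(style_attrs[1]) else character(0)
  # Return the value of a style property, or NA if it is not set.
  pick <- function(key) {
    if (key %in% names(style)) unname(style[key]) else NA_character_
  }
  data.frame(
    tags        = paste(unique(tags), collapse = ","),
    italic      = any(tags %in% c("i", "em")),
    bold        = any(tags %in% c("b", "strong")),
    font_size   = pick("font-size"),
    font_family = pick("font-family"),
    color       = pick("color"),
    background  = pick("background"),
    stringsAsFactors = FALSE
  )
}

# Apply the summary to every row and bind the results together.
result <- do.call(rbind, lapply(df$html, describe_html))
print(cbind(row = seq_len(nrow(df)), result))
```

The `tags` column gives the full list of tags found in each fragment, which can then be filtered against a manually compiled list of formatting tags. Text made italic or bold through CSS (for example `font-style: italic`) is not detected by the tag checks above and would need an extra look at the parsed style.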
