R - checking HTML for formatting tags (bold, italics etc.)

Question

I am using edgarWebR to parse 10-K (SEC EDGAR) filings. I am trying to write an algorithm that deduces whether each HTML element is normal text, a subheading or a heading by checking how the document is formatted (e.g. some 10-Ks might have all headings in bold italics, and subheadings in just italics).

edgarWebR returns a data frame with one row per element, containing the text and the HTML. An example of the HTML:

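<p style="margin-top:18px;margin-bottom:0px"><font style="font-family:ARIAL" size="2"><b><i>Our quarterly operating results have fluctuated in the past and might continue to fluctuate, causing the value of our common stock to decline substantially. </i></b></font></p>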

As we can see, the above should be flagged as bold and italic. However, this is represented differently in different filings. For example, this filing uses <b> to denote bold, whereas others use something like font-weight: bold in a style attribute.

What is the best way to deal with this? Is there an R package that will parse the HTML and either tell me that it is bold and italic, or return a list of the tags that are specifically formatting tags (not span, p etc.)?

Alternatively, how can I check each row against a manually compiled list of indicators of bold and italic ("bold", <b>, strong) and have it return the elements of the list that match for each row?

At the end, I plan to tabulate the values to determine heading levels. E.g. if I count 100 elements with neither bold nor italic, 20 elements with just <b>, and 10 elements containing both <b> and "italic", I can deduce that bold italic represents headings for this particular filing, and bold alone denotes subheadings.

Recommended answer

I think all you're looking for is whether a particular string contains HTML markup indicating that something in that string should be bold and/or italic.

# Example element from the filing
S <- '<p style="margin-top:18px;margin-bottom:0px"><font style="font-family:ARIAL" size="2"><b><i>Our quarterly operating results have fluctuated in the past and might continue to fluctuate, causing the value of our common stock to decline substantially. </i></b></font></p>'
# Bold: <b>/<strong> tags, or a font-weight declaration (a CSS property, not a tag)
grepl("<(b|strong)\\b|font-weight\\s*[:=]\\s*bold", S, ignore.case = TRUE)
# [1] TRUE
# Italic: <i>/<em> tags, or a font-style declaration
grepl("<(i|em)\\b|font-style\\s*[:=]\\s*italic", S, ignore.case = TRUE)
# [1] TRUE
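
Because grepl is vectorised, the same kind of pattern check can be run over every row at once and the results cross-tabulated, which is the tabulation step the question describes. The following is only a minimal sketch in base R: the toy data frame and its column name `html` are assumptions made for illustration (check which column of the edgarWebR output actually holds the raw HTML), and the two patterns are a starting list that will likely need extending as new formatting variants turn up in other filings.

```r
# Toy stand-in for a parsed filing; the column name `html` is hypothetical.
filing <- data.frame(
  html = c(
    '<p><font size="2"><b><i>Risk factor heading</i></b></font></p>',
    '<p><span style="font-weight: bold">Subheading</span></p>',
    '<p><em>Another italic line</em></p>',
    '<p>Plain body text.</p>'
  ),
  stringsAsFactors = FALSE
)

# Tag-based and CSS-based indicators. The \\b keeps <b from matching <br> or
# <body>, and <i from matching <img>.
bold_pattern   <- "<(b|strong)\\b|font-weight\\s*:\\s*(bold|[6-9]00)"
italic_pattern <- "<(i|em)\\b|font-style\\s*:\\s*italic"

# One logical flag per row for each kind of formatting.
filing$bold   <- grepl(bold_pattern,   filing$html, ignore.case = TRUE, perl = TRUE)
filing$italic <- grepl(italic_pattern, filing$html, ignore.case = TRUE, perl = TRUE)

# Count each bold/italic combination; the rarer combinations are the
# candidates for headings and subheadings.
style_counts <- table(bold = filing$bold, italic = filing$italic)
print(style_counts)
```

Regular expressions only see the element's own markup, so formatting inherited from a stylesheet or a parent element would be missed; a real HTML parser would be the more robust route if that turns out to matter for a given filing.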
