R - checking HTML for formatting tags (bold, italics etc.)

Views: 27 Published: 2021/9/24 19:01:31 Tags: html r web-scraping edgar sec

Question

I am using edgarWebR to parse 10-K (SEC EDGAR) filings. I am trying to write an algorithm that deduces whether each HTML element is normal text, a subheading or a heading by checking how the document is formatted (e.g. some 10-Ks might have all headings in bold italics, and subheadings in just italics).

edgarWebR returns a dataframe in which each row corresponds to one element and contains its text and its html. An example of some html:

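<b><i>Our quarterly operating results have fluctuated in the past and might continue to fluctuate, causing the value of our common stock to decline substantially.</i></b>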

As we can see, the above should be flagged as bold and italic. However, this is represented differently in different filings. For example, this filing uses <b> to denote bold, whereas others use something like font-weight = bold.

What is the best way to deal with this? Is there an R package that will parse the HTML and either tell me that an element is bold and italic, or return a list of the tags that are specifically formatting tags (not span, p etc.)?

Alternatively, how can I check each row against a manually compiled list of indicators of bold and italic ("bold", <b>, strong) and have it return whichever elements of the list are matched for each row?

At the end, I plan to tabulate the values to determine heading levels. E.g. if I count 100 elements with neither bold nor italic, 20 elements with just <b>, and 10 elements containing <b> and "Italic", I can deduce that bold and italic together represent headings for this particular filing, and that bold alone denotes subheadings.

Recommended answer

I think all you're looking for is whether a particular string contains html markup indicating that something in that string should be bold and/or italic.

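```r
# Example element. The HTML tags were stripped from this post, so the
# <b><i> wrapper below is an assumed reconstruction, not the exact markup.
S <- '<b><i>Our quarterly operating results have fluctuated in the past and might continue to fluctuate, causing the value of our common stock to decline substantially.</i></b>'

# Bold: a <b> tag or a font-weight = bold declaration
grepl("<b>|<font-weight\\s*=\\s*bold", S, ignore.case = TRUE)
# [1] TRUE

# Italic: an <i> tag or a font-style = italic declaration
grepl("<i>|<font-style\\s*=\\s*italic", S, ignore.case = TRUE)
# [1] TRUE
```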
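The same check could be applied to every row of the parsed filing and then tabulated, as the question describes. The following is only a minimal base R sketch under stated assumptions: `doc` is a hypothetical stand-in for the dataframe returned by edgarWebR, the column name `raw` is assumed, and the indicator patterns are illustrative rather than a complete list of the ways a filing can express bold or italic.

```r
# Hypothetical stand-in for the parsed filing; only a `raw` column holding
# each element's HTML is assumed here.
doc <- data.frame(
  raw = c(
    "<p>Plain paragraph text.</p>",
    "<p><b>Bold only</b></p>",
    "<p><b><i>Bold and italic</i></b></p>",
    "<p style=\"font-weight:bold; font-style:italic\">Styled via CSS</p>",
    "<p><i>Italic only</i></p>"
  ),
  stringsAsFactors = FALSE
)

# Manually compiled indicators (illustrative, not exhaustive).
# "<b[ >]" matches <b> and <b ...> but not <br> or <body>.
indicators <- list(
  bold   = c("<b[ >]", "<strong[ >]", "font-weight\\s*[:=]\\s*bold"),
  italic = c("<i[ >]", "<em[ >]", "font-style\\s*[:=]\\s*italic")
)

# TRUE where any pattern in the set matches the string.
has_any <- function(x, patterns) {
  grepl(paste(patterns, collapse = "|"), x, ignore.case = TRUE)
}

doc$bold   <- has_any(doc$raw, indicators$bold)
doc$italic <- has_any(doc$raw, indicators$italic)

# Collapse the two flags into one label per row.
style_label <- function(b, i) {
  s <- paste(c("bold", "italic")[c(b, i)], collapse = "+")
  if (nzchar(s)) s else "none"
}
doc$style <- mapply(style_label, doc$bold, doc$italic)

# Count how often each combination occurs in the filing.
style_counts <- table(doc$style)
print(style_counts)
```

In a real filing the rarest combination would be a candidate for the top heading level and the most common one for body text, but that mapping is a heuristic that needs checking against each filing. Note also that a regex only sees the element's own markup, so formatting inherited from an enclosing element or a stylesheet class would be missed.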
