How to apply rvest to a dataframe column of HTML to make a column of extracted emboldened words


Problem description

I have a dataframe, of which one column - raw - is HTML:

other column | raw
First row | <p id="PARA339" style="TEXT-ALIGN: left; MARGIN: 0pt; LINE-HEIGHT: 1.25"><font style="FONT-SIZE: 10pt; FONT-FAMILY: Times New Roman, Times, serif"><i>We had a net loss of $1.</i><i><b>55</b></i><i> million for the year ended December 31, 201</i><i>6</i><i> and have an accumulated deficit of $</i><i>61.5</i><i> million as of December 31, 201</i><i>6</i><i>. To achieve sustainable profitability, we must generate increased revenue.</i></font></p>
Second row | <div style="line-height:174%;text-align:left;font-size:9pt;"><font style="font-family:inherit;font-size:9pt;font-style:italic;font-weight:bold;">We have a history of losses, and we cannot assure you that we will achieve profitability.</font></div>

I would like to build some new columns off the raw column. I would like one column per common styling attribute (bold, italic, underlining etc.) - where each entry in the is_bold column, for example, is either "bold" or just blank. So my final desired output looks like this:

other column | raw | is_bold | is_italic
First row | <p id="PARA339" style="TEXT-ALIGN: left; MARGIN: 0pt; LINE-HEIGHT: 1.25"><font style="FONT-SIZE: 10pt; FONT-FAMILY: Times New Roman, Times, serif"><i>We had a net loss of $1.</i><i><b>55</b></i><i> million for the year ended December 31, 201</i><i>6</i><i> and have an accumulated deficit of $</i><i>61.5</i><i> million as of December 31, 201</i><i>6</i><i>. To achieve sustainable profitability, we must generate increased revenue.</i></font></p> | | italic
Second row | <div style="line-height:174%;text-align:left;font-size:9pt;"><font style="font-family:inherit;font-size:9pt;font-style:italic;font-weight:bold;">We have a history of losses, and we cannot assure you that we will achieve profitability.</font></div> | bold | italic

As demonstrated in the above example, several of my HTML paragraphs have some text in some styles, and others not. E.g. my first row has two characters ("55") in bold, and the rest not, while the whole paragraph is italic - so if, say, at least 50% of the text of the HTML is in bold, I'd want to label the row as bold.

So, to achieve this desired output, I want to extract any text that is in bold, count its combined length (even if the bold parts are spread across different parts of the paragraph), divide by the total length of the paragraph, and if this number exceeds 0.5, flag that row as being in bold. So my questions are:

  1. How do I implement this in a dataframe setting? For a single string of html rather than a dataframe, the following code works:

html <- "some html here"
bold_parts <- read_html(html) %>% html_nodes("b, strong") %>% html_text()

So, applying this to my dataframe column, can someone please help me figure out how to modify the code below to extract any emboldened words to a new column called bold_words? From there, I can count the length of these bold words and divide it by the length of the raw column.

dataframe <- dataframe %>%
  rowwise() %>%
  mutate(
    bold_words = read_html(raw) %>%
      html_nodes("b, strong") %>%
      html_text()
  )

  2. Once this is working, it should be fine for styles defined by <b>, <strong>, <i>, <em>, and <u>. However, I am not sure how to go about applying this to HTML like that in row 2 - where instead of <b> or <i> or <u>, the appearance is determined by "font-style:italic", "text-decoration:underline" and "font-weight:bold". I could split it at these parts using regex, but I would rather parse the HTML.

  3. If anyone spots a better way of doing any of this, it'd be appreciated, even if it means using an entirely different approach.

Thank you

Solution

You can use an attribute selector with the *= (contains) operator to match elements whose style attribute contains the bold declaration, e.g. [style*="font-weight:bold"].
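
As a minimal sketch of what that selector picks up, here it is applied to the second row's HTML from the question. The variable names `html`, `bold_nodes` and `bold_text` are only illustrative.

```r
library(rvest)

# Second row from the question: the styling lives in the style attribute,
# not in <b> or <i> tags
html <- '<div style="line-height:174%;text-align:left;font-size:9pt;"><font style="font-family:inherit;font-size:9pt;font-style:italic;font-weight:bold;">We have a history of losses, and we cannot assure you that we will achieve profitability.</font></div>'

# [style*="..."] matches any element whose style attribute contains the substring;
# b and strong are kept in the list so tag-based bold is still caught
bold_nodes <- read_html(html) %>%
  html_nodes('b, strong, [style*="font-weight:bold"]')

bold_text <- html_text(bold_nodes)
bold_text
```

The substring match is literal, so it is sensitive to case and whitespace: a declaration written as "FONT-WEIGHT: bold" would not be matched by this selector.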

The code below defines a crude general-purpose function: you pass it a CSS selector and the text to write into the output column, and it returns that text when the matched nodes cover at least half of the fragment's text. The selectors for is_bold and is_italic are shown.

TODO: You probably want to add some error handling, e.g. in case of HTML parsing errors.

library(tidyverse)
library(rvest)

df <- data.frame(
  other= c("First Row", "Second Row"),
  raw =  c(
    '<p id="PARA339" style="TEXT-ALIGN: left; MARGIN: 0pt; LINE-HEIGHT: 1.25"><font style="FONT-SIZE: 10pt; FONT-FAMILY: Times New Roman, Times, serif"><i>We had a net loss of $1.</i><i><b>55</b></i><i> million for the year ended December 31, 201</i><i>6</i><i> and have an accumulated deficit of $</i><i>61.5</i><i> million as of December 31, 201</i><i>6</i><i>. To achieve sustainable profitability, we must generate increased revenue.</i></font></p>',
    '<div style="line-height:174%;text-align:left;font-size:9pt;"><font style="font-family:inherit;font-size:9pt;font-style:italic;font-weight:bold;">We have a history of losses, and we cannot assure you that we will achieve profitability.</font></div>'
  )
)

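# Return `return_text` if at least half of the text in the HTML fragment `i`
# sits inside nodes matching `css_selector`; otherwise return ''.
is_pattern <- function(i, css_selector, return_text) {
  page <- read_html(i)
  all_text <- nchar(page %>% html_text())
  pattern_text <- sum(nchar(page %>% html_nodes(css_selector) %>% html_text()))
  # nchar() returns a length-1 value here, so test for zero characters
  # (not zero length) to avoid dividing by zero
  flag <- all_text > 0 && (pattern_text / all_text) >= .5
  if (flag) return_text else ''
}

# vapply() yields a character vector; lapply() would create a list column
df$is_bold <- vapply(df$raw, is_pattern, character(1),
                     css_selector = 'b, strong, [style*="font-weight:bold"]',
                     return_text = 'bold', USE.NAMES = FALSE)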
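
One caveat with the style selectors used here: the first row in the question writes its style attribute in upper case with spaces after the colons, and a literal substring match would miss a declaration written that way. A possible workaround, sketched below, is to normalise the attribute inside an XPath expression before matching. The helper name `style_xpath` and the sample HTML are assumptions made up for illustration; they are not part of the original answer.

```r
library(rvest)

# Hypothetical helper: build an XPath that looks for a CSS declaration in the
# style attribute after lower-casing it and stripping spaces.
# translate() maps A-Z to a-z; the space has no counterpart in the third
# argument, so it is removed.
style_xpath <- function(declaration) {
  sprintf(
    "//*[contains(translate(@style, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ ', 'abcdefghijklmnopqrstuvwxyz'), '%s')]",
    declaration
  )
}

# Made-up fragment with two spellings of the same declaration
page <- read_html('<p><font style="FONT-WEIGHT: Bold">Loud</font> and <font style="font-weight:bold">quiet</font></p>')

bold_text <- html_text(html_nodes(page, xpath = style_xpath("font-weight:bold")))
bold_text
```

This only covers the style attribute; to also catch <b> and <strong> the XPath would need to be extended (for example with `| //b | //strong`), and numeric weights such as font-weight:700 are not handled.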


mutate example:

is_pattern <- Vectorize(is_pattern)

df <- df %>%
  mutate(
    is_bold = is_pattern(raw, 'b, strong, [style*="font-weight:bold"]', 'bold'),
    is_italic = is_pattern(raw, 'em, i, [style*="font-style:italic"]', 'italic'),
  )

I noted from an answer by @r2evans that I needed to Vectorize the function.
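
For the error handling mentioned in the TODO, one option is to wrap the parsing in tryCatch() so that a row that cannot be parsed yields NA instead of stopping the whole run. The sketch below is only one way to do it; `safe_share` is a hypothetical name, and it returns the matched share of text as a number rather than a label so that the threshold can be applied afterwards.

```r
library(rvest)

# Hypothetical wrapper: share of text (0 to 1) inside nodes matching
# `css_selector`, or NA if the HTML cannot be parsed or contains no text
safe_share <- function(html, css_selector) {
  tryCatch({
    page <- read_html(html)
    total <- nchar(html_text(page))
    matched <- sum(nchar(html_text(html_nodes(page, css_selector))))
    if (total == 0) NA_real_ else matched / total
  }, error = function(e) NA_real_)
}

# Made-up inputs: one parsable fragment and one empty string
shares <- vapply(c('<p><b>ab</b>cd</p>', ''), safe_share, numeric(1),
                 css_selector = 'b, strong', USE.NAMES = FALSE)
shares
```

A label column could then be derived with something like ifelse(!is.na(shares) & shares >= 0.5, 'bold', ''), which keeps unparsable rows blank.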

