How to extract text from several "div class" elements (HTML) using R?

Views: 333 Published: 2020/10/12 2:48:53 Tags: html css r regex rvest

Problem description

My goal is to extract info from this html page to create a database: https://drive.google.com/folderview?id=0B0aGd85uKFDyOS1XTTc2QnNjRmc&usp=sharing

One of the variables is the price of the apartments. I've identified that some listings have the div class="row_price" element, which contains the price (example A), while others lack this element and therefore have no price (example B). Hence I would like R to read the observations without a price as NA; otherwise the database gets misaligned, because a listing with no price is given the price of the observation that follows.

Example A

<div class="listing_column listing_row_price">
    <div class="row_price">
      $ 14,800
    </div>
<div class="row_info">Ayer&nbsp;19:53</div>

Example B

<div class="listing_column listing_row_price">

<div class="row_info">Ayer&nbsp;19:50</div>

I think that if I extract the text from "listing_row_price" to the beginning of the "row_info" in a character vector I will be able to get my desired output, which is:

But so far I've got this one and another full of NA.

Commands I used, which didn't give me what I want:

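    library(rvest)    # read_html(), html_nodes()
    library(stringr)  # str_extract(), str_extract_all()

    html1 <- read_html("file.html")
    title <- html_nodes(html1, "div")
    html1 <- toString(title)
    # capture the text that directly follows an opening row_price div
    pattern1 <- 'div class="row_price">([^<]*)<'
    title3 <- unlist(str_extract_all(title, pattern1))
    title3 <- title3[c(1:35)]
    # keep the part after the leading newline/tab run
    pattern2 <- '>\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t([^<*]*)'
    title3 <- unlist(str_extract(title3, pattern2))
    # strip the whitespace prefix and "$", then the thousands separators
    title3 <- gsub(">\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t $ ", "", title3, fixed = TRUE)
    title3 <- as.data.frame(as.numeric(gsub(",", "", title3, fixed = TRUE)))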

I also tried pattern1<-'listing_row_price">([<div class="row_price">]?)([^<]*)<', which I think says: match the "listing_row_price" part, then match the "row_price" part if it exists, then capture the digits, and finally match the < that follows.
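A side note on why that second pattern cannot work as described: square brackets define a character class, so `[<div class="row_price">]?` matches at most one character from that set, not the optional tag as a whole. A minimal base-R sketch of the intended idea, assuming an optional non-capturing group `(?: ... )?` and two strings copied from examples A and B (the helper name `extract_price` is hypothetical):

```r
# Strings mirroring example A (with price) and example B (without price)
with_price <- paste0(
  '<div class="listing_column listing_row_price">\n',
  '    <div class="row_price">\n      $ 14,800\n    </div>\n',
  '<div class="row_info">Ayer&nbsp;19:53</div>'
)
without_price <- paste0(
  '<div class="listing_column listing_row_price">\n\n',
  '<div class="row_info">Ayer&nbsp;19:50</div>'
)

# Optional non-capturing group for the whole row_price tag,
# then capture everything up to the next "<"
pattern <- 'listing_row_price">\\s*(?:<div class="row_price">)?([^<]*)<'

# Hypothetical helper: returns the price as an integer, or NA if there is none
extract_price <- function(html) {
  m <- regmatches(html, regexec(pattern, html, perl = TRUE))[[1]]
  digits <- gsub("[^0-9]", "", m[2])
  if (nzchar(digits)) as.integer(digits) else NA_integer_
}

extract_price(with_price)     # 14800
extract_price(without_price)  # NA
```

This only illustrates the regex mechanics; parsing the HTML with a proper parser is usually less fragile than matching tags with regular expressions.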

Recommended answer

There are lots of ways to do this, and depending on how consistent the HTML is, one may be better than another. A reasonably simple strategy that works in this case, though:

library(rvest)

page <- read_html('page.html')

# find all nodes with a class of "listing_row_price"
listings <- html_nodes(page, css = '.listing_row_price')

# for each listing, if it has two children get the text of the first, else return NA
prices <- sapply(listings, function(x){ifelse(length(html_children(x)) == 2, 
                                              html_text(html_children(x)[1]), 
                                              NA)})
# replace everything that's not a number with nothing, and turn it into an integer
prices <- as.integer(gsub('[^0-9]', '', prices))
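A variation on the same idea that does not rely on counting children, offered as a sketch and not as part of the original answer: calling `html_node()` (singular) on the set of listings returns one result per listing, with a missing node where nothing matches, and `html_text()` turns a missing node into NA. The inline HTML below is made up to mirror examples A and B; it is not the real page.

```r
library(rvest)

# Inline HTML standing in for the real page:
# two listings, the second one without a row_price div
page <- read_html('
<div class="listing_column listing_row_price">
  <div class="row_price">$ 14,800</div>
  <div class="row_info">Ayer 19:53</div>
</div>
<div class="listing_column listing_row_price">
  <div class="row_info">Ayer 19:50</div>
</div>')

# one node per listing
listings <- html_nodes(page, css = ".listing_row_price")

# html_node() keeps the same length as listings; listings with no
# .row_price child yield a missing node, whose text is NA
price_text <- html_text(html_node(listings, css = ".row_price"))

# drop everything that is not a digit and convert to integer
prices <- as.integer(gsub("[^0-9]", "", price_text))
prices  # 14800 NA
```

Because the result has exactly one entry per listing, it can be bound to other per-listing columns without the prices shifting up to fill the gaps.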
