How to extract text from several "div class" elements (HTML) using R?


Problem description

My goal is to extract information from this HTML page to build a database:
https://drive.google.com/folderview?id=0B0aGd85uKFDyOS1XTTc2QnNjRmc&usp=sharing

One of the variables is the price of the apartments. Some listings contain a div class="row_price" element that holds the price (Example A), while others lack that element and therefore have no price (Example B). I would like R to read the listings without a price as NA; otherwise the data get misaligned, because each missing price is filled with the price of the listing that follows.

Example A

<div class="listing_column listing_row_price">
    <div class="row_price">
      $ 14,800
    </div>
<div class="row_info">Ayer&nbsp;19:53</div>



Example B

<div class="listing_column listing_row_price">

<div class="row_info">Ayer&nbsp;19:50</div>

I think that if I extract the text between "listing_row_price" and the beginning of "row_info" into a character vector, I will be able to get my desired output, which is:

...
10 4000
11 14800
12 NA
13 14000
14 8000
...

But so far I have only gotten the result below, and another one that is entirely NA.

...
10 4000
11 14800
12 14000
13 8000
14 8500
...

The commands I used, which did not give me what I want:

    library(rvest)    # read_html(), html_nodes()
    library(stringr)  # str_extract_all(), str_extract()

    html1<-read_html("file.html")
    title<-html_nodes(html1,"div")
    html1<-toString(title)
    pattern1<-'div class="row_price">([^<]*)<'
    title3<-unlist(str_extract_all(title,pattern1))
    title3<-title3[c(1:35)]
    pattern2<-'>\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t([^<*]*)'
    title3<-unlist(str_extract(title3,pattern2))
    title3<-gsub(">\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t $ ","",title3,fixed=TRUE)
    title3<-as.data.frame(as.numeric(gsub(",","", title3,fixed=TRUE)))

I also tried pattern1<-'listing_row_price">([<div class="row_price">]?)([^<]*)< which I think says: match the "listing_row_price" part, then match the "row_price" part if it exists, then capture the digits, and finally match the < that follows.

Recommended answer

There are lots of ways to do this, and which one works best depends on how consistent the HTML is. A reasonably simple strategy that works in this case:

library(rvest)

page <- read_html('page.html')

# find all nodes with a class of "listing_row_price"
listings <- html_nodes(page, css = '.listing_row_price')

# for each listing, if it has two children get the text of the first, else return NA
prices <- sapply(listings, function(x){ifelse(length(html_children(x)) == 2, 
                                              html_text(html_children(x)[1]), 
                                              NA)})
# replace everything that's not a number with nothing, and turn it into an integer
prices <- as.integer(gsub('[^0-9]', '', prices))
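
If counting children feels fragile (for example, if some listings have extra child nodes), a possible variation is to ask each listing directly for its ".row_price" node. This is only a sketch: the inline HTML below is a hypothetical stand-in for the real page, and the third listing is made up for illustration. It relies on html_node() (singular) returning one result per input node, with a missing node where nothing matches, which html_text() then turns into NA.

```r
library(rvest)

# Hypothetical sample mirroring Examples A and B from the question;
# the third listing is invented for illustration.
sample_html <- '
<div class="listing_column listing_row_price">
  <div class="row_price">$ 14,800</div>
  <div class="row_info">Ayer&nbsp;19:53</div>
</div>
<div class="listing_column listing_row_price">
  <div class="row_info">Ayer&nbsp;19:50</div>
</div>
<div class="listing_column listing_row_price">
  <div class="row_price">$ 14,000</div>
  <div class="row_info">Ayer&nbsp;19:45</div>
</div>'

page <- read_html(sample_html)

# one node per listing
listings <- html_nodes(page, css = '.listing_row_price')

# html_node() keeps the result aligned with `listings`:
# listings without a .row_price child yield a missing node, whose text is NA
price_text <- html_text(html_node(listings, css = '.row_price'))

# strip everything that is not a digit and convert; NA stays NA
prices <- as.integer(gsub('[^0-9]', '', price_text))
prices
```

Because the result has the same length and order as listings, it can be placed next to other per-listing columns without the prices shifting up.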
