Rvest reading separated article data


Problem Description


I am looking to scrape article data from inquirer.net.

This is a follow-up question to Scrape Data through RVest

Here is the code that works based on the answer:

library(rvest)
#> Loading required package: xml2
library(tibble)

# build the article-index URL for the chosen date
year  <- 2020
month <- 06
day   <- 13
url   <- paste0('http://www.inquirer.net/article-index?d=', year, '-', month, '-', day)

# grab the index container, then the article links and their post dates
div       <- read_html(url) %>% html_node(xpath = '//*[@id ="index-wrap"]')
links     <- html_nodes(div, xpath = '//a[@rel = "bookmark"]') 
post_date <- html_nodes(div, xpath = '//span[@class = "index-postdate"]') %>% 
             html_text()

# one row per article: post date, headline text, and link
test <- tibble(date = post_date,
               text = html_text(links),
               link = html_attr(links, "href"))

test
#> # A tibble: 261 x 3
#>    date     text                              link                              
#>    <chr>    <chr>                             <chr>                             
#>  1 1 day a~ ‘We can never let our guard down~ https://newsinfo.inquirer.net/129~
#>  2 1 day a~ PNP spox says mañanita remark di~ https://newsinfo.inquirer.net/129~
#>  3 1 day a~ After stranded mom’s death, Pasa~ https://newsinfo.inquirer.net/129~
#>  4 1 day a~ Putting up lining for bike lanes~ https://newsinfo.inquirer.net/129~
#>  5 1 day a~ PH Army provides accommodation f~ https://newsinfo.inquirer.net/129~
#>  6 1 day a~ DA: Local poultry production suf~ https://newsinfo.inquirer.net/129~
#>  7 1 day a~ IATF assessing proposed design t~ https://newsinfo.inquirer.net/129~
#>  8 1 day a~ PCSO lost ‘most likely’ P13B dur~ https://newsinfo.inquirer.net/129~
#>  9 2 days ~ DOH: No IATF recommendations yet~ https://newsinfo.inquirer.net/129~
#> 10 2 days ~ PH coronavirus cases exceed 25,0~ https://newsinfo.inquirer.net/129~
#> # ... with 251 more rows

I now want to add a new column to this output that contains the full article for each row. Before writing the for-loop, I was investigating the HTML of the first article: https://newsinfo.inquirer.net/1291178/pnp-spox-says-he-did-not-intend-to-put-sinas-in-bad-light

Digging into the HTML, I noticed it is not that clean. From my findings so far, the main article data falls under #article_content , p. So my output right now is split across multiple rows, and a lot of non-article data appears as well. Here is what I have currently:

# take the second article's link (row 2, column 3) and pull everything
# matching "#article_content , p"
article_data <- data.frame(test)
article_url  <- read_html(article_data[2, 3])
article      <- article_url %>%
  html_nodes("#article_content , p") %>%
  html_text()
View(article)

I'm OK with this being multiple rows because I can just union the final result. But since there are other non-article items, it will mess up what I am trying to do (sentiment analysis).

Can someone please assist on how to clean this data so that the full article is next to each article link?

I could simply union the results while excluding the first row and the last two rows, but I'm looking for a cleaner way because I want to do this for all the article data, not just this one article.
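
For reference, that crude workaround would look something like this (just a sketch, assuming the non-article rows always land in the same positions; article_clean is only an illustrative name):

# drop the first element and the last two, then collapse the rest into one string
n <- length(article)
article_clean <- paste(article[-c(1, n - 1, n)], collapse = " ")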

Solution

After a short look at the structure of the article page, I suggest using the CSS selector ".article_align div p".

library(rvest)
library(dplyr)

url <- "https://newsinfo.inquirer.net/1291178/pnp-spox-says-he-did-not-intend-to-put-sinas-in-bad-light"

read_html(url) %>% 
  html_nodes(".article_align div p") %>% 
  html_text()
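
If you then want to attach the full text to every row of test from the question, one possible extension (only a sketch, assuming every article page shares this layout; get_article is an illustrative helper name) is to scrape each link with the same selector and collapse its paragraphs into one string:

# scrape one article page and collapse its paragraphs into a single string
get_article <- function(link) {
  read_html(link) %>% 
    html_nodes(".article_align div p") %>% 
    html_text() %>% 
    paste(collapse = " ")
}

# add the full article text as a new column (this requests every URL, so it is slow)
test_with_text <- test %>% 
  mutate(article = vapply(link, get_article, character(1)))

purrr::map_chr() would work just as well here if you prefer the tidyverse iteration helpers over vapply().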
