Google News in R
Question
I am trying to get information from Google News. This is my code:
library(rvest)
library(tidyverse)
news <- function(term) {
  html_dat <- read_html(paste0("https://news.google.com/search?q=", term, "&hl=es-419&gl=US&ceid=US%3Aes-419"))
  dat <- data.frame(Link = html_dat %>%
                      html_nodes('.VDXfz') %>%
                      html_attr('href')) %>%
    mutate(Link = gsub("./articles/", "https://news.google.com/articles/", Link))
  news_dat <- data.frame(
    Title = html_dat %>%
      html_nodes('.DY5T1d') %>%
      html_text(),
    Link = dat$Link,
    Description = html_dat %>%
      html_nodes('.Rai5ob') %>%
      html_text()
  )
  return(news_dat)
}
noticias <- news("coronavirus")
With this code I retrieve the title, link, and description. But I need two more fields: the date and the media outlet. For example, if a news item about a coronavirus vaccine was published yesterday, the date should be yesterday's date; if the outlet is the New York Times, the media field should say so. However, I cannot find these nodes in the HTML. Any ideas on how to fix my code to add these two fields?
Thanks in advance.
Recommended answer
Maybe try this:
news <- function(term) {
  url <- paste0("https://news.google.com/search?q=", term, "&hl=es-419&gl=US&ceid=US:es-419")
  nodeset <- read_html(url) %>% html_nodes("article")
  tibble::tibble(
    Title = nodeset %>% html_nodes("h3") %>% html_text(),
    Link = nodeset %>% html_nodes("h3 > a") %>% html_attr("href") %>% xml2::url_absolute(url),
    Description = nodeset %>% html_nodes("div.Da10Tb.Rai5ob > span") %>% html_text(),
    Source = nodeset %>% html_nodes("div.QmrVtf.RD0gLb.kybdz > div > a") %>% html_text(),
    Time = nodeset %>% html_nodes("div.QmrVtf.RD0gLb.kybdz > div > time") %>% html_attr("datetime")
  )
}
Output
> news("coronavirus")
# A tibble: 100 x 5
Title Link Description Source Time
<chr> <chr> <chr> <chr> <chr>
1 India reporta 41.100 casos nuevos d~ https://news.google.com/articles/CBMikwFodHRw~ "NUEVA DELHI (AP) — India reportó el domingo 41.1~ La Voz ~ 2020-11-~
2 El ecuatoriano Diego Palacios, de L~ https://news.google.com/articles/CBMigwFodHRw~ "El defensa del LAFC, Diego Palacios, se encuentr~ ESPN De~ 2020-11-~
3 Coronavirus: Austria endurece medid~ https://news.google.com/articles/CAIiEL2L0sxq~ "El canciller Sebastian Kurz pidió a la población~ DW (Esp~ 2020-11-~
4 ++Coronavirus hoy: Gobierno alemán ~ https://news.google.com/articles/CAIiEKCZppoU~ "\"Todos los países que levantaron sus restriccio~ DW (Esp~ 2020-11-~
5 ++Coronavirus hoy++ México supera e~ https://news.google.com/articles/CAIiEK8ndryG~ "El COVID-19 se consolidó como la cuarta causa de~ DW (Esp~ 2020-11-~
6 Coronavirus en Estados Unidos: 5 ci~ https://news.google.com/articles/CAIiEFFHgJgZ~ "La incertidumbre política y la emergencia sanita~ BBC New~ 2020-11-~
7 México supera el millón de casos de~ https://news.google.com/articles/CBMiRWh0dHBz~ "México sobrepasó el millón de casos confirmados ~ Reuters~ 2020-11-~
8 Massachusetts reporta 2.800 casos d~ https://news.google.com/articles/CBMiXmh0dHA6~ "Los casos registrados en la más reciente jornada~ El Tiem~ 2020-11-~
9 ¿Qué hará NYC para resistir una seg~ https://news.google.com/articles/CBMifWh0dHBz~ "Reaccionan políticos locales a la orden de cerra~ NY1 Not~ 2020-11-~
10 + Coronavirus hoy: Italia suma 544 ~ https://news.google.com/articles/CAIiEJ4KB7k2~ "Argentina registró este sábado (14.11.2020) 8.46~ DW (Esp~ 2020-11-~
# ... with 90 more rows
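A side note on the search term: the function pastes `term` into the URL as-is, so a term with spaces or accented characters may produce a malformed request. A minimal sketch of one way to guard against that, using base R's `utils::URLencode()`. The helper name `build_news_url` is hypothetical and not part of the original answer:

```r
# Hypothetical helper (not in the original answer): builds the search URL
# with the term percent-encoded so spaces and reserved characters are safe.
build_news_url <- function(term) {
  paste0(
    "https://news.google.com/search?q=",
    utils::URLencode(term, reserved = TRUE),
    "&hl=es-419&gl=US&ceid=US:es-419"
  )
}

# A two-word search term; the space is encoded as %20.
build_news_url("vacuna coronavirus")
```

The resulting string could then be passed to `read_html()` in place of the `paste0()` call inside `news()`.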
Update
I had not considered cases such as the following:
- Nested articles.
- Missing datetime attributes.
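The second case can be reproduced on a small hand-written page. The sketch below uses made-up markup (not Google News' real HTML) to show why it matters: `html_nodes()` returns only the matches that exist, so columns can end up with different lengths, whereas `html_node()` returns one result per input node and yields `NA` where nothing matches.

```r
library(rvest)

# Made-up markup for illustration: the second article has no <time> element.
page <- read_html('<html><body>
  <article><h3>First</h3><time datetime="2020-11-15T10:00:00Z">ayer</time></article>
  <article><h3>Second</h3></article>
</body></html>')
articles <- html_nodes(page, "article")

# html_nodes() collects every match across the set, so missing elements
# are silently skipped and this vector is shorter than `articles`.
times_all <- articles %>% html_nodes("time") %>% html_attr("datetime")

# html_node() returns exactly one result per article, NA where missing,
# so this vector lines up row by row with the articles.
times_each <- articles %>% html_node("time") %>% html_attr("datetime")

length(articles)    # 2
length(times_all)   # 1
times_each          # "2020-11-15T10:00:00Z" NA
```

With unequal lengths such as 2 and 1 here, `tibble::tibble()` cannot line the columns up by article, which is why extracting one value per article is the safer approach.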
I have updated the code to handle those cases, at the cost of making it much less efficient. Anyway, try this:
news <- function(term) {
  url <- paste0("https://news.google.com/search?q=", term, "&hl=es-419&gl=US&ceid=US:es-419")
  nodeset <- read_html(url) %>% html_nodes("article")
  dplyr::bind_rows(lapply(nodeset, function(x) tibble::tibble(
    Title = x %>% html_node(".ipQwMb.ekueJc.RD0gLb") %>% html_text(),
    Link = x %>% html_node(".ipQwMb.ekueJc.RD0gLb > a") %>% html_attr("href") %>% xml2::url_absolute(url),
    Description = x %>% html_node("div.Da10Tb.Rai5ob > span") %>% html_text(),
    Source = x %>% html_node("div.QmrVtf.RD0gLb.kybdz > div > a") %>% html_text(),
    Time = x %>% html_node("div.QmrVtf.RD0gLb.kybdz > div > time") %>% html_attr("datetime")
  )))
}
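Since the question asks for a date, the `Time` column (a character vector) may need converting. The sketch below assumes the `datetime` attribute is an ISO 8601 UTC timestamp such as `2020-11-15T08:23:00Z`; the output shown earlier is truncated, so that format is an assumption and the `format` string may need adjusting. The sample values are invented for illustration.

```r
# Invented sample values in the assumed format; NA stands for an article
# whose <time> element or datetime attribute is missing.
time_chr <- c("2020-11-15T08:23:00Z", NA)

# Parse as UTC date-times; NA inputs stay NA instead of raising an error.
parsed <- as.POSIXct(time_chr, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

# Keep only the calendar date (in UTC) if the time of day is not needed.
published <- as.Date(parsed)
published
```

Applied to the function's result, this could look like `dplyr::mutate(news("coronavirus"), Date = as.Date(as.POSIXct(Time, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")))`, again under the same format assumption.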