In R: crawling with rvest, failing to get the text in an HTML tag using the html_text function


Question

library(rvest)
library(httr)

url <- "http://news.chosun.com/svc/content_view/content_view.html?contid=1999080570392"

hh <- read_html(GET(url), encoding = "EUC-KR")

# guess_encoding(hh)

html_text(html_node(hh, 'div.par'))
# html_text(html_nodes(hh, xpath = '//*[@id="news_body_id"]/div[2]/div[3]'))

I'm trying to crawl news data (just for practice) using rvest in R.

When I tried it on the page above, I failed to fetch the text from the page. (The XPath doesn't work either.)

I don't think I failed to find the node that contains the text I want on the page. But when I try to extract the text from that node using the html_text function, it comes back as "" or blank.

I can't figure out why; I don't have any experience with HTML or scraping.

My guess is that the HTML tag containing the news body has a "class" and a "data-dzo" attribute (I don't know what that is).

If anyone could tell me how to solve this, or give me search keywords I could use on Google to figure it out, I'd appreciate it.

Answer

The site builds quite a bit of the page dynamically. This should help.

The article content is in an XML file whose URL can be constructed from the contid parameter. Pass either a full article HTML URL (like the one in your example) or just the contid value to the function below, and it returns an xml2 xml_document with the parsed XML:

#' Retrieve article XML from chosun.com
#' 
#' @param full_url_or_article_id either a full URL like 
#'        `http://news.chosun.com/svc/content_view/content_view.html?contid=1999080570392`
#'        or just the id (e.g. `1999080570392`)
#' @return xml_document
read_chosun_article <- function(full_url_or_article_id) {

  require(rvest)
  require(httr)

  full_url_or_article_id <- full_url_or_article_id[1]

  if (grepl("^http", full_url_or_article_id)) {
    contid <- httr::parse_url(full_url_or_article_id)
    contid <- contid$query$contid
  } else {
    contid <- full_url_or_article_id
  }

  # The target article XML URLs are in the following format:
  #
  # http://news.chosun.com/priv/data/www/news/1999/08/05/1999080570392.xml
  #
  # so we need to construct it from substrings in the 'contid'

  sprintf(
    "http://news.chosun.com/priv/data/www/news/%s/%s/%s/%s.xml",
    substr(contid, 1, 4), # year
    substr(contid, 5, 6), # month
    substr(contid, 7, 8), # day
    contid
  ) -> contid_xml_url

  res <- httr::GET(contid_xml_url)

  httr::content(res)  

}

read_chosun_article("http://news.chosun.com/svc/content_view/content_view.html?contid=1999080570392")
## {xml_document}
## <content>
##  [1] <id>1999080570392</id>
##  [2] <site>\n  <id>1</id>\n  <name><![CDATA[www]]></name>\n</site>
##  [3] <category>\n  <id>3N1</id>\n  <name><![CDATA[사람들]]></name>\n  <path ...
##  [4] <type>0</type>
##  [5] <template>\n  <id>2006120400003</id>\n  <fileName>3N.tpl</fileName> ...
##  [6] <date>\n  <created>19990805192041</created>\n  <createdFormated>199 ...
##  [7] <editor>\n  <id>chosun</id>\n  <email><![CDATA[webmaster@chosun.com ...
##  [8] <source><![CDATA[0]]></source>
##  [9] <title><![CDATA[[동정] 이철승, 순국학생 위령제 지내 등]]></title>
## [10] <subTitle/>
## [11] <indexTitleList/>
## [12] <authorList/>
## [13] <masterId>1999080570392</masterId>
## [14] <keyContentId>1999080570392</keyContentId>
## [15] <imageList count="0"/>
## [16] <mediaList count="0"/>
## [17] <body count="1">\n  <page no="0">\n    <paragraph no="0">\n      <t ...
## [18] <copyright/>
## [19] <status><![CDATA[RL]]></status>
## [20] <commentBbs>N</commentBbs>
## ...

read_chosun_article("1999080570392")
## {xml_document}
## <content>
##  [1] <id>1999080570392</id>
##  [2] <site>\n  <id>1</id>\n  <name><![CDATA[www]]></name>\n</site>
##  [3] <category>\n  <id>3N1</id>\n  <name><![CDATA[사람들]]></name>\n  <path ...
##  [4] <type>0</type>
##  [5] <template>\n  <id>2006120400003</id>\n  <fileName>3N.tpl</fileName> ...
##  [6] <date>\n  <created>19990805192041</created>\n  <createdFormated>199 ...
##  [7] <editor>\n  <id>chosun</id>\n  <email><![CDATA[webmaster@chosun.com ...
##  [8] <source><![CDATA[0]]></source>
##  [9] <title><![CDATA[[동정] 이철승, 순국학생 위령제 지내 등]]></title>
## [10] <subTitle/>
## [11] <indexTitleList/>
## [12] <authorList/>
## [13] <masterId>1999080570392</masterId>
## [14] <keyContentId>1999080570392</keyContentId>
## [15] <imageList count="0"/>
## [16] <mediaList count="0"/>
## [17] <body count="1">\n  <page no="0">\n    <paragraph no="0">\n      <t ...
## [18] <copyright/>
## [19] <status><![CDATA[RL]]></status>
## [20] <commentBbs>N</commentBbs>
## ...
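To get at the article text itself (the asker's original goal), you can query the `<body>` node of the returned document with xml2. Note the node names beneath `<paragraph>` are truncated in the printed output above (`<t ...`), so the `text` element name used here is an assumption; the snippet below demonstrates the idea on a small inline XML sample that mirrors that assumed structure:

```r
library(xml2)

# Minimal XML mimicking the assumed <body>/<page>/<paragraph>/<text> layout
sample_doc <- read_xml(
  "<content><body count='1'><page no='0'>
     <paragraph no='0'><text>first paragraph</text></paragraph>
     <paragraph no='1'><text>second paragraph</text></paragraph>
   </page></body></content>"
)

# Collect every <text> node under <body> and join them into one string
body_text <- paste(
  xml_text(xml_find_all(sample_doc, ".//body//text")),
  collapse = "\n"
)

body_text
```

On a real document, the same `xml_find_all(doc, ".//body//text")` call against the result of `read_chosun_article()` should pull out the body paragraphs, provided the inner element really is named `text`; if not, inspect the `<body>` node with `xml_structure()` to find the right name.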

NOTE: I poked around that site to see whether this violates their terms of service and it does not seem to, but I also relied on Google Translate, which may have made that harder to verify. Make sure you can legally (and, if you care about ethics, ethically) scrape this content for whatever use you intend.

