Empty nodes when scraping links with rvest in R

Question

My goal is to get links to all Kaggle challenges along with their titles. I am using the rvest library for this, but I am not getting very far: the nodes are empty once I am a few divs deep.

I am trying to do it for the first challenge to begin with, and should then be able to apply the same approach to every other entry. The XPath of the first entry is:

/html/body/div[1]/div[2]/div/div/div[2]/div/div/div[2]/div[2]/div/div/div[2]/div/div/div[1]/a

My idea was to get the link via html_attr( , "href") once I am at the right tag.

Here is my attempt:

library(rvest)

url <- "https://www.kaggle.com/competitions"
kaggle_html <- read_html(url)
kaggle_text <- html_text(kaggle_html)
kaggle_node <- html_nodes(kaggle_html, xpath = "/html/body/div[1]/div[2]/div/div/div[2]/div/div/div[2]/div[2]/div/div/div[2]/div/div/div[1]/a")
html_attr(kaggle_node, "href")

I can't get past a certain div. The following snippet shows the last node I can access:

node <- html_nodes(kaggle_html, xpath="/html/body/div[1]/div[2]/div")
html_attrs(node)

Once I go one step further with html_nodes(kaggle_html, xpath = "/html/body/div[1]/div[2]/div/div"), the result is empty.

I think the issue is that Kaggle uses a "smart" list that loads more entries the further I scroll down.
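
One way to check this kind of suspicion is to look at what the parsed document actually contains. The sketch below does not contact Kaggle; it uses a made-up HTML string (the `site-content` id and `app.js` name are hypothetical) that mimics a page whose container div is only filled in later by a script. It shows that a selection below such a container is a node set of length zero, which is why html_attr() returns nothing.

```r
library(rvest)

# Hypothetical stand-in for a page whose list is filled in by JavaScript:
# the container div exists in the static HTML but has no children yet.
static_html <- read_html(
  "<html><body><div id='site-content'></div><script src='app.js'></script></body></html>"
)

# The container itself can be selected ...
container <- html_nodes(static_html, xpath = "//div[@id='site-content']")

# ... but nothing below it exists in the static HTML.
children <- html_nodes(container, xpath = "./*")
links <- html_attr(children, "href")

length(container)  # 1
length(children)   # 0
length(links)      # 0
```

If the same check on the real page shows an empty container next to script tags, that is consistent with the content being added after the initial HTML is delivered, although it does not prove it.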

(I am aware that I can use %>%. I am saving every step in its own variable so that I can inspect each one more easily and learn how it works.)

Recommended answer

I solved the issue. I think I cannot access the full HTML of the site from R because the table is loaded by a script that extends the table (and thus the HTML) as the user scrolls.

I resolved it by expanding the table manually, downloading the whole HTML page and loading the local file.
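
A minimal sketch of that workaround, assuming the page has already been saved from the browser. So that the example runs on its own, it writes a small made-up HTML file to a temporary path; the markup, the `list` id and the link targets are hypothetical and do not reflect Kaggle's real page structure. In practice `saved_page` would be the path of the downloaded file.

```r
library(rvest)

# Stand-in for the page saved from the browser after expanding the list.
# The content below is invented purely for illustration.
saved_page <- tempfile(fileext = ".html")
writeLines(c(
  "<html><body><div id='list'>",
  "<a href='/c/first-challenge'>First challenge</a>",
  "<a href='/c/second-challenge'>Second challenge</a>",
  "</div></body></html>"
), saved_page)

# read_html() accepts a local file path as well as a URL.
page <- read_html(saved_page)

# A relative XPath is less brittle than an absolute /html/body/div[1]/... path.
anchors <- html_nodes(page, xpath = "//div[@id='list']//a")

# Collect title and link for every matched anchor.
challenges <- data.frame(
  title = html_text(anchors, trim = TRUE),
  link  = html_attr(anchors, "href"),
  stringsAsFactors = FALSE
)
print(challenges)
```

Because the saved file already contains the expanded list, the same html_nodes() and html_attr() calls from the question work on it; only the selector needs to match the real markup.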
