How do you scrape items together so you don't lose the index?

Problem description

I am doing some basic web scraping with rvest and am getting results back, but the data do not line up with each other. That is, I am getting the items, but they come back out of order relative to the site, so the two data elements I am scraping can't be joined in a data.frame.

library(rvest)
library(tidyverse)

base_url<- "https://www.uchealth.com/providers"
loc <- read_html(base_url) %>%
  html_nodes('[class=locations]') %>%
  html_text() 
dept <- read_html(base_url) %>%
  html_nodes('[class=department last]') %>%
  html_text()

I was expecting to be able to create a data frame with these columns:

Location  Department

Any suggestions? I was wondering if there is an index that would keep these items together, but I didn't see anything.

I also tried this, without any luck. It seems the location is getting an erroneous starting value:

scraping <- function(

base_url = "https://www.uchealth.com/providers"
)
{
loc <- read_html(base_url) %>%
  html_nodes('[class=locations]') %>%
  html_text() 

dept <- read_html(base_url) %>%
  html_nodes('[class=specialties]') %>%
  html_text()

data.frame(
  loc = ifelse(length(loc)==0, NA, loc),
  dept = ifelse(length(dept)==0, NA, loc), 
  stringsAsFactors=F
)

}

Recommended answer

The problem you are facing is that not every child node is present in every parent node. When you call html_nodes on the whole page, parents that lack a given child are silently skipped, so the two result vectors end up with different lengths and no longer line up. The best way to handle this is to collect all parent nodes in a list/vector and then extract the desired information from each parent using the html_node function. html_node always returns one result for every parent node, even if that result is NA.

library(rvest)

# read the page just once
base_url <- "https://www.uchealth.com/providers"
page <- read_html(base_url)

# collect the parent node for each provider
providers <- page %>% html_nodes('ul[id=providerlist]') %>% html_children()

# extract the requested information from each parent (NA where the child is missing)
dept <- providers %>% html_node("[class ^= 'department']") %>% html_text()
location <- providers %>% html_node('[class=locations]') %>% html_text()

The lengths of providers, dept and location should all be equal.
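
The difference between html_nodes on the whole page and html_node per parent can be checked offline with a small sketch. The HTML string below is hypothetical markup invented for illustration (it is not copied from the uchealth.com page), and the second provider deliberately has no locations element. The selectors assume the same id and class names used above.

```r
library(rvest)

# Hypothetical markup standing in for the provider list (an assumption,
# not the real page); the second provider has no "locations" element.
html <- '<ul id="providerlist">
  <li><span class="department last">Cardiology</span><span class="locations">Denver</span></li>
  <li><span class="department last">Oncology</span></li>
  <li><span class="department last">Neurology</span><span class="locations">Aurora</span></li>
</ul>'
page <- read_html(html)

# Page-wide html_nodes() skips the provider without a location,
# so this vector is shorter than the number of providers.
loc_flat <- page %>% html_nodes("[class=locations]") %>% html_text()

# One parent node per provider.
providers <- page %>% html_nodes("ul[id=providerlist]") %>% html_children()

# html_node() yields exactly one result per parent, NA where the child is missing.
dept <- providers %>% html_node("[class ^= 'department']") %>% html_text()
location <- providers %>% html_node("[class=locations]") %>% html_text()

# Equal-length vectors can now be combined row by row.
result <- data.frame(Location = location, Department = dept, stringsAsFactors = FALSE)
print(result)
```

Because location and dept are both derived from the same providers vector, row i of the data frame always describes provider i, which is the index the question was looking for.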
