Cleaning Data Scraped from Web


Question

I'm fairly new to R and have been working on a project (just for fun) to help me learn, and I've run into something I can't find answers for online. I am trying to teach myself to scrape websites for data, and I've started with the code below, which retrieves some data from 247Sports.

library(rvest)
library(stringr)

link <- "https://247sports.com/college/iowa-state/Season/2017-Football/Commits?sortby=rank"

link.scrap <- read_html(link)
data <- 
  html_nodes(x   = link.scrap, 
             css = '#page-content > div.main-div.clearfix > section.list-page > section > section > ul.content-list.ri-list > li:nth-child(3)') %>%
  html_text(trim = TRUE) %>% 
  trimws()

When I view the data, it appears to be a vector of length 1, with multiple list items stored as one value. The problem I'm running into is separating these into their respective columns. For example, I think the code below should split the data at ")" and then remove the whitespace from both resulting values, but I get a weird result.

f<-strsplit(data,")")
str_trim(f)
[1] "c(\"Ray Lima  El Camino College (Torrance, CA\", \"         DT 6-3 310    0.8681      39 4 9       Enrolled   1/9/2017\")"
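A likely explanation for the odd output, sketched under the assumption that `data` holds a single string like the one shown: `strsplit()` returns a list, and `str_trim()` coerces that list to character, which deparses the whole vector into one string. Extracting the list element before trimming avoids this. The sample string `x` is reassembled from the output above, and `fixed = TRUE` is an addition so that ")" is matched literally rather than as a regular expression.

```r
library(stringr)

# Sample string reassembled from the output shown in the question.
x <- "Ray Lima  El Camino College (Torrance, CA)         DT 6-3 310    0.8681      39 4 9       Enrolled   1/9/2017"

# strsplit() returns a list with one element per input string,
# so take the first element to get a plain character vector.
parts <- strsplit(x, ")", fixed = TRUE)[[1]]

# Now str_trim() works element by element instead of on a deparsed list.
parts <- str_trim(parts)
print(parts)
```

This yields two separate strings, but it still leaves the position, height, weight, and status packed together, so splitting on ")" alone is only a first step.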

I have tried a few other things, but with no success. So I guess my question is: what would be the best way to take data from this HTML list and get it into a format where every data point has its own column (i.e. name, college, position, stats, etc.)?

Recommended answer

I've modified a couple of things in your code:

  • Used generic, class-based CSS selectors instead of a positional one, so the code extracts every row rather than a single list item.

  • Collected the individual columns as vectors and then built a data frame from them.
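To illustrate the difference between the two kinds of selector without hitting the network, here is a minimal sketch on a made-up HTML fragment. The markup, the player names, and the variable names are hypothetical; only the class names `ri-list`, `name`, and `metrics-list` are borrowed from the selectors used in this thread.

```r
library(rvest)

# Hypothetical fragment loosely imitating the structure of the list page.
html <- read_html('<ul class="ri-list">
  <li><div class="name">Player A</div><ul class="metrics-list"><li>DT</li></ul></li>
  <li><div class="name">Player B</div><ul class="metrics-list"><li>OLB</li></ul></li>
</ul>')

# A positional selector matches exactly one list item, so all of that
# item's fields come back mashed together in a single string.
one_row <- html_text(html_nodes(html, "ul.ri-list > li:nth-child(2)"))

# A class-based selector matches one node per row, giving a vector
# that can serve directly as a data frame column.
player_names <- html_text(html_nodes(html, "div.name"))
print(player_names)
```

The positional selector returns a vector of length 1, which matches what the question describes, while the class-based one returns one element per player.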

Please check:

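library(rvest)
library(stringr)
library(tidyr)

link <- "https://247sports.com/college/iowa-state/Season/2017-Football/Commits?sortby=rank"

# Download and parse the page once
link.scrap <- read_html(link)

# Class-based selectors return one element per row of the list
names <- link.scrap %>% html_nodes('div.name') %>% html_text()

pos <- link.scrap %>% html_nodes('ul.metrics-list') %>% html_text()

status <- link.scrap %>% html_nodes('div.right-content.right') %>% html_text()

# Combine the vectors (which must be of equal length) into a data frame
data <- data.frame(names, pos, status, stringsAsFactors = FALSE)

# Drop the first row (presumably the list's header row, not a player)
data <- data[-1, ]

head(data)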


> head(data)
                                                      names          pos                     status
2        Kamilo Tongamoa  Merced College (Merced, CA)        DT 6-5 320     Enrolled   8/24/2017   
3        Ray Lima  El Camino College (Torrance, CA)          DT 6-3 310      Enrolled   1/9/2017   
4  O'Rien Vance  George Washington (Cedar Rapids, IA)       OLB 6-3 235     Enrolled   6/12/2017   
5          Matt Leo  Arizona Western College (Yuma, AZ)     WDE 6-7 265     Enrolled   2/22/2017   
6            Keontae Jones  Colerain (Cincinnati, OH)         S 6-1 175     Enrolled   6/12/2017   
7      Cordarrius Bailey  Clarksdale (Clarksdale, MS)       WDE 6-4 210     Enrolled   6/12/2017   
> 
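The data frame above still packs several values into each column. One possible follow-up, sketched here on two rows copied from the output above rather than on the live page, is to split them with `tidyr::separate()`. The new column names (`name`, `school`, `position`, `height`, `weight`, `date`) are my own choices, and the sketch assumes that the player name and school are separated by at least two whitespace characters, as they appear to be in the printed output.

```r
library(stringr)
library(tidyr)

# Two rows copied from the output above, standing in for the scraped data.
data <- data.frame(
  names  = c(" Kamilo Tongamoa  Merced College (Merced, CA) ",
             " Ray Lima  El Camino College (Torrance, CA) "),
  pos    = c(" DT 6-5 320 ", " DT 6-3 310 "),
  status = c(" Enrolled   8/24/2017 ", " Enrolled   1/9/2017 "),
  stringsAsFactors = FALSE
)

# Split name and school on a run of two or more whitespace characters
# (assumption: single spaces only occur inside a name or a school).
data$names <- str_trim(data$names)
clean <- separate(data, names, into = c("name", "school"), sep = "\\s{2,}")

# Collapse repeated whitespace and trim both ends in every column.
clean[] <- lapply(clean, str_squish)

# The remaining combined columns can now be split on single spaces.
clean <- separate(clean, pos, into = c("position", "height", "weight"), sep = " ")
clean <- separate(clean, status, into = c("status", "date"), sep = " ")

print(clean)
```

All columns remain character vectors here; `weight` and `date` could be converted afterwards with `as.integer()` and `as.Date(..., format = "%m/%d/%Y")` if needed.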
