Creating a dataframe with text from a website

Question

I've been asked to create a data frame in R using information copied from a website; the data is not contained in a file. The full data list is at:

https://www.npr.org/2012/12/07/166400760/hollywood-heights-the-ups-downs-and-in-betweens

Here is part of the data:

Leading Men (Average American male: 5 feet 9.5 inches)

Dolph Lundgren — 6 feet 5 inches
John Cleese — 6 feet 5 inches
Michael Clarke Duncan — 6 feet 5 inches
Vince Vaughn — 6 feet 5 inches
Clint Eastwood — 6 feet 4 inches
Jimmy Stewart — 6 feet 3 inches
Bill Murray — 6 feet 1.5 inches

Leading Ladies (Average American female: 5 feet 4 inches)

Uma Thurman — 6 feet 0 inches
Brooke Shields — 6 feet 0 inches
Jane Lynch — 6 feet 0 inches

I am supposed to use R to create the data frame, where the first column is Name, the second is Height (in cm), and the third is Gender.

I have copied and pasted all the data into Notepad, manually made three different columns, and converted the heights to cm by hand. But that amounts to creating the data frame manually.

Is there a way to make a data frame in R using the data as given?

Recommended answer

You can copy the whole list and then use read.delim to bring the text on your clipboard into R. Then, using a regular expression, you can extract the gender from the header of each section, fill it down to the rows below, and separate the first column into name and height. See below:

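# Read the copied text from the clipboard, one line per row in column V1.
# On macOS, read from pipe("pbpaste") instead of "clipboard".
web.lines <- read.delim("clipboard", header = FALSE, stringsAsFactors = FALSE)

library(tidyverse)

web.lines %>% 
  mutate(gender = str_extract(V1, "Leading\\s+\\b(\\w+)\\b")) %>% # extract "Leading Men" / "Leading Ladies" from the header rows
  fill(gender, .direction = "down") %>% # carry the section label down to the rows below it
  group_by(gender) %>% 
  slice(-1) %>% # drop the first row of each group, i.e. the header itself
  separate(V1, into = c("Name", "Height"), sep = " — ") # split each line into name and height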


#> # A tibble: 59 x 3
#> # Groups:   gender [2]
#>    Name                  Height             gender        
#>    <chr>                 <chr>              <chr>         
#> 1  Uma Thurman           6 feet 0 inches    Leading Ladies
#> 2  Brooke Shields        6 feet 0 inches    Leading Ladies
#> 3  Jane Lynch            6 feet 0 inches    Leading Ladies
#> 4  Nicole Kidman         5 feet 11 inches   Leading Ladies
#> 5  Tilda Swinton         5 feet 10.5 inches Leading Ladies
#> ...
#> 28 Dolph Lundgren        6 feet 5 inches    Leading Men   
#> 29 John Cleese           6 feet 5 inches    Leading Men   
#> 30 Michael Clarke Duncan 6 feet 5 inches    Leading Men   
#> 31 Vince Vaughn          6 feet 5 inches    Leading Men   
#> 32 Clint Eastwood        6 feet 4 inches    Leading Men  
#> ...
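
The result above still holds Height as text and the section label ("Leading Men", "Leading Ladies") rather than a gender, while the question asks for height in cm and a Gender column. The following is a minimal sketch of one way to finish the job, not part of the original answer. It uses base R only so that it runs without extra packages, works on a few sample lines from the question instead of the clipboard, and assumes every person line has the form "Name — F feet I inches". The variable names and the "Male"/"Female" labels are illustrative choices.

```r
# Sample lines from the question; "\u2014" is the dash that separates
# name and height on the page. The full list has more rows.
lines <- c(
  "Leading Men (Average American male: 5 feet 9.5 inches)",
  "Dolph Lundgren \u2014 6 feet 5 inches",
  "Bill Murray \u2014 6 feet 1.5 inches",
  "Leading Ladies (Average American female: 5 feet 4 inches)",
  "Uma Thurman \u2014 6 feet 0 inches"
)

# Header lines start with "Leading"; every other line is a person.
is_header <- grepl("^Leading", lines)

# cumsum() numbers the sections, so each line picks up the header above it.
section <- lines[is_header][cumsum(is_header)]
gender <- ifelse(grepl("Ladies", section), "Female", "Male")

# Split the person lines into name and height text.
people <- lines[!is_header]
parts <- strsplit(people, " \u2014 ", fixed = TRUE)
name <- vapply(parts, `[`, character(1), 1)
height <- vapply(parts, `[`, character(1), 2)

# Pull the two numbers out of e.g. "6 feet 1.5 inches" and convert to cm
# (1 foot = 12 inches, 1 inch = 2.54 cm).
feet <- as.numeric(sub("^([0-9]+) feet.*$", "\\1", height))
inches <- as.numeric(sub("^.* feet ([0-9.]+) inches$", "\\1", height))
height_cm <- round((feet * 12 + inches) * 2.54, 1)

heights <- data.frame(
  Name = name,
  Height = height_cm,
  Gender = gender[!is_header],
  stringsAsFactors = FALSE
)
print(heights)
```

To run this on the real data, replace the sample vector with the lines read from the clipboard, for example the V1 column of web.lines converted with as.character().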
