Spreading a dataset on selected columns without any identifier
Question
I would like to spread a dataset using a few selected columns, none of which contains a unique identifier for the rows. To illustrate, I am using the publicly available iris dataset.
I tried removing the unwanted columns first, then keeping only the unique rows, and finally applying spread() to the result:
iris %>% select(-c(Sepal.Width, Petal.Length, Petal.Width)) %>% unique() %>%
spread(key=Species, value=Sepal.Length)
But this gives the following duplicate-identifiers error:
Error: Duplicate identifiers for rows (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15), (16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36), (37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57)
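The three row groups in the error message (1 to 15, 16 to 36, 37 to 57) correspond to the three species: after de-duplication, each species still has many rows, and nothing tells spread() which setosa value belongs on the same output row as which versicolor value. A minimal check, assuming dplyr is installed (the variable name counts is only illustrative):

```r
library(dplyr)

# Count the distinct Sepal.Length values that remain per species.
# Every species keeps more than one row, so Species alone cannot
# identify a row, which is why spread() reports duplicate identifiers.
counts <- iris %>%
  distinct(Species, Sepal.Length) %>%
  count(Species)

counts
# setosa 15, versicolor 21, virginica 21
```

These counts match the group sizes listed in the error message.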
Using row_number(), I created a unique identifier to use when spreading the data, which avoids the duplicate-identifiers error:
iris %>% select(-c(Sepal.Width, Petal.Length, Petal.Width)) %>% unique() %>%
mutate(row = row_number()) %>% spread(Species, Sepal.Length)
This gives the following output:
# row setosa versicolor virginica
# 1 1 5.1 NA NA
# 2 2 4.9 NA NA
# 3 3 4.7 NA NA
# ...
# 16 16 NA 7.0 NA
# 17 17 NA 6.4 NA
# 18 18 NA 6.9 NA
# ...
# 37 37 NA NA 6.3
# 38 38 NA NA 5.8
# 39 39 NA NA 7.1
However, because the row numbers run across all species, the result contains many NAs, which is not what I want. I tried to exclude the row column from the spread to get the expected values, but neither attempt worked:
iris %>% select(-c(Sepal.Width, Petal.Length, Petal.Width)) %>% unique() %>%
mutate(row = row_number()) %>% spread(Species, Sepal.Length, -row)
iris %>% select(-c(Sepal.Width, Petal.Length, Petal.Width)) %>% unique() %>%
mutate(row = row_number()) %>% spread(Species, Sepal.Length, -one_of(row))
Expected output:
tmp <- iris %>% select(-c(Sepal.Width, Petal.Length, Petal.Width)) %>% unique() %>%
mutate(row = row_number()) %>% spread(Species, Sepal.Length)
cbind(setosa=unique(tmp$setosa), versicolor=unique(tmp$versicolor), virginica=unique(tmp$virginica))
# setosa versicolor virginica
# [1,] 5.1 7.0 6.3
# [2,] 4.9 6.4 5.8
# [3,] 4.7 6.9 7.1
# [4,] 4.6 5.5 6.5
# [5,] 5.0 6.5 7.6
# [6,] 5.4 5.7 4.9
# [7,] 4.4 6.3 7.3
# [8,] 4.8 4.9 6.7
# [9,] 4.3 6.6 7.2
# [10,] 5.8 5.2 6.4
# [11,] 5.7 5.0 6.8
# [12,] 5.2 5.9 5.7
# [13,] 5.5 6.0 7.7
# [14,] 4.5 6.1 6.0
# [15,] 5.3 5.6 6.9
# [16,] 5.1 6.7 5.6
# [17,] 4.9 5.8 6.2
# [18,] 4.7 6.2 6.1
# [19,] 4.6 6.8 7.4
# [20,] 5.0 5.4 7.9
# [21,] 5.4 5.1 5.9
Recommended answer
library(dplyr)
library(tidyr)
as_tibble(iris) %>% # tbl_df() is deprecated in current dplyr; as_tibble() replaces it
select(Species, Sepal.Length) %>% # select columns of interest
group_by(Species) %>% # for each value
mutate(id = row_number()) %>% # create a row identifier
spread(Species, Sepal.Length) # reshape dataset
# # A tibble: 50 x 4
# id setosa versicolor virginica
# * <int> <dbl> <dbl> <dbl>
# 1 1 5.1 7.0 6.3
# 2 2 4.9 6.4 5.8
# 3 3 4.7 6.9 7.1
# 4 4 4.6 5.5 6.3
# 5 5 5.0 6.5 6.5
# 6 6 5.4 5.7 7.6
# 7 7 4.6 6.3 4.9
# 8 8 5.0 4.9 7.3
# 9 9 4.4 6.6 6.7
# 10 10 4.9 5.2 7.2
# # ... with 40 more rows
Be extra careful about how you create and use your row identifier. The code above relies purely on the current row order of the dataset. If you reorder the data, values from different species will be paired on different rows. Compare with the code below:
as_tibble(iris) %>% # tbl_df() is deprecated in current dplyr; as_tibble() replaces it
arrange(desc(Sepal.Length)) %>% # order your values descending
select(Species, Sepal.Length) %>% # select columns of interest
group_by(Species) %>% # for each value
mutate(id = row_number()) %>% # create a row identifier
spread(Species, Sepal.Length) # reshape dataset
# # A tibble: 50 x 4
# id setosa versicolor virginica
# * <int> <dbl> <dbl> <dbl>
# 1 1 5.8 7.0 7.9
# 2 2 5.7 6.9 7.7
# 3 3 5.7 6.8 7.7
# 4 4 5.5 6.7 7.7
# 5 5 5.5 6.7 7.7
# 6 6 5.4 6.7 7.6
# 7 7 5.4 6.6 7.4
# 8 8 5.4 6.6 7.3
# 9 9 5.4 6.5 7.2
# 10 10 5.4 6.4 7.2
# # ... with 40 more rows
The arrange(desc(Sepal.Length)) step, which is the only difference from before, ensures that the higher values end up in the top rows (descending order).
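The answer keeps all 50 rows per species, whereas the question asked for unique values only. A minimal sketch that combines the question's de-duplication with the per-species identifier from the answer is shown below. It swaps in pivot_wider() for spread(), which assumes tidyr 1.0.0 or later, and the variable name wide is only illustrative. Unlike the cbind() call in the question, which recycles the shorter setosa column, this leaves NA where a species has fewer unique values.

```r
library(dplyr)
library(tidyr)

wide <- iris %>%
  distinct(Species, Sepal.Length) %>%  # unique values per species, as in the question
  group_by(Species) %>%
  mutate(id = row_number()) %>%        # per-species row identifier, as in the answer
  ungroup() %>%
  pivot_wider(names_from = Species, values_from = Sepal.Length)

wide
# 21 rows: columns id, setosa, versicolor, virginica;
# setosa is NA for id 16 to 21 because it has only 15 unique values
```

As with the answer's code, the pairing of values across species depends on the row order at the point where id is created.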