Spreading the dataset on selected columns without any identifier

Question

I would like to spread a dataset using a few selected columns, none of which uniquely identifies the rows. For this, I am using the publicly available iris dataset.

I tried removing the unwanted columns first, then dropping duplicate rows with unique(), and finally applying spread() to the result.

iris %>% select(-c(Sepal.Width, Petal.Length, Petal.Width)) %>% unique() %>% 
  spread(Species, Sepal.Length)
iris %>% select(-c(Sepal.Width, Petal.Length, Petal.Width)) %>% unique() %>% 
  spread(key=Species, value=Sepal.Length)

But it gives the duplicate identifiers error below:

Error: Duplicate identifiers for rows (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15), (16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36), (37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57)
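A quick way to see where the duplicates come from is to count the rows left per species after de-duplication. This is only a diagnostic sketch, not part of the original post, and `dup_check` is an illustrative name. Each species still has many distinct Sepal.Length values, and no remaining column tells spread() which of them belong on the same output row.

```r
library(dplyr)

# Count how many distinct Sepal.Length values remain for each species.
dup_check <- iris %>%
  select(Species, Sepal.Length) %>%
  distinct() %>%
  count(Species)

# The three groups of row numbers in the error message (15, 21 and 21 rows)
# correspond to these per-species counts.
print(dup_check)
```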

Using row_number(), I created a unique identifier to use while spreading the data, which avoids the duplicate identifiers error.

iris %>% select(-c(Sepal.Width, Petal.Length, Petal.Width)) %>% unique() %>%
  mutate(row = row_number()) %>% spread(Species, Sepal.Length)

This gives the following output:

#    row setosa versicolor virginica
# 1    1    5.1         NA        NA
# 2    2    4.9         NA        NA
# 3    3    4.7         NA        NA
# ...
# 16  16     NA        7.0        NA
# 17  17     NA        6.4        NA
# 18  18     NA        6.9        NA
# ...
# 37  37     NA         NA       6.3
# 38  38     NA         NA       5.8
# 39  39     NA         NA       7.1

However, because of the row numbers, the result contains many NAs, which is not what I expected. I tried to exclude the row number so as to get the values as expected, but neither attempt worked.

iris %>% select(-c(Sepal.Width, Petal.Length, Petal.Width)) %>% unique() %>% 
  mutate(row = row_number()) %>%  spread(Species, Sepal.Length, -row)

iris %>% select(-c(Sepal.Width, Petal.Length, Petal.Width)) %>% unique() %>% 
  mutate(row = row_number()) %>%  spread(Species, Sepal.Length, -one_of(row))

Expected output:

tmp <- iris %>% select(-c(Sepal.Width, Petal.Length, Petal.Width)) %>% unique() %>%
  mutate(row = row_number()) %>% spread(Species, Sepal.Length)

cbind(setosa=unique(tmp$setosa), versicolor=unique(tmp$versicolor), virginica=unique(tmp$virginica))
#       setosa versicolor virginica
#  [1,]    5.1        7.0       6.3
#  [2,]    4.9        6.4       5.8
#  [3,]    4.7        6.9       7.1
#  [4,]    4.6        5.5       6.5
#  [5,]    5.0        6.5       7.6
#  [6,]    5.4        5.7       4.9
#  [7,]    4.4        6.3       7.3
#  [8,]    4.8        4.9       6.7
#  [9,]    4.3        6.6       7.2
# [10,]    5.8        5.2       6.4
# [11,]    5.7        5.0       6.8
# [12,]    5.2        5.9       5.7
# [13,]    5.5        6.0       7.7
# [14,]    4.5        6.1       6.0
# [15,]    5.3        5.6       6.9
# [16,]    5.1        6.7       5.6
# [17,]    4.9        5.8       6.2
# [18,]    4.7        6.2       6.1
# [19,]    4.6        6.8       7.4
# [20,]    5.0        5.4       7.9
# [21,]    5.4        5.1       5.9

Recommended answer

library(dplyr)
library(tidyr)

as_tibble(iris) %>%                       # tbl_df() is deprecated in current dplyr
  select(Species, Sepal.Length) %>%       # select columns of interest
  group_by(Species) %>%                   # for each value
  mutate(id = row_number()) %>%           # create a row identifier
  spread(Species, Sepal.Length)           # reshape dataset

# # A tibble: 50 x 4
#       id setosa versicolor virginica
#  * <int>  <dbl>      <dbl>     <dbl>
# 1     1    5.1        7.0       6.3
# 2     2    4.9        6.4       5.8
# 3     3    4.7        6.9       7.1
# 4     4    4.6        5.5       6.3
# 5     5    5.0        6.5       6.5
# 6     6    5.4        5.7       7.6
# 7     7    4.6        6.3       4.9
# 8     8    5.0        4.9       7.3
# 9     9    4.4        6.6       6.7
# 10    10   4.9        5.2       7.2
# # ... with 40 more rows
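The grouped row_number() is what removes the NAs: the counter restarts at 1 within each species, so ids 1 to 50 are shared by all three species and every output row receives one value per column. As a sketch, assuming tidyr 1.0.0 or later, the same reshape can also be written with pivot_wider(), which supersedes spread(). The name `wide` is only illustrative.

```r
library(dplyr)
library(tidyr)

wide <- iris %>%
  as_tibble() %>%
  select(Species, Sepal.Length) %>%   # keep only the columns of interest
  group_by(Species) %>%               # number rows separately per species
  mutate(id = row_number()) %>%       # id runs 1..50 within each species
  ungroup() %>%
  pivot_wider(names_from = Species, values_from = Sepal.Length)

print(wide)
```

Because iris has exactly 50 rows per species, no cell is left empty here. With unequal group sizes, the shorter groups would be padded with NA.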

Be extra careful about how you create and use the row identifier. The code above simply relies on the current order of the dataset, so if you reorder the data in any way, different values will be paired on each row. Check the code below:

as_tibble(iris) %>%                       # tbl_df() is deprecated in current dplyr
  arrange(desc(Sepal.Length)) %>%         # order your values descending
  select(Species, Sepal.Length) %>%       # select columns of interest
  group_by(Species) %>%                   # for each value
  mutate(id = row_number()) %>%           # create a row identifier
  spread(Species, Sepal.Length)           # reshape dataset

# # A tibble: 50 x 4
#      id setosa versicolor virginica
# * <int>  <dbl>      <dbl>     <dbl>
# 1     1    5.8        7.0       7.9
# 2     2    5.7        6.9       7.7
# 3     3    5.7        6.8       7.7
# 4     4    5.5        6.7       7.7
# 5     5    5.5        6.7       7.7
# 6     6    5.4        6.7       7.6
# 7     7    5.4        6.6       7.4
# 8     8    5.4        6.6       7.3
# 9     9    5.4        6.5       7.2
# 10    10   5.4        6.4       7.2
# # ... with 40 more rows

The only difference from before is the arrange(desc(Sepal.Length)) step, which puts the higher values in the top rows (descending order).
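If, as in the question, only the distinct values per species are wanted, one possible variation is to drop duplicates before numbering the rows and to sort explicitly so that the pairing does not depend on the incidental order of the data. This is a sketch under those assumptions, using pivot_wider() from tidyr 1.0.0 or later rather than spread(); `distinct_wide` is an illustrative name.

```r
library(dplyr)
library(tidyr)

distinct_wide <- iris %>%
  as_tibble() %>%
  distinct(Species, Sepal.Length) %>%        # one row per species/value pair
  arrange(Species, desc(Sepal.Length)) %>%   # make the pairing order explicit
  group_by(Species) %>%
  mutate(id = row_number()) %>%              # id restarts at 1 within each species
  ungroup() %>%
  pivot_wider(names_from = Species, values_from = Sepal.Length)

print(distinct_wide)
```

Unlike the cbind() call in the question, which recycles the shorter column, this leaves NA in the setosa column once its 15 distinct values run out, because versicolor and virginica each have 21.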
