对具有错误的行使用带有重复标识符的点差 [英] Using spread with duplicate identifiers for rows giving error

查看:151
本文介绍了对具有错误的行使用带有重复标识符的点差的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的数据如下:

df <- read.table(header = T, text =
        "GeneID    Gene_Name   Species    Paralogues    Domains   Functional_Diversity
         1234      DDR1        hsapiens   14            2         8.597482
         5678      CSNK1E      celegans   70            4         8.154788
         9104      FGF1        Chicken    3             0         5.455874
         4575      FGF1        hsapiens   4             6         6.745845")

我需要它看起来像:

   Gene_Name    hsapiens    celegans    ggalus
   DDR1         8.597482    NA          NA
   CSNK1E       NA          8.154788    NA
   FGF1         6.745845    NA          5.455874

我尝试使用:

library(tidyverse)
df %>% 
    select(Gene_Name, Species, Functional_Diversity) %>% 
    spread(Species, Functional_Diversity)

我的实际数据包括130,000行(许多基因名称大约14,000个唯一),包括9种.

My actual data consists of 130,000 rows (many Gene Names approx 14,000 unique), consisting of 9 species.

当我将此方法应用于实际数据时,我得到:

When I apply this method to my actual data I get:

Error: Duplicate identifiers for rows (16691, 19988), (20938, 21033), (1232, 21150), (2763, 21465), (1911, 20844), (17274, 17657, 18293, 18652, 18726, 19006, 19025), (496, 22555), (17227, 17608, 18211, 18605, 18676, 18967, 19002), (13569, 21807), (10261, 21014, 21607), (20816, 21553), (2244, 22025), (6194, 21910), (12217, 21555), (2936, 21078), (16484, 20911), (12216, 21851), (9289, 21791), (10340, 21752), (1714, 22077), (13216, 22618), (6076, 22371), (14731, 21717), (160, 22472), (11553, 22635), (17183, 17583, 18510, 18608, 18661, 18896, 19108), (138, 20028), (17185, 17584, 18330, 18415, 18500, 18981, 19063), (9726, 22440), (17238, 17617, 18905, 18960, 18996, 19134), (1638, 21645), (4631, 20821), (9162, 22463), (319, 20900), (13600, 22227), (9312, 20011), (14825, 21711, 21764), (3381, 21134), (505, 21133), (5954, 20013), (5948, 21313), (17233, 17612, 18187, 18311, 18411, 18708, 18980), (16953, 20902, 21845), (20710, 22477), (20519, 20973), (10204, 21197, 21213), (2933, 20707), (4302,

推荐答案

要仅查看具有重复标识符"的行,可以使用...

To see just the rows that have "Duplicate identifiers", you could use...

df %>% 
  group_by(Gene_Name, Species) %>% 
  mutate(n = n()) %>% 
  filter(n > 1)

为确保spread正常运行,即使您的行具有重复的标识符,您也可以添加行号列,以确保每一行都是唯一的...

To ensure the spread works, even if you have rows with duplicate identifiers, you can add a row number column which will guarantee that each row is unique...

df %>% 
  select(Gene_Name, Species, Functional_Diversity) %>% 
  mutate(row = row_number()) %>% 
  spread(Species, Functional_Diversity)

这篇关于对具有错误的行使用带有重复标识符的点差的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆