How can I manipulate a very large list?


Question

I have over 10000 files. I first set my working directory to the folder that contains the files.

Then I list all the files with the .txt extension like this:

filenames <- list.files("path to the file", pattern="\\.txt$", full.names=TRUE) #pattern is a regular expression, not a glob

Then I read them with fread:

ldf<- lapply(filenames, FUN=fread, header=TRUE)
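If only a few columns are needed from each file, one option worth trying is fread's select argument, which keeps the unwanted columns out of memory in the first place. The sketch below is only an illustration: it writes two small tab-separated files to a temporary folder (the folder name, file names and the extra column are made up) and then reads back only myfile and myname.

```r
library(data.table)

# Hypothetical demo folder with two small files standing in for the real ones
demo_dir <- file.path(tempdir(), "fread_select_demo")
dir.create(demo_dir, showWarnings = FALSE)
fwrite(data.table(check = FALSE, myfile = "1xOxidation [M7]",
                  myname = "Q9Y383", extra = 1L),
       file.path(demo_dir, "a.txt"), sep = "\t")
fwrite(data.table(check = FALSE, myfile = "1xCarbamidomethyl [C1]",
                  myname = "P39019", extra = 2L),
       file.path(demo_dir, "b.txt"), sep = "\t")

filenames <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

# Read only the two columns of interest from every file
ldf <- lapply(filenames, fread, header = TRUE, select = c("myfile", "myname"))

# Name the list elements after their source files for later reference
names(ldf) <- basename(filenames)
```

Each element of ldf then has just the two requested columns, so no later step is needed to drop the rest.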

Why fread? When I use data.table, for example, the result gets messed up, and I then have to add sep = ",", row.names = FALSE, etc. If you know a better way, please go ahead and advise.

In any case, after doing this I end up with a huge list from which I now need to extract data. As an example, I have tried to make representative data below.

Of course, in the real data there are many more columns in each file; here there are only three, named check, myfile and myname.

I then tried to keep only the columns myfile and myname with the following command, which did not work.

t<- lapply(ldf, `[`, c(2,3))
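A likely reason the command above does not select columns: fread returns data.table objects, and for a data.table a single unnamed index such as c(2,3) is treated as the row index i, so rows 2 and 3 come back with all columns. A minimal sketch on a made-up three-row table (DT and its values are illustrative only):

```r
library(data.table)

# Small illustrative table with the same three column names as the question
DT <- data.table(check  = c(FALSE, FALSE, FALSE),
                 myfile = c("a", "b", "c"),
                 myname = c("x", "y", "z"))

# What lapply(ldf, `[`, c(2,3)) does to each element:
# the vector is taken as row numbers, so all three columns are kept
rows_2_3 <- DT[c(2, 3)]

# Selecting columns instead: leave i empty and pass the columns as j
by_name     <- DT[, c("myfile", "myname")]    # data.table 1.9.8 or later
by_position <- DT[, c(2, 3), with = FALSE]    # also works on older versions
```

Applied to the whole list, the same idea would read lapply(ldf, function(x) x[, c(2, 3), with = FALSE]).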



 my.list <- list(structure(list(check = c(FALSE, FALSE, FALSE, FALSE, FALSE, 
FALSE), myfile = c("", "1xLabel:13C(6)15N(4) [R11]", "1xOxidation [M7]", 
"", "1xLabel:13C(6)15N(4) [R11]", ""), myname = c("Q9Y383", "Q9Y383", 
"Q9Y383", "Q15366-2", "Q15366-2", "Q15366-2")), .Names = c("check", 
"myfile", "myname"), row.names = c(NA, -6L), class = c("data.table", 
"data.frame")), structure(list(
    check = c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE
    ), myfile = c(NA, NA, NA, NA, NA, NA, NA), Myname = c("F8W727", 
    "O76021", "P46783", "P35527", "Q96C45", "Q9Y383", "Q9Y383"
    )), .Names = c("check", "myfile", "myname"), row.names = c(NA, 
-7L), class = c("data.table", "data.frame")), 
    structure(list(check = c(FALSE, FALSE, FALSE, FALSE, FALSE, 
    FALSE, FALSE, FALSE, FALSE, FALSE, FALSE), myfile = c("", 
    "2xLabel:13C(6)15N(4) [R6; R8]; 1xCarbamidomethyl [C4]", 
    "", "", "", "1xCarbamidomethyl [C1]", "", "", "", "", "1xLabel:13C(6)15N(4) [R6]; 1xCarbamidomethyl [C5]"
    ), myname = c("P39019", "A2A3R5; P62753", "Q8IYB3; E9PCT1; M0R088; A9Z1X7; Q8IYB3-2", 
    "S4R3J4; O43390-3; B4DT28; O43390; O43390-2; O60506; O60506-2; E7ETM7", 
    "P07910-4; B4DY08; G3V4C1; P07910-2; G3V4W0; P07910; G3V5V7; P07910-3; G3V2D6; G3V2Q1", 
    "D6R9X9; D6RG19; P61927", "Q00839", "G3XAD8; H0YGI8; P31948; F5H0T1", 
    "Q8IYB3; E9PCT1; M0R088; A9Z1X7; Q8IYB3-2", "P42766", "Q9NX58; D6RDJ1"
    )), .Names = c("check", "myfile", "myname"), row.names = c(NA, 
    -11L), class = c("data.table", "data.frame")), 
    structure(list(check = c(FALSE, FALSE, FALSE, FALSE, FALSE, 
    FALSE, FALSE, FALSE, FALSE, FALSE, FALSE), myfile = c("", 
    "", "", "", "1xLabel:13C(6)15N(4) [R7]", "", "", "", "3xLabel:13C(6)15N(4) [R1; R7; R10]", 
    "", ""), myname = c("P61247", "P39019", "Q9NWH9", "P62917", 
    "P62917", "E9PCT1", "Q15149", "Q14152", "Q14152", "Q15020", 
    "Q02543")), .Names = c("check", "myfile", "myname"), row.names = c(NA, 
    -11L), class = c("data.table", "data.frame")))

What do I want?

I want to check whether myfile and myname are present in all the files I loaded, and then produce output like this:

  file1                file2                  file3                 file4
myfile   myname       myfile   myname      myfile   myname     myfile   myname 
 info     info         info      info        info    info       info     info
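The first half of that request, checking that every loaded table really has both columns, could be sketched as follows. The list loaded and its element names are made up for the illustration; the second element deliberately lacks myname.

```r
# Hypothetical list standing in for the loaded files
loaded <- list(
  file1 = data.frame(check = FALSE, myfile = "", myname = "Q9Y383",
                     stringsAsFactors = FALSE),
  file2 = data.frame(check = FALSE, myfile = "",
                     stringsAsFactors = FALSE)
)

required <- c("myfile", "myname")

# TRUE for each element that contains every required column
has_required <- vapply(loaded,
                       function(x) all(required %in% names(x)),
                       logical(1))

# Names of the elements that are missing at least one required column
incomplete <- names(has_required)[!has_required]
```

Elements listed in incomplete can then be inspected or dropped before combining the rest.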

To make it more reproducible, I want the output for the example data to look like this:

    myout<- structure(list(myfile1 = structure(c(NA, 1L, 2L, NA, 1L, NA, 
NA, NA, NA, NA, NA), .Label = c("1xLabel:13C(6)15N(4) [R11]", 
"1xOxidation [M7]"), class = "factor"), Myname1 = structure(c(2L, 
2L, 2L, 1L, 1L, 1L, NA, NA, NA, NA, NA), .Label = c("Q15366-2", 
"Q9Y383"), class = "factor"), myfile2 = c(NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA), Myname2 = structure(c(1L, 2L, 4L, 3L, 
5L, 6L, 6L, NA, NA, NA, NA), .Label = c("F8W727", "O76021", "P35527", 
"P46783", "Q96C45", "Q9Y383"), class = "factor"), myfile3 = structure(c(NA, 
3L, NA, NA, NA, 1L, NA, NA, NA, NA, 2L), .Label = c("1xCarbamidomethyl [C1]", 
"1xLabel:13C(6)15N(4) [R6]; 1xCarbamidomethyl [C5]", "2xLabel:13C(6)15N(4) [R6; R8]; 1xCarbamidomethyl [C4]"
), class = "factor"), Myname3 = structure(c(5L, 1L, 8L, 10L, 
4L, 2L, 7L, 3L, 8L, 6L, 9L), .Label = c("A2A3R5; P62753", "D6R9X9; D6RG19; P61927", 
"G3XAD8; H0YGI8; P31948; F5H0T1", "P07910-4; B4DY08; G3V4C1; P07910-2; G3V4W0; P07910; G3V5V7; P07910-3; G3V2D6; G3V2Q1", 
"P39019", "P42766", "Q00839", "Q8IYB3; E9PCT1; M0R088; A9Z1X7; Q8IYB3-2", 
"Q9NX58; D6RDJ1", "S4R3J4; O43390-3; B4DT28; O43390; O43390-2; O60506; O60506-2; E7ETM7"
), class = "factor"), myfile4 = structure(c(NA, NA, NA, NA, 1L, 
NA, NA, NA, 2L, NA, NA), .Label = c("1xLabel:13C(6)15N(4) [R7]", 
"3xLabel:13C(6)15N(4) [R1; R7; R10]"), class = "factor"), Myname4 = structure(c(3L, 
2L, 9L, 4L, 4L, 1L, 8L, 6L, 6L, 7L, 5L), .Label = c("E9PCT1", 
"P39019", "P61247", "P62917", "Q02543", "Q14152", "Q15020", "Q15149", 
"Q9NWH9"), class = "factor")), .Names = c("myfile1", "Myname1", 
"myfile2", "Myname2", "myfile3", "Myname3", "myfile4", "Myname4"
), class = "data.frame", row.names = c(NA, -11L))

New request

Then I want to split the data into two data frames. One, called df1, keeps only those mynames whose myfile contains the special strings; the other, df2, keeps those mynames whose myfile is empty or does not contain those special strings.

df1<- structure(list(myname1 = structure(c(3L, 2L, 1L, 1L), .Label = c("", 
"Q15366-2", "Q9Y383"), class = "factor"), myname2 = c(NA, NA, 
NA, NA), myname3 = structure(c(1L, 3L, 4L, 2L), .Label = c("A2A3R5", 
"D6RDJ1", "P62753", "Q9NX58"), class = "factor"), myname4 = structure(c(2L, 
3L, 1L, 1L), .Label = c("", "P62917", "Q14152"), class = "factor")), .Names = c("myname1", 
"myname2", "myname3", "myname4"), class = "data.frame", row.names = c(NA, 
-4L))


df2 <- structure(list(myname1 = structure(c(3L, 3L, 2L, 2L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L), .Label = c("", "Q15366-2", "Q9Y383"), class = "factor"), 
    myname2 = structure(c(2L, 3L, 5L, 4L, 6L, 7L, 7L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L), .Label = c("", "F8W727", "O76021", "P35527", "P46783", 
    "Q96C45", "Q9Y383"), class = "factor"), myname3 = structure(c(29L, 
    33L, 11L, 18L, 1L, 34L, 35L, 22L, 6L, 20L, 21L, 23L, 4L, 
    10L, 27L, 7L, 2L, 25L, 15L, 24L, 16L, 26L, 13L, 14L, 8L, 
    9L, 31L, 8L, 9L, 31L, 32L, 17L, 3L, 28L, 12L, 33L, 11L, 19L, 
    5L, 34L, 30L), .Label = c(" A9Z1X7", " G3V4C1", " H0YGI8", 
    " O60506-2 ", "A9Z1X7", "B4DT28", "B4DY08", "D6R9X9", "D6RG19", 
    "E7ETM7", "E9PCT1", "F5H0T1", "G3V2D6", "G3V2Q1", "G3V4W0", 
    "G3V5V7", "G3XAD8", "M0R088", "M0R088 ", "O43390", "O43390-2", 
    "O43390-3", "O60506", "P07910", "P07910-2 ", "P07910-3 ", 
    "P07910-4", "P31948", "P39019", "P42766", "P61927", "Q00839", 
    "Q8IYB3", "Q8IYB3-2", "S4R3J4"), class = "factor"), myname4 = structure(c(4L, 
    3L, 10L, 5L, 2L, 9L, 7L, 8L, 6L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", 
    "E9PCT1", "P39019", "P61247", "P62917", "Q02543", "Q14152", 
    "Q15020", "Q15149", "Q9NWH9"), class = "factor")), .Names = c("myname1", 
"myname2", "myname3", "myname4"), class = "data.frame", row.names = c(NA, 
-41L))
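For a single table, the split described above might be sketched like this. It assumes that the "special string" is the literal label 13C(6)15N(4) and that identifiers in myname are separated by "; ", as they appear in the example data; the four rows of one_file are a small subset chosen for illustration.

```r
# Illustrative subset of one loaded table
one_file <- data.frame(
  myfile = c("",
             "2xLabel:13C(6)15N(4) [R6; R8]; 1xCarbamidomethyl [C4]",
             "1xCarbamidomethyl [C1]",
             NA),
  myname = c("P39019", "A2A3R5; P62753", "D6R9X9; D6RG19; P61927", "Q00839"),
  stringsAsFactors = FALSE
)

# fixed = TRUE matches the label literally, so the parentheses need no escaping;
# grepl returns FALSE for NA, so missing myfile values count as "no label"
has_label <- grepl("13C(6)15N(4)", one_file$myfile, fixed = TRUE)

# Split multi-identifier entries into one identifier per element
with_label    <- unlist(strsplit(one_file$myname[has_label],  ";\\s*"))
without_label <- unlist(strsplit(one_file$myname[!has_label], ";\\s*"))
```

Running the same steps over every element of the list with lapply gives one pair of vectors per file, which can then be padded to a common length before being bound into df1 and df2.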

Recommended answer

Here is one approach. It is intended to get to the core of your issue; you can change the column names afterwards and make any other cosmetic changes you like. I wrote a helper function, add_rows, that takes three arguments: a data table, the number of rows to add, and the value to fill them with.

library(data.table)
#version 1.10+

#Helper function to add n extra rows filled with `fill`
#column names match the data (myfile, myname) so rbindlist lines them up
add_rows <- function(DT, n, fill='') {
  rbindlist(list(DT, data.table(myfile=rep(fill,n), myname=rep(fill,n))))
}

#Remove first column 
lst2 <- lapply(my.list, function(x) x[, c("myfile", "myname")]) #if using version <= 1.9.8, x[, -1, with=FALSE]

#data table with most rows
len <- max(sapply(lst2, nrow))

#Add rows
lst3 <- lapply(lst2, function(x) add_rows(x, len-nrow(x)))

#Order rows
#parentheses are escaped with backslashes because unescaped they have special meaning in regular expressions
tofind <- c("13C\\(6\\)15N\\(4\\)", "13C\\(6\\)")
lst4 <- lapply(lst3, function(DT) {
  pattern <- paste0(tofind, collapse="|")
  moveup <- DT[, grep(pattern, myfile)]
  myorder <- c(moveup, setdiff(1:nrow(DT), moveup))
  DT[myorder]
})

#Combine data
newdf <- do.call('cbind', lst4)

#Update names
setnames(newdf, paste0(names(newdf), rep(1:table(names(newdf))[1], each=2)))

newdf
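The comment about backslashes in the code above can be checked on a few strings from the example data. The sketch below compares an escaped regular expression, a literal match with fixed = TRUE (an alternative the answer does not use), and the unescaped pattern, which treats (6) as a group and therefore looks for the text 13C6.

```r
x <- c("1xLabel:13C(6)15N(4) [R11]", "1xOxidation [M7]", "")

# Escaped parentheses: matched as literal characters
escaped <- grepl("13C\\(6\\)", x)

# fixed = TRUE: the whole pattern is taken literally, no escaping needed
literal <- grepl("13C(6)", x, fixed = TRUE)

# Unescaped: "(6)" is a capture group, so this searches for "13C6"
unescaped <- grepl("13C(6)", x)
```

The first two calls flag only the first string, while the unescaped pattern matches none of them.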
