R: how to find select files in a folder based on matching specific column title


Problem description

Sorry for the generic question. I'm looking for pointers on sorting out a data folder that contains numerous .txt files. All of them have different titles, and the vast majority have the same dimensions, that is, the same number of columns. The pain is that some of the files, despite having the same number of columns, have different column names: in those files, some other variables were measured.

I want to weed out these files, and I cannot do so by simply comparing the number of columns. Is there a method where I can pass the name of a column and check which files in the directory (and how many) have that column, so that I can move them into a different folder?

Update:

I have created a dummy folder with files that reflect the problem; please see the link below to access the files on my Google Drive. In this folder, I have included 4 files that have the problem columns.

https://drive.google.com/drive/folders/1IDq7BwfQNkGb9y3Rvwlta3D?usp=sharing

The problem is that the code seems to be able to find files matching the selection criterion, i.e. the actual name of a problem column, but I cannot extract the real index of such files in the list. Any pointers?

library(data.table)

#read in the example file that have the problem column content
df_var <- read.delim("ctrl_S3127064__3S_DMSO_00_none.TXT", header = T, sep = "\t")

#read in a file that I want to use as reference
df_standard <- read.delim("ctrl__S162465_20190111_T8__3S_2DG_3mM_none.TXT", header = T, sep = "\t")

#get the names of columns of each file
standar.names <- names(df_standard)
var.names <- names(df_var)

same.titles <- var.names %in% standar.names

dff.titles <- !var.names %in% standar.names

#confirm the only 3 columns of problem is column 129,130 and 131 
mismatched.names <- colnames(df_var[129:131])

#visual check the names of the problematic columns
mismatched.names


# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)

# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
  l_files[[i]] <- read.delim(file = files_in_wd[i],
                         sep = "\t",
                         header = T,
                         nrows = 2)
}

# get column names of all files
column_names <- lapply(l_files, names)

# get unique names of files
unique_names <- unique(mismatched.names)
unique_names[1]
# decide which files to remove
# here "to_keep" returns an integer vector that I don't understand
# I thought the numbers should represent the ID/index of the elements,
# but I have fewer than 10 files and the numbers in to_keep are around 1000
# this is probably because it matches the actual index in the unlisted list,
# but if I use to_keep <- which(column_names %in% unique_names[1]) it returns an empty vector

to_keep <- which(unlist(column_names)%in% unique_names[1])


#now if I want to slice the file using to_keep the files_to_keep returns NA NA NA
files_to_keep <- files_in_wd[to_keep]

#once I have a list of targeted files, I can move them into a new folder by using file.move
library(filesstrings)
file.move(files_to_keep, "C:/Users/mli/Desktop/weeding/need to reanalysis" )

Recommended answer

If you can distinguish the files you'd like to keep from those you'd like to drop based on the column names, you could use something along these lines:

# set working directory to folder with generic text files
setwd("C:/Users/tester/Desktop/generic-text-files")

# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)

# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
  l_files[[i]] <- read.delim(file = files_in_wd[i],
                             sep = ';',
                             header = T,
                             nrows = 2)
}

# get column names of all files
column_names <- lapply(l_files, names)
# get the unique sets of column names found across the files
unique_names <- unique(column_names)
# keep the files whose column names match the first unique set (that of the first file)
to_keep <- which(column_names %in% unique_names[1])

files_to_keep <- files_in_wd[to_keep]

If you have many files, you should probably avoid the loop, or read in only the header of each file.
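As a rough illustration of reading only the header, the sketch below pulls the first line of each file with `readLines()` and splits it on the separator. The helper `read_header` and the demo files are hypothetical, and it assumes plain, unquoted, tab-separated headers; unlike `read.delim()`, it does not convert names into syntactically valid R names.

```r
# Hypothetical helper (not from the original answer): return the column
# names of a delimited text file by reading only its first line.
# Assumes the header fields are not quoted.
read_header <- function(path, sep = "\t") {
  first_line <- readLines(path, n = 1, warn = FALSE)
  strsplit(first_line, split = sep, fixed = TRUE)[[1]]
}

# Self-contained demo: two small tab-separated files in a temporary folder
demo_dir <- file.path(tempdir(), "header_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("time\tvalue", "1\t2"), file.path(demo_dir, "a.txt"))
writeLines(c("time\tother", "1\t2"), file.path(demo_dir, "b.txt"))

# ignore.case = TRUE also picks up files ending in .TXT
paths <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE,
                    ignore.case = TRUE)

# One character vector of column names per file, no explicit loop
headers <- lapply(paths, read_header)
names(headers) <- basename(paths)
```

Because only one line per file is read, this should scale better than parsing rows with `read.delim()` when the folder is large.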

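Edit after the comments:

  • By adding nrows = 2, the code reads only the first 2 rows plus the header.
  • I assume the first file in the folder has the structure you want to keep, which is why column_names is checked against unique_names[1].
  • files_to_keep contains the names of the files you want to keep.
  • You could try running it on part of your data to see whether it works, and worry about efficiency later. I think a vectorized approach would probably work better.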
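The check asked for in the question, passing one column name and finding the files that contain it, could be sketched per file with `vapply()`. The file names, column names and `target` below are made-up stand-ins. The sketch also shows why `which(unlist(column_names) %in% ...)` gives indices far larger than the number of files: `unlist()` flattens all headers into one long vector, so the positions refer to that vector and not to the files.

```r
# Toy stand-in for column_names: one character vector of column names per file
files_in_wd <- c("file1.txt", "file2.txt", "file3.txt")
column_names <- list(
  c("time", "value", "temp"),
  c("time", "value", "extra_var"),
  c("time", "value", "temp")
)

target <- "extra_var"  # hypothetical name of a problem column

# One TRUE/FALSE per file: does this file's header contain the target column?
has_target <- vapply(column_names, function(nms) target %in% nms, logical(1))

files_with_target <- files_in_wd[has_target]  # names of the matching files
n_with_target <- sum(has_target)              # how many files match

# For comparison: position in the flattened vector of all 9 names,
# which is not a file index, so indexing files_in_wd with it gives NA
flat_index <- which(unlist(column_names) %in% target)
wrong_lookup <- files_in_wd[flat_index]
```

Here `has_target` has the same length as `files_in_wd`, so it can be used directly to subset the file names.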

Edit: This code works with your dummy data.

library(filesstrings)

# set working directory to folder with generic text files
setwd("C:/Users/tester/Desktop/generic-text-files/dummyset")

# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)

# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
  l_files[[i]] <- read.delim(file = files_in_wd[i],
                             sep = "\t",
                             header = T,
                             nrows = 2,
                             encoding = "UTF-8",
                             check.names = FALSE
                            )
}

# get column names of all files
column_names <- lapply(l_files, names)
# decide which files to keep
to_keep <- column_names[[1]] # e.g. column names of file #1 are ok

# check if the other files have the same header:
df_filehelper <- data.frame('fileindex' = seq_along(files_in_wd),
  'filename' = files_in_wd,
  'keep' = NA,
  stringsAsFactors = FALSE) # keep file names as character (before R 4.0.0 the default was factor)

# seq_along() is safe even if the folder holds a single file (2:length(x) would count down to 1);
# file #1 is the reference, so it compares as identical to itself and is kept
for(i in seq_along(files_in_wd)){
  df_filehelper$keep[i] <- identical(to_keep, column_names[[i]])
}

# move files out of the current folder:
files_to_move <- df_filehelper$filename[!df_filehelper$keep] # selects file that are not to be kept

file.move(files_to_move, "C:/Users/tester/Desktop/generic-text-files/dummyset/testsubfolder/")
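If the filesstrings package is not available, a base-R alternative could look like the sketch below, which uses `file.rename()`. All folder and file names are made up for the demo, and it assumes source and destination are on the same file system. It also shows one thing to watch for when the destination is a subfolder of the working directory: `list.files()` returns the subfolder too, so it should be filtered out before the files are read on a later run.

```r
# Self-contained demo in a temporary folder (all paths here are made up)
src_dir <- file.path(tempdir(), "move_demo")
dest_dir <- file.path(src_dir, "to_reanalyse")
dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)

files_to_move <- c("odd1.txt", "odd2.txt")
file.create(file.path(src_dir, c(files_to_move, "ok.txt")))

# file.rename() is vectorised and returns one logical per file
moved <- file.rename(from = file.path(src_dir, files_to_move),
                     to   = file.path(dest_dir, files_to_move))

# list.files() also lists the destination subfolder, so drop directories
# before passing the remaining names to a reader such as read.delim()
sub_dirs <- list.dirs(src_dir, full.names = FALSE, recursive = FALSE)
remaining <- setdiff(list.files(src_dir), sub_dirs)
```

Checking `all(moved)` afterwards is a cheap way to confirm that every file was actually moved.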
