Merge multiple CSV files and remove duplicates in R


Question

I have almost 3,000 CSV files (containing tweets) in the same format, and I want to merge these files into one new file and remove the duplicate tweets. I have come across various topics discussing similar questions, but the number of files involved is usually quite small. I hope you can help me write code in R that does this job both efficiently and effectively.

The CSV files have the following format:

Image of CSV format:

I changed (in columns 2 and 3) the usernames (on Twitter) to A-E and the 'actual names' to A1-E1.

Raw text file:

"tweet";"author";"local.time"
"1";"2012-06-05 00:01:45 @A (A1):  Cruijff z'n met-zwart-shirt-zijn-ze-onzichtbaar logica is even mooi ontkracht in #bureausport.";"A (A1)";"2012-06-05 00:01:45"
"2";"2012-06-05 00:01:41 @B (B1):  Welterusten #BureauSport";"B (B1)";"2012-06-05 00:01:41"
"3";"2012-06-05 00:01:38 @C (C1):  Echt ..... eindelijk een origineel sportprogramma #bureausport";"C (C1)";"2012-06-05 00:01:38"
"4";"2012-06-05 00:01:38 @D (D1):  LOL. \"Na onderzoek op de Fontys Hogeschool durven wij te stellen dat..\" Want Fontys staat zo hoog aangeschreven? #bureausport";"D (D1)";"2012-06-05 00:01:38"
"5";"2012-06-05 00:00:27 @E (E1):  Ik kijk Bureau sport op Nederland 3. #bureausport  #kijkes";"E (E1)";"2012-06-05 00:00:27"

Somehow my headers are messed up; they obviously should move one column to the right. Each CSV file contains up to 1500 tweets. I would like to remove the duplicates by checking the 2nd column (containing the tweets), simply because these should be unique, whereas the author columns can repeat (e.g. one author posting multiple tweets).

Is it possible to combine merging the files and removing the duplicates, or is this asking for trouble and should the processes be separated? As a starting point I have included links to two blog posts by Hayward Godwin that discuss three approaches for merging CSV files.

http://psychwire.wordpress.com/2011/06/03/merge-all-files-in-a-directory-using-r-into-a-single-dataframe/

http://psychwire.wordpress.com/2011/06/05/testing-different-methods-for-merging-a-set-of-files-into-a-dataframe/

Obviously there are some topics related to my question on this site as well (e.g. Merging multiple csv files in R) but I haven't found anything that discusses both merging and removing the duplicates. I really hope you can help me and my limited R knowledge deal with this challenge!

Although I have tried some code I found on the web, this didn't actually result in an output file. The approximately 3,000 CSV files have the format discussed above. I mainly tried the following code (for the merge part):

filenames <- list.files(path = "~/")
do.call("rbind", lapply(filenames, read.csv, header = TRUE))              

This results in the following error:

Error in file(file, "rt") : cannot open the connection 
In addition: Warning message: 
In file(file, "rt") : 
  cannot open file '..': No such file or directory 

Update

I have tried the following code:

 # grab our list of filenames
 filenames <- list.files(path = ".", pattern='^.*\\.csv$')
 # write a special little read.csv function to do exactly what we want
 my.read.csv <- function(fnam) { read.csv(fnam, header=FALSE, skip=1, sep=';',     col.names=c('ID','tweet','author','local.time'), colClasses=rep('character', 4)) }
 # read in all those files into one giant data.frame
 my.df <- do.call("rbind", lapply(filenames, my.read.csv))
 # remove the duplicate tweets
 my.new.df <- my.df[!duplicated(my.df$tweet),]

But I run into the following errors:

After the 3rd line I get:

  Error in read.table(file = file, header = header, sep = sep, quote = quote,  :  more columns than column names

After the 4th line I get:

  Error: object 'my.df' not found

I suspect that these errors are caused by some failures in the writing process of the csv files, since there are some cases of the author/local.time being in the wrong column, either to the left or the right of where they're supposed to be, which results in an extra column. I manually adapted 5 files and tested the code on them, and I didn't get any errors. However, it seemed like nothing happened at all; I didn't get any output from R.
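
(One possible explanation for the silence: an assignment in R prints nothing by itself, so the result would have to be inspected explicitly, e.g.:)

 str(my.df)       # structure of the merged data.frame
 nrow(my.new.df)  # number of tweets left after removing duplicates
 head(my.new.df)  # peek at the first few rows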

To solve the extra column problem I adjusted the code slightly:

 #grab our list of filenames
 filenames <- list.files(path = ".", pattern='^.*\\.csv$')
 # write a special little read.csv function to do exactly what we want
 my.read.csv <- function(fnam) { read.csv(fnam, header=FALSE, skip=1, sep=';',   col.names=c('ID','tweet','author','local.time','extra'), colClasses=rep('character', 5)) }
 # read in all those files into one giant data.frame
 my.df <- do.call("rbind", lapply(filenames, my.read.csv))
 # remove the duplicate tweets
 my.new.df <- my.df[!duplicated(my.df$tweet),]

I tried this code on all the files. Although R clearly started processing, I eventually got the following errors:

 Error in read.table(file = file, header = header, sep = sep, quote = quote,  : more columns than column names
 In addition: Warning messages:
 1: In read.table(file = file, header = header, sep = sep, quote = quote,  : incomplete final line found by readTableHeader on 'Twitts -  di mei 29 19_22_30 2012 .csv'
 2: In read.table(file = file, header = header, sep = sep, quote = quote,  : incomplete final line found by readTableHeader on 'Twitts -  di mei 29 19_24_31 2012 .csv'

 Error: object 'my.df' not found

What did I do wrong?

Solution

First, simplify matters by working in the folder where the files are, and try setting the pattern to read only files ending in '.csv', so something like

filenames <- list.files(path = ".", pattern='^.*\\.csv$')
my.df <- do.call("rbind", lapply(filenames, read.csv, header = TRUE))
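
If you'd rather avoid changing the working directory, list.files() can also return full paths (note full.names = TRUE), which read.csv() can then open from anywhere; a minimal variant of the above:

# full.names = TRUE prefixes each file name with its directory,
# so read.csv() finds the files regardless of the working directory
filenames <- list.files(path = "~/", pattern = '^.*\\.csv$', full.names = TRUE)
my.df <- do.call("rbind", lapply(filenames, read.csv, header = TRUE))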

This should get you a data.frame with the contents of all the tweets.

A separate issue is the headers in the csv files. Thankfully you know that all files are identical, so I'd handle those something like this:

read.csv('fred.csv', header=FALSE, skip=1, sep=';',
    col.names=c('ID','tweet','author','local.time'),
    colClasses=rep('character', 4))

N.B. changed so that all columns are read as character, and as ';'-separated.

I'd parse out the time later if it was needed...
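
If that parsing is needed later, a minimal sketch (assuming every timestamp follows the 'YYYY-MM-DD HH:MM:SS' pattern shown above) would be:

# convert the character timestamps into proper date-time values
my.df$local.time <- as.POSIXct(my.df$local.time, format = '%Y-%m-%d %H:%M:%S')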

A further separate issue is the uniqueness of the tweets within the data.frame - but I'm not clear if you want them to be unique to a user or globally unique. For globally unique tweets, something like

my.new.df <- my.df[!duplicated(my.df$tweet),]

For unique by author, I'd append the two fields - hard to know what works without the real data though!

my.new.df <- my.df[!duplicated(paste(my.df$tweet, my.df$author)),]
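
An equivalent formulation that avoids the paste() trick is to hand duplicated() both columns at once, since it also works on data.frames:

# rows are duplicates only if both tweet AND author match
my.new.df <- my.df[!duplicated(my.df[, c('tweet', 'author')]), ]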

So bringing it all together and assuming a few things along the way...

# grab our list of filenames
filenames <- list.files(path = ".", pattern='^.*\\.csv$')
# write a special little read.csv function to do exactly what we want
my.read.csv <- function(fnam) { read.csv(fnam, header=FALSE, skip=1, sep=';',
    col.names=c('ID','tweet','author','local.time'),
    colClasses=rep('character', 4)) }
# read in all those files into one giant data.frame
my.df <- do.call("rbind", lapply(filenames, my.read.csv))
# remove the duplicate tweets
my.new.df <- my.df[!duplicated(my.df$tweet),]
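
Finally, to get the de-duplicated result back on disk as the single new file the question asks for, something like this would round it off ('merged_tweets.csv' is just a placeholder name; sep = ';' matches the input format):

# write the merged, de-duplicated tweets out as one ';'-separated file
write.table(my.new.df, file = 'merged_tweets.csv', sep = ';',
            row.names = FALSE)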

Based on the revised warnings after line 3, it's a problem with files having different numbers of columns. This is not easy to fix in general, except, as you have suggested, by allowing extra columns in the specification. If you remove the specification then you will run into problems when you try to rbind() the data.frames together...
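
One way to spot the offending files up front is base R's count.fields(), which reports the number of ';'-separated fields on each line of a file; a sketch:

filenames <- list.files(path = ".", pattern = '^.*\\.csv$')
for (fnam in filenames) {
  # one entry per line: how many ';'-separated fields does it have?
  n.fields <- count.fields(fnam, sep = ';')
  if (length(unique(n.fields)) > 1) {
    cat(fnam, 'has a mix of', paste(unique(n.fields), collapse = '/'),
        'fields per line\n')
  }
}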

Here is some code using a for() loop and some debugging cat() statements to make it more explicit which files are broken, so that you can fix things:

filenames <- list.files(path = ".", pattern='^.*\\.csv$')

n.files.processed <- 0 # how many files did we process?
for (fnam in filenames) {
  cat('about to read from file:', fnam, '\n')
  if (exists('tmp.df')) rm(tmp.df)
  tmp.df <- read.csv(fnam, header=FALSE, skip=1, sep=';',
             col.names=c('ID','tweet','author','local.time','extra'),
             colClasses=rep('character', 5)) 
  # use && so the nrow() test is skipped when tmp.df doesn't exist
  if (exists('tmp.df') && (nrow(tmp.df) > 0)) {
    cat('  successfully read:', nrow(tmp.df), ' rows from ', fnam, '\n')
    n.files.processed <- n.files.processed + 1 # count this file as processed
    # now let's append a column containing the originating file name
    # so that debugging the file contents is easier
    tmp.df$fnam <- fnam

    # now let's rbind everything together
    if (exists('my.df')) {
      my.df <- rbind(my.df, tmp.df)
    } else {
      my.df <- tmp.df
    }
  } else {
    cat('  read NO rows from ', fnam, '\n')
  }
}
cat('processed ', n.files.processed, ' files\n')
my.new.df <- my.df[!duplicated(my.df$tweet),]
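
With the fnam column in place, a quick per-file row count also helps confirm that every file contributed what you expect:

table(my.df$fnam) # rows contributed by each input file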
