Merge multiple CSV files and remove duplicates in R


Question

I have almost 3,000 CSV files (containing tweets) with the same format. I want to merge these files into one new file and remove the duplicate tweets. I have come across various topics discussing similar questions, but the number of files involved is usually quite small. I hope you can help me write code in R that does this job both efficiently and effectively.

The CSV files have the following format:

Image of CSV format:

I changed (in columns 2 and 3) the usernames (on Twitter) to A-E and the 'actual names' to A1-E1.

The raw text file:

"tweet";"author";"local.time"
"1";"2012-06-05 00:01:45 @A (A1):  Cruijff z'n met-zwart-shirt-zijn-ze-onzichtbaar logica is even mooi ontkracht in #bureausport.";"A (A1)";"2012-06-05 00:01:45"
"2";"2012-06-05 00:01:41 @B (B1):  Welterusten #BureauSport";"B (B1)";"2012-06-05 00:01:41"
"3";"2012-06-05 00:01:38 @C (C1):  Echt ..... eindelijk een origineel sportprogramma #bureausport";"C (C1)";"2012-06-05 00:01:38"
"4";"2012-06-05 00:01:38 @D (D1):  LOL. \"Na onderzoek op de Fontys Hogeschool durven wij te stellen dat..\" Want Fontys staat zo hoog aangeschreven? #bureausport";"D (D1)";"2012-06-05 00:01:38"
"5";"2012-06-05 00:00:27 @E (E1):  Ik kijk Bureau sport op Nederland 3. #bureausport  #kijkes";"E (E1)";"2012-06-05 00:00:27"

Somehow my headers are messed up; they should obviously move one column to the right. Each CSV file contains up to 1,500 tweets. I would like to remove the duplicates by checking the 2nd column (containing the tweets), simply because these should be unique, while the author columns can be similar (e.g. one author posting multiple tweets).

Is it possible to combine merging the files and removing the duplicates, or is this asking for trouble and should the processes be separated? As a starting point I have included links to two blog posts by Hayward Godwin that discuss three approaches for merging CSV files.

http://psychwire.wordpress.com/2011/06/03/merge-all-files-in-a-directory-using-r-into-a-single-dataframe/

http://psychwire.wordpress.com/2011/06/05/testing-different-methods-for-merging-a-set-of-files-into-a-dataframe/

Obviously there are some topics related to my question on this site as well (e.g. Merging multiple csv files in R), but I haven't found anything that discusses both merging and removing the duplicates. I really hope you can help me, and my limited R knowledge, deal with this challenge!

Although I have tried some code I found on the web, it didn't actually result in an output file. The approximately 3,000 CSV files have the format discussed above. I mainly tried the following code (for the merge part):

filenames <- list.files(path = "~/")
do.call("rbind", lapply(filenames, read.csv, header = TRUE))              

This resulted in the following error:

Error in file(file, "rt") : cannot open the connection 
In addition: Warning message: 
In file(file, "rt") : 
  cannot open file '..': No such file or directory 
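That error usually just means list.files() returned bare file names while R's working directory was somewhere else, so read.csv() could not open them; asking list.files() for full paths avoids it. A minimal sketch on a throwaway temp directory (the file names and toy data here are made up for illustration):

```r
# Build a throwaway directory with two tiny CSVs to demonstrate.
csv.dir <- file.path(tempdir(), "tweets")
dir.create(csv.dir, showWarnings = FALSE)
write.csv(data.frame(tweet = c("a", "b")), file.path(csv.dir, "one.csv"), row.names = FALSE)
write.csv(data.frame(tweet = c("b", "c")), file.path(csv.dir, "two.csv"), row.names = FALSE)

# full.names = TRUE returns paths that read.csv() can open
# regardless of the current working directory.
filenames <- list.files(path = csv.dir, pattern = "\\.csv$", full.names = TRUE)
my.df <- do.call("rbind", lapply(filenames, read.csv, header = TRUE))
nrow(my.df)  # 4: two rows from each of the two files
```

Alternatively, setwd() into the folder first, as the answer below suggests.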

Update

The code:

 # grab our list of filenames
 filenames <- list.files(path = ".", pattern='^.*\\.csv$')
 # write a special little read.csv function to do exactly what we want
 my.read.csv <- function(fnam) { read.csv(fnam, header=FALSE, skip=1, sep=';',
     col.names=c('ID','tweet','author','local.time'),
     colClasses=rep('character', 4)) }
 # read in all those files into one giant data.frame
 my.df <- do.call("rbind", lapply(filenames, my.read.csv))
 # remove the duplicate tweets
 my.new.df <- my.df[!duplicated(my.df$tweet),]

But I ran into the following errors. After the 3rd line I get:

  Error in read.table(file = file, header = header, sep = sep, quote = quote,  :  more columns than column names

After the 4th line I get:

  Error: object 'my.df' not found

I suspect that these errors are caused by some failures in the writing process of the csv files, since in some cases the author/local.time values are in the wrong column, either to the left or the right of where they are supposed to be, which results in an extra column. I manually adapted 5 files and tested the code on these; I didn't get any errors, but it seemed like nothing happened at all. I didn't get any output from R.

To solve the extra column problem I adjusted the code slightly:

 #grab our list of filenames
 filenames <- list.files(path = ".", pattern='^.*\\.csv$')
 # write a special little read.csv function to do exactly what we want
 my.read.csv <- function(fnam) { read.csv(fnam, header=FALSE, skip=1, sep=';',
     col.names=c('ID','tweet','author','local.time','extra'),
     colClasses=rep('character', 5)) }
 # read in all those files into one giant data.frame
 my.df <- do.call("rbind", lapply(filenames, my.read.csv))
 # remove the duplicate tweets
 my.new.df <- my.df[!duplicated(my.df$tweet),]

I tried this code on all the files; although R clearly started processing, I eventually got the following errors:

 Error in read.table(file = file, header = header, sep = sep, quote = quote,  : more columns than column names
 In addition: Warning messages:
 1: In read.table(file = file, header = header, sep = sep, quote = quote,  : incomplete final line found by readTableHeader on 'Twitts -  di mei 29 19_22_30 2012 .csv'
 2: In read.table(file = file, header = header, sep = sep, quote = quote,  : incomplete final line found by readTableHeader on 'Twitts -  di mei 29 19_24_31 2012 .csv'

 Error: object 'my.df' not found

What am I doing wrong?

Answer

First, simplify matters by being in the folder where the files are, and try setting the pattern to read only files ending in '.csv', so something like:

filenames <- list.files(path = ".", pattern='^.*\\.csv$')
my.df <- do.call("rbind", lapply(filenames, read.csv, header = TRUE))

This should get you a data.frame with the contents of all the tweets.

A separate issue is the headers in the csv files. Thankfully you know that all files are identical, so I'd handle those something like this:

read.csv('fred.csv', header=FALSE, skip=1, sep=';',
    col.names=c('ID','tweet','author','local.time'),
    colClasses=rep('character', 4))

Nb. changed so that all columns are character, and ';'-separated.

I'd parse out the time later if it was needed...
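If that parsing is ever needed, base R's as.POSIXct handles the local.time format seen in the sample rows. A sketch (the timezone here is an assumption, not something the data specifies):

```r
# Parse one local.time value from the sample data; tz = "UTC" is an assumption.
t <- as.POSIXct("2012-06-05 00:01:45", format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
format(t, "%H:%M:%S")  # "00:01:45"
```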

A further separate issue is the uniqueness of the tweets within the data.frame, but I'm not clear whether you want them to be unique per user or globally unique. For globally unique tweets, something like:

my.new.df <- my.df[!duplicated(my.df$tweet),]

For unique by author, I'd append the two fields - hard to know what works without the real data though!

my.new.df <- my.df[!duplicated(paste(my.df$tweet, my.df$author)),]
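On a toy data.frame (the tweets and authors below are made up purely for illustration), the difference between the two duplicated() calls looks like this:

```r
# Three rows: the same tweet text twice from author A, once from author B.
toy.df <- data.frame(tweet  = c("hello", "hello", "hello"),
                     author = c("A", "A", "B"),
                     stringsAsFactors = FALSE)

# Globally unique: only the first occurrence of each tweet text survives.
globally.unique  <- toy.df[!duplicated(toy.df$tweet), ]
# Unique per author: the pasted tweet+author key keeps one copy per author.
unique.by.author <- toy.df[!duplicated(paste(toy.df$tweet, toy.df$author)), ]

nrow(globally.unique)   # 1: only author A's first "hello"
nrow(unique.by.author)  # 2: one "hello" each for A and B
```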

So, bringing it all together and assuming a few things along the way...

# grab our list of filenames
filenames <- list.files(path = ".", pattern='^.*\\.csv$')
# write a special little read.csv function to do exactly what we want
my.read.csv <- function(fnam) { read.csv(fnam, header=FALSE, skip=1, sep=';',
    col.names=c('ID','tweet','author','local.time'),
    colClasses=rep('character', 4)) }
# read in all those files into one giant data.frame
my.df <- do.call("rbind", lapply(filenames, my.read.csv))
# remove the duplicate tweets
my.new.df <- my.df[!duplicated(my.df$tweet),]

Based on the revised warnings after line 3, it's a problem with files with different numbers of columns. This is not easy to fix in general, except, as you have suggested, by having too many columns in the specification. If you remove the specification then you will run into problems when you try to rbind() the data.frames together...
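Before rbind()-ing anything, base R's count.fields() can also flag the ragged files up front. A sketch on two toy temp files (the field values are made up; a file is "ragged" if its rows don't all have the same field count):

```r
# Two temp files: one well-formed, one with a row carrying an extra field.
good <- tempfile(fileext = ".csv")
bad  <- tempfile(fileext = ".csv")
writeLines(c("1;x;A;t1", "2;y;B;t2"),      good)
writeLines(c("1;x;A;t1", "2;y;B;t2;oops"), bad)

# count.fields() returns the number of ';'-separated fields on each row;
# more than one distinct count means the file is ragged.
is.ragged <- function(f) length(unique(count.fields(f, sep = ";"))) > 1
ragged <- Filter(is.ragged, c(good, bad))
length(ragged)  # 1: only the second file is flagged
```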

Here is some code using a for() loop and some debugging cat() statements to make it more explicit which files are broken, so that you can fix things:

filenames <- list.files(path = ".", pattern='^.*\\.csv$')

n.files.processed <- 0 # how many files did we process?
for (fnam in filenames) {
  cat('about to read from file:', fnam, '\n')
  if (exists('tmp.df')) rm(tmp.df)
  tmp.df <- read.csv(fnam, header=FALSE, skip=1, sep=';',
             col.names=c('ID','tweet','author','local.time','extra'),
             colClasses=rep('character', 5)) 
  if (exists('tmp.df') && (nrow(tmp.df) > 0)) {
    cat('  successfully read:', nrow(tmp.df), ' rows from ', fnam, '\n')
    n.files.processed <- n.files.processed + 1  # count this file as processed
    # now lets append a column containing the originating file name
    # so that debugging the file contents is easier
    tmp.df$fnam <- fnam

    # now lets rbind everything together
    if (exists('my.df')) {
      my.df <- rbind(my.df, tmp.df)
    } else {
      my.df <- tmp.df
    }
  } else {
    cat('  read NO rows from ', fnam, '\n')
  }
}
cat('processed ', n.files.processed, ' files\n')
my.new.df <- my.df[!duplicated(my.df$tweet),]
