Merge multiple CSV files and remove duplicates in R


Problem Description


I have almost 3,000 CSV files (containing tweets) with the same format. I want to merge these files into one new file and remove the duplicate tweets. I have come across various topics discussing similar questions; however, the number of files is usually quite small. I hope you can help me write code in R that does this job both efficiently and effectively.


The CSV files have the following format:

Image of the CSV format:


I changed (in columns 2 and 3) the usernames (on Twitter) to A-E and the 'actual names' to A1-E1.

Original text file:

"tweet";"author";"local.time"
"1";"2012-06-05 00:01:45 @A (A1):  Cruijff z'n met-zwart-shirt-zijn-ze-onzichtbaar logica is even mooi ontkracht in #bureausport.";"A (A1)";"2012-06-05 00:01:45"
"2";"2012-06-05 00:01:41 @B (B1):  Welterusten #BureauSport";"B (B1)";"2012-06-05 00:01:41"
"3";"2012-06-05 00:01:38 @C (C1):  Echt ..... eindelijk een origineel sportprogramma #bureausport";"C (C1)";"2012-06-05 00:01:38"
"4";"2012-06-05 00:01:38 @D (D1):  LOL. "Na onderzoek op de Fontys Hogeschool durven wij te stellen dat.." Want Fontys staat zo hoog aangeschreven? #bureausport";"D (D1)";"2012-06-05 00:01:38"
"5";"2012-06-05 00:00:27 @E (E1):  Ik kijk Bureau sport op Nederland 3. #bureausport  #kijkes";"E (E1)";"2012-06-05 00:00:27"


Somehow my headers are messed up; they obviously should move one column to the right. Each CSV file contains up to 1,500 tweets. I would like to remove the duplicates by checking the 2nd column (containing the tweets), simply because these should be unique, while the author column can repeat (e.g. one author posting multiple tweets).


Is it possible to combine merging the files and removing the duplicates, or is this asking for trouble and should the processes be separated? As a starting point I have included links to two blog posts from Hayward Godwin that discuss three approaches for merging CSV files.

http://psychwire.wordpress.com/2011/06/03/merge-all-files-in-a-directory-using-r-into-a-single-dataframe/

http://psychwire.wordpress.com/2011/06/05/testing-different-methods-for-merging-a-set-of-files-into-a-dataframe/


Obviously there are some topics related to my question on this site as well (e.g. Merging multiple csv files in R) but I haven't found anything that discusses both merging and removing the duplicates. I really hope you can help me and my limited R knowledge deal with this challenge!


Although I have tried some code I found on the web, it didn't actually result in an output file. The approximately 3,000 CSV files have the format discussed above. I mainly tried the following code (for the merge part):

filenames <- list.files(path = "~/")
do.call("rbind", lapply(filenames, read.csv, header = TRUE))              

This results in the following error:

Error in file(file, "rt") : cannot open the connection 
In addition: Warning message: 
In file(file, "rt") : 
  cannot open file '..': No such file or directory 
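
For reference, a likely cause of this error is that list.files() returns bare file names without their directory, so read.csv() cannot open them when the working directory differs. A minimal sketch of one fix, using the full.names argument (the path here is only an example):

# full.names = TRUE makes list.files() return paths that include the
# directory, so read.csv() can open them from any working directory
filenames <- list.files(path = "~/", pattern = "\\.csv$", full.names = TRUE)
my.df <- do.call("rbind", lapply(filenames, read.csv, header = TRUE))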

UPDATE


I have tried the following code:

 # grab our list of filenames
 filenames <- list.files(path = ".", pattern='^.*\\.csv$')
 # write a special little read.csv function to do exactly what we want
 my.read.csv <- function(fnam) { read.csv(fnam, header=FALSE, skip=1, sep=';',
     col.names=c('ID','tweet','author','local.time'),
     colClasses=rep('character', 4)) }
 # read in all those files into one giant data.frame
 my.df <- do.call("rbind", lapply(filenames, my.read.csv))
 # remove the duplicate tweets
 my.new.df <- my.df[!duplicated(my.df$tweet),]

But I ran into the following errors:

After the 3rd line I get:

  Error in read.table(file = file, header = header, sep = sep, quote = quote,  :  more columns than column names


After the 4th line I get:

  Error: object 'my.df' not found


I suspect that these errors are caused by some failures in the writing process of the CSV files, since there are some cases of the author/local.time being in the wrong column, either to the left or the right of where they are supposed to be, which results in an extra column. I manually adapted 5 files and tested the code on them; I didn't get any errors. However, it seemed like nothing happened at all, and I didn't get any output from R.


To solve the extra column problem I adjusted the code slightly:

 # grab our list of filenames
 filenames <- list.files(path = ".", pattern='^.*\\.csv$')
 # write a special little read.csv function to do exactly what we want
 my.read.csv <- function(fnam) { read.csv(fnam, header=FALSE, skip=1, sep=';',
     col.names=c('ID','tweet','author','local.time','extra'),
     colClasses=rep('character', 5)) }
 # read in all those files into one giant data.frame
 my.df <- do.call("rbind", lapply(filenames, my.read.csv))
 # remove the duplicate tweets
 my.new.df <- my.df[!duplicated(my.df$tweet),]


I tried this code on all the files; although R clearly started processing, I eventually got the following errors:

 Error in read.table(file = file, header = header, sep = sep, quote = quote,  : more columns than column names
 In addition: Warning messages:
 1: In read.table(file = file, header = header, sep = sep, quote = quote,  : incomplete final line found by readTableHeader on 'Twitts -  di mei 29 19_22_30 2012 .csv'
 2: In read.table(file = file, header = header, sep = sep, quote = quote,  : incomplete final line found by readTableHeader on 'Twitts -  di mei 29 19_24_31 2012 .csv'

 Error: object 'my.df' not found

What am I doing wrong?

Answer


First, simplify matters by working in the folder where the files are, and try setting the pattern to read only files ending in '.csv', so something like

filenames <- list.files(path = ".", pattern='^.*\\.csv$')
my.df <- do.call("rbind", lapply(filenames, read.csv, header = TRUE))


This should get you a data.frame with the contents of all the tweets.


A separate issue is the headers in the csv files. Thankfully you know that all files are identical, so I'd handle those something like this:

read.csv('fred.csv', header=FALSE, skip=1, sep=';',
    col.names=c('ID','tweet','author','local.time'),
    colClasses=rep('character', 4))


N.B. changed so that all columns are character, and ';'-separated.


I'd parse out the time later if it was needed...
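
For reference, one way that later parsing could look, assuming the 'YYYY-MM-DD HH:MM:SS' format shown in the sample rows above:

# a sketch of the later parsing step: convert the character timestamps
# into POSIXct date-times (format assumed from the sample data)
my.df$local.time <- as.POSIXct(my.df$local.time, format = "%Y-%m-%d %H:%M:%S")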


A further separate issue is the uniqueness of the tweets within the data.frame - but I'm not clear if you want them to be unique to a user or globally unique. For globally unique tweets, something like

my.new.df <- my.df[!duplicated(my.df$tweet),]


For tweets unique per author, I'd append the two fields - though it's hard to know what works without the real data!

my.new.df <- my.df[!duplicated(paste(my.df$tweet, my.df$author)),]
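
Equivalently, duplicated() also works row-wise on a data.frame, so the paste() trick can be avoided by passing both columns directly:

# duplicated() applied to a two-column data.frame marks rows whose
# (tweet, author) pair has already been seen
my.new.df <- my.df[!duplicated(my.df[, c('tweet', 'author')]), ]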


So bringing it all together and assuming a few things along the way...

# grab our list of filenames
filenames <- list.files(path = ".", pattern='^.*\\.csv$')
# write a special little read.csv function to do exactly what we want
my.read.csv <- function(fnam) { read.csv(fnam, header=FALSE, skip=1, sep=';',
    col.names=c('ID','tweet','author','local.time'),
    colClasses=rep('character', 4)) }
# read in all those files into one giant data.frame
my.df <- do.call("rbind", lapply(filenames, my.read.csv))
# remove the duplicate tweets
my.new.df <- my.df[!duplicated(my.df$tweet),]


Based on the revised warnings after line 3, it's a problem with files having different numbers of columns. This is not easy to fix in general, except, as you have suggested, by allowing for the extra column in the specification. If you remove the specification, then you will run into problems when you try to rbind() the data.frames together...
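
One way to find the malformed files up front (a diagnostic sketch, assuming the same ';' separator and double-quote quoting as above) is count.fields(), which reports the number of fields on each line of a file:

# report files whose lines do not all contain the same number of
# ';'-separated fields -- these are the candidates for manual repair
for (fnam in filenames) {
  n.fields <- count.fields(fnam, sep = ';', quote = '"')
  if (length(unique(n.fields)) > 1) {
    cat(fnam, 'has inconsistent field counts:', unique(n.fields), '\n')
  }
}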


Here is some code using a for() loop and some debugging cat() statements to make more explicit which files are broken so that you can fix things:

filenames <- list.files(path = ".", pattern='^.*\\.csv$')

n.files.processed <- 0 # how many files did we process?
for (fnam in filenames) {
  cat('about to read from file:', fnam, '\n')
  if (exists('tmp.df')) rm(tmp.df)
  tmp.df <- read.csv(fnam, header=FALSE, skip=1, sep=';',
             col.names=c('ID','tweet','author','local.time','extra'),
             colClasses=rep('character', 5))
  if (exists('tmp.df') & (nrow(tmp.df) > 0)) {
    cat('  successfully read:', nrow(tmp.df), ' rows from ', fnam, '\n')
    n.files.processed <- n.files.processed + 1 # count this file as processed
    # now let's append a column containing the originating file name
    # so that debugging the file contents is easier
    tmp.df$fnam <- fnam

    # now let's rbind everything together
    if (exists('my.df')) {
      my.df <- rbind(my.df, tmp.df)
    } else {
      my.df <- tmp.df
    }
  } else {
    cat('  read NO rows from ', fnam, '\n')
  }
}
cat('processed ', n.files.processed, ' files\n')
my.new.df <- my.df[!duplicated(my.df$tweet),]
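
Once my.new.df looks right, it can be written back out to a single ';'-separated file, mirroring the input format (the output file name below is just a placeholder):

# write the merged, de-duplicated tweets to one ';'-separated file;
# the file name is only an example
write.table(my.new.df, file = "all_tweets_deduplicated.csv",
            sep = ";", row.names = FALSE)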

