remove dups from many csv files


Problem description

Given n csv files that add up to 100 GB in size, I need to remove duplicate rows based on the following rules and conditions:


  • The csv files are numbered 1.csv to n.csv, and each file is about 50MB in size.
  • The first column is a string key; two rows are considered duplicates if their first columns are the same.
  • I want to remove dups by keeping the one in the later file (2.csv is considered later than 1.csv).

My algorithm is the following; I want to know if there's a better one.


  • merge all files into one giant file

cat *.csv > one.csv


  • sort the csv

    sort one.csv >one_sorted.csv
    


  • not sure how to eliminate dups at this point. uniq has a -f flag that skips the first N fields, but in my case I want to skip all but the first field.

    I need help with the last step (eliminating dups in a sorted file). Also is there a more efficient algorithm?
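
    For reference, one possible shape for that last step, sketched under the assumption that GNU ls -v and a stable GNU sort -s are available and that keys contain no quoted commas: concatenate in numeric file order, sort stably on the key only so that rows from later files stay later within each key group, and let awk keep the last row of each group (deduped.csv is just a placeholder output name).

    # Concatenate in numeric order (1.csv, 2.csv, ..., n.csv), not glob order.
    cat $(ls -v *.csv) > one.csv
    # Stable sort on the first comma-separated field only, so ties keep file order.
    sort -t, -k1,1 -s one.csv > one_sorted.csv
    # Print the last line of each run of identical keys, i.e. the latest file's row.
    awk -F, 'NR > 1 && $1 != prev { print saved }
             { prev = $1; saved = $0 }
             END { if (NR) print saved }' one_sorted.csv > deduped.csv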

    Recommended answer

    If you can keep the lines in memory

    If enough of the data will fit in memory, the awk solution by steve (http://stackoverflow.com/questions/12888748/remove-dups-from-many-csv-files/12888913#12888913) is pretty neat, whether you write to the sort command through a pipe within awk or simply pipe the output of the unadorned awk to sort at the shell level.
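
    For context, a minimal sketch of that kind of in-memory approach (not necessarily the exact script from that answer, and again assuming GNU ls -v for numeric file order; deduped.csv is a placeholder name): later files overwrite earlier entries in the array, so the last file read wins for each key, and the final pipe to sort only makes the output order deterministic.

    # Keep the latest row seen for each key; array order is arbitrary, hence the sort.
    awk -F, '{ latest[$1] = $0 }
             END { for (k in latest) print latest[k] }' $(ls -v *.csv) |
        sort -t, -k1,1 > deduped.csv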

    If you have 100 GiB of data with perhaps 3% duplication, then you'll need to be able to store 100 GiB of data in memory. That's a lot of main memory. A 64-bit system might handle it with virtual memory, but it is likely to run rather slowly.

    If you can't fit enough of the data in memory, then the task ahead is much harder and will require at least two scans over the files. We need to assume, pro tem, that you can at least fit each key in memory, along with a count of the number of times the key has appeared.


    1. Scan 1: read the files.
      • Count the number of times each key appears in the input.
      • In awk, use icount[$1]++.

    2. Scan 2: reread the files.
      • Count the number of times each key has appeared; ocount[$1]++.
      • If icount[$1] == ocount[$1], then print the line.

    (This assumes you can store the keys and counts twice; the alternative is to use icount (only) in both scans, incrementing in Scan 1 and decrementing in Scan 2, printing the value when the count decrements to zero.)
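
    A minimal sketch of that two-array variant in awk, keeping the icount/ocount names from above; passing the file list twice with a pass= marker between the two copies is just one way to get the two scans (command-line var=value assignments are processed in order), ls -v is again a GNU extension, and deduped.csv is a placeholder output name.

    # Scan 1 fills icount; Scan 2 fills ocount and prints each key's last occurrence.
    awk -F, 'pass == 1 { icount[$1]++; next }
             { if (++ocount[$1] == icount[$1]) print }
            ' pass=1 $(ls -v *.csv) pass=2 $(ls -v *.csv) > deduped.csv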

    I'd probably use Perl for this rather than awk, if only because it will be easier to reread the files in Perl than in awk.

    What about if you can't even fit the keys and their counts into memory? Then you are facing some serious problems, not least because scripting languages may not report the out of memory condition to you as cleanly as you'd like. I'm not going to attempt to cross this bridge until it's shown to be necessary. And if it is necessary, we'll need some statistical data on the file sets to know what might be possible:


    • The average length of a record.
    • The number of distinct keys.
    • The number of distinct keys with N occurrences, for each of N = 1, 2, ... max.
    • The length of a key.
    • The number of keys plus counts that can be fitted into memory.

    And probably some others...so, as I said, let's not try crossing that bridge until it is shown to be necessary.

    Example data

    $ cat x000.csv
    abc,123,def
    abd,124,deg
    abe,125,deh
    $ cat x001.csv
    abc,223,xef
    bbd,224,xeg
    bbe,225,xeh
    $ cat x002.csv
    cbc,323,zef
    cbd,324,zeg
    bbe,325,zeh
    $ perl fixdupcsv.pl x???.csv
    abd,124,deg
    abe,125,deh
    abc,223,xef
    bbd,224,xeg
    cbc,323,zef
    cbd,324,zeg
    bbe,325,zeh
    $ 
    

    Note the absence of gigabyte-scale testing!

    This uses the 'count up, count down' technique.

    #!/usr/bin/env perl
    #
    # Eliminate duplicate records from 100 GiB of CSV files based on key in column 1.
    
    use strict;
    use warnings;
    
    # Scan 1 - count occurrences of each key
    
    my %count;
    my @ARGS = @ARGV;   # Preserve arguments for Scan 2
    
    while (<>)
    {
        $_ =~ /^([^,]+)/;
        $count{$1}++;
    }
    
    # Scan 2 - reread the files; count down occurrences of each key.
    # Print when it reaches 0.
    
    @ARGV = @ARGS;      # Reset arguments for Scan 2
    
    while (<>)
    {
        $_ =~ /^([^,]+)/;
        $count{$1}--;
        print if $count{$1} == 0;
    }
    

    The 'while (<>)' notation destroys @ARGV (hence the copy to @ARGS before doing anything else), but that also means that if you reset @ARGV to the original value, it will run through the files a second time. Tested with Perl 5.16.0 and 5.10.0 on Mac OS X 10.7.5.

    This is Perl; TMTOWTDI. You could use:

    #!/usr/bin/env perl
    #
    # Eliminate duplicate records from 100 GiB of CSV files based on key in column 1.
    
    use strict;
    use warnings;
    
    my %count;
    
    sub counter
    {
        my($inc) = @_;
        while (<>)
        {
            $_ =~ /^([^,]+)/;
            $count{$1} += $inc;
            print if $count{$1} == 0;
        }
    }
    
    my @ARGS = @ARGV;   # Preserve arguments for Scan 2
    counter(+1);
    @ARGV = @ARGS;      # Reset arguments for Scan 2
    counter(-1);
    

    There are probably ways to compress the body of the loop, too, but I find what's there reasonably clear and prefer clarity over extreme terseness.

    You need to present the fixdupcsv.pl script with the file names in the correct order. Since you have files numbered from 1.csv through about 2000.csv, it is important not to list them in alphanumeric order. The other answers suggest ls -v *.csv using the GNU ls extension option. If it is available, that's the best choice.

    perl fixdupcsv.pl $(ls -v *.csv)
    

    If that isn't available, then you need to do a numeric sort on the names:

    perl fixdupcsv.pl $(ls *.csv | sort -t. -k1.1n)
    


    Awk solution

    awk -F, '
    BEGIN   {
                for (i = 1; i < ARGC; i++)
                {
                    while ((getline < ARGV[i]) > 0)
                        count[$1]++;
                    close(ARGV[i]);
                }
                for (i = 1; i < ARGC; i++)
                {
                    while ((getline < ARGV[i]) > 0)
                    {
                        count[$1]--;
                        if (count[$1] == 0) print;
                    }
                    close(ARGV[i]);
                }
            }' 
    

    This ignores awk's innate 'read' loop and does all reading explicitly (you could replace BEGIN by END and would get the same result). The logic is closely based on the Perl logic in many ways. Tested on Mac OS X 10.7.5 with both BSD awk and GNU awk. Interestingly, GNU awk insisted on the parentheses in the calls to close where BSD awk did not. The close() calls are necessary in the first loop to make the second loop work at all. The close() calls in the second loop are there to preserve symmetry and for tidiness — but they might also be relevant when you get around to processing a few hundred files in a single run.
