Fast md5sum on millions of strings in bash/ubuntu

Published: 2020/5/8 0:36:18 · Tags: ubuntu, md5, md5sum

Question

I need the MD5 sums of about 3 million strings in a bash script on Ubuntu: 3 million strings -> 3 million MD5 hashes. The trivial implementation takes about 0.005 s per string, which adds up to over 4 hours. What faster alternatives exist? Is there a way to pump groups of strings into md5sum?

#time md5sum running 100 times on short strings
#each iteration is ~0.494s/100 = 0.005s
time (for i in {0..99}; do md5sum <(echo $i); done) > /dev/null

real    0m0.494s
user    0m0.120s
sys     0m0.356s

A good solution will include a bash/Perl script that takes a list of strings from stdin and outputs a list of their MD5 hashes.

Recommended answer

It's not hard to do in C (or Perl or Python) using any of the many md5 implementations. At its heart, md5 is a hash function that maps a character vector to a character vector.

So just write an outer program that reads your 3 million strings and feeds them one by one to the md5 implementation of your choice. That way you have one program startup rather than 3 million, and that alone will save you time.
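As a minimal sketch of that idea in Python (one of the languages mentioned above), the script below reads strings from stdin, one per line, and prints one MD5 hash per line. The function name `md5_lines` and the choice of UTF-8 encoding are assumptions made for illustration; they are not part of the original answer.

```python
import hashlib
import sys


def md5_lines(lines):
    """Yield the hex MD5 digest of each line, excluding its trailing newline."""
    for line in lines:
        text = line.rstrip("\n")
        # Assumption: the strings are hashed as UTF-8 bytes.
        yield hashlib.md5(text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # A single interpreter startup handles every string read from stdin.
    for digest in md5_lines(sys.stdin):
        print(digest)
```

Saved under a hypothetical name such as md5_lines.py, it could be run as `python3 md5_lines.py < strings.txt > hashes.txt`. Note that it hashes each string without its trailing newline, whereas `md5sum <(echo $i)` in the question hashes the newline that echo appends.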

FWIW, in one project I used the md5 implementation (in C) by Christophe Devine. There is OpenSSL's as well, and I am sure CPAN has a number of them for Perl too.

OK, couldn't resist. The md5 implementation I mentioned is available, e.g., inside this small tarball. Take the file md5.c and replace the (#ifdef'ed out) main() at the bottom with this:

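/* Replacement main(): hash each whitespace-delimited string in the file named by argv[1]. */
int main( int argc, char *argv[] ) {
    FILE *f;
    int j;
    md5_context ctx;
    unsigned char buf[1000];
    unsigned char md5sum[16];

    if( argc < 2 ) {
        fprintf( stderr, "usage: %s <file>\n", argv[0] );
        return( 1 );
    }

    if( ! ( f = fopen( argv[1], "rb" ) ) ) {
        perror( "fopen" );
        return( 1 );
    }

    /* %999s caps each token at 999 chars plus the terminating NUL, so buf cannot overflow */
    while( fscanf( f, "%999s", (char *) buf ) == 1 ) {
        md5_starts( &ctx );
        md5_update( &ctx, buf, (uint32) strlen( (char *) buf ) );
        md5_finish( &ctx, md5sum );

        /* print the 16-byte digest as 32 hex characters, then the input string */
        for( j = 0; j < 16; j++ ) {
            printf( "%02x", md5sum[j] );
        }
        printf( " <- %s\n", buf );
    }

    fclose( f );
    return( 0 );
}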

Build a simple standalone program (the edited file is saved here as simple_md5.c), e.g. with

/tmp$ gcc -Wall -O3 -o simple_md5 simple_md5.c

and then you get this:

# first, generate 300,000 numbers in a file (using 'little r', an R variant)
/tmp$ r -e'for (i in 1:300000) cat(i,"\n")' > foo.txt

# illustrate the output
/tmp$ ./simple_md5 foo.txt | head
c4ca4238a0b923820dcc509a6f75849b <- 1
c81e728d9d4c2f636f067f89cc14862c <- 2
eccbc87e4b5ce2fe28308fd9f2a7baf3 <- 3
a87ff679a2f3e71d9181a67b7542122c <- 4
e4da3b7fbbce2345d7772b0674a318d5 <- 5
1679091c5a880faf6fb5e6087eb1b2dc <- 6
8f14e45fceea167a5a36dedd4bea2543 <- 7
c9f0f895fb98ab9159f51fd0297e236d <- 8
45c48cce2e2d7fbdea1afc51c7c6ad26 <- 9
d3d9446802a44259755d38e6d163e820 <- 10

# let the program rip over it, suppressing stdout
/tmp$ time (./simple_md5 foo.txt > /dev/null)

real    0m1.023s
user    0m1.008s
sys     0m0.012s
/tmp$
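As an aside, the sample hashes above can be cross-checked independently. The sketch below uses Python's hashlib and a hypothetical helper, `matches`, to test whether a listed digest corresponds to the bare string or to the string plus a trailing newline. Because the C program reads tokens with fscanf and hashes only strlen bytes, the bare string is the one expected to match. `md5sum <(echo $i)` from the question would instead include the newline and give different hashes.

```python
import hashlib

# First three hashes copied from the sample output above.
SAMPLE = {
    "1": "c4ca4238a0b923820dcc509a6f75849b",
    "2": "c81e728d9d4c2f636f067f89cc14862c",
    "3": "eccbc87e4b5ce2fe28308fd9f2a7baf3",
}


def matches(text, digest, newline=False):
    """Return True if digest is the MD5 of text (optionally with a trailing newline)."""
    data = (text + "\n" if newline else text).encode("ascii")
    return hashlib.md5(data).hexdigest() == digest


for text, digest in SAMPLE.items():
    # Columns: string, matches without newline, matches with newline.
    print(text, matches(text, digest), matches(text, digest, newline=True))
```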

So the timing run above takes about a second for 300,000 (short) strings.
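To put the two measurements side by side, here is a rough extrapolation to 3 million strings. It assumes the per-string cost stays constant and simply reuses the two timings quoted in this thread (0.494 s per 100 strings for the loop, 1.023 s per 300,000 strings for the single program). Those were presumably taken on different machines, so treat the result as an order-of-magnitude estimate only.

```python
# Timings quoted in the thread; everything else is simple scaling.
naive_per_string = 0.494 / 100        # seconds per string, one md5sum process each
batch_per_string = 1.023 / 300_000    # seconds per string, one process for all
n_strings = 3_000_000

naive_total = naive_per_string * n_strings   # seconds
batch_total = batch_per_string * n_strings   # seconds

print(f"one process per string: {naive_total / 3600:.1f} hours")
print(f"single process:         {batch_total:.1f} seconds")
```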
