Command line to merge lines with matching first field, 50 GB input

Problem description

A while back, I asked a question about merging lines that have a common first field. Here's the original: Command line to match lines with matching first field (sed, awk, etc.) (http://stackoverflow.com/questions/18493326/command-line-to-match-lines-with-matching-first-field-sed-awk-etc)

Sample input:

a|lorem
b|ipsum
b|dolor
c|sit
d|amet
d|consectetur
e|adipisicing
e|elit

Desired output:

b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit

The idea is that if consecutive lines share the same first field, they are merged into one line. The input is sorted. The actual content is more complex, but it uses the pipe as the sole delimiter.

The methods provided in the prior question worked well on my 0.5 GB file, processing it in ~16 seconds. However, my new file is approximately 100x larger, and I would prefer a method that streams. In theory, if the time scales linearly (16 s × 100 ≈ 27 minutes), this should be able to run in ~30 minutes. The prior method had still not completed after running for 24 hours.

Running on macOS (i.e., BSD-type Unix).

Ideas? [Note: the prior answer to the prior question was NOT a one-liner.]

Recommended answer

You can append your results to a file on the fly so that you don't need to build a 50 GB array (which I assume you don't have the memory for!). This command concatenates the joined fields for each distinct index into a string, which is written to a file named after that index plus some suffix.

Edit: on the basis of the OP's comment that the content may contain spaces, I would suggest using -F"|" instead of sub. The following answer is also designed to write to standard output.

(New) code:

# -F"|" splits each line on the pipe.
# If $1 equals the current index i, append $2 to the string a.
# If $1 differs from i (or i is not set yet), print the string a built for the
# previous index (a single blank line on the first input line), then reset a
# to the $2 of the new index.
# Set i to $1 on every line.
# The END block prints the last string a.
# Reads standard input unless a file name is appended.
awk -F"|" '$1==i{a=a"|"$2}$1!=i{print a; a=$2}{i=$1}END{print a}'

This builds a string of "data" while in a given index, prints it when the index changes, and then starts building the next string for the new index until that one ends, and so on.
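Note that the one-liner above prints only the joined data (not the first field), prints every group including lines whose first field is unique, and starts with a blank line. If the exact output shown in the question is wanted, a variant along the following lines may be closer. It is a minimal sketch, not the answerer's method: it assumes the input is sorted so that matching lines are adjacent, and the function name `merge_matching` is hypothetical. It keeps only one group in memory at a time, so it still streams.

```shell
# Hypothetical helper: merge adjacent lines that share the same first field.
# Reads the files given as arguments, or standard input if none are given.
merge_matching() {
  awk -F'|' '
    # Print the finished group only if it absorbed at least two input lines.
    function emit() { if (count > 1) print line }

    # A new key starts here (the "" forces a string comparison of the keys).
    NR == 1 || ($1 "") != key {
        emit()
        key = $1 ""; line = $0; count = 1
        next
    }

    # Same key: append everything after the first field, pipe included.
    {
        line = line substr($0, length($1) + 1)
        count++
    }

    END { emit() }
  ' "$@"
}

# Demo on the sample input from the question.
printf '%s\n' 'a|lorem' 'b|ipsum' 'b|dolor' 'c|sit' \
              'd|amet' 'd|consectetur' 'e|adipisicing' 'e|elit' | merge_matching
```

On a real file this could be run as `merge_matching input.txt > merged.txt`, where both file names are placeholders. Because the whole remainder of each line is appended rather than just `$2`, lines with more than two pipe-separated fields are preserved. For plain ASCII data, prefixing the awk call with `LC_ALL=C` may speed things up, though that depends on the awk implementation.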
