Command line to merge lines with matching first field, 50 GB input

Views: 11. Published: 2021/12/24 12:24:07. Tags: regex, optimization, awk, sed

Problem description

A while back, I asked a question about merging lines that share a common first field. Here is the original: Command line to match lines with matching first field (sed, awk, etc.)

Sample input:

a|lorem
b|ipsum
b|dolor
c|sit
d|amet
d|consectetur
e|adipisicing
e|elit

Desired output:

b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit

The idea is that lines are merged when their first field matches. The input is sorted. The actual content is more complex, but the pipe is the sole delimiter.

The methods provided in the prior question worked well on my 0.5 GB file, processing it in about 16 seconds. However, my new file is approximately 100x larger, and I would prefer a method that streams. In theory, a streaming method should be able to run in about 30 minutes. The prior method had still not completed after running for 24 hours.

Running on macOS (i.e., BSD-type Unix).

Ideas? [Note: the prior answer to the prior question was NOT a one-liner.]

Recommended answer

You can append your results to a file on the fly so that you don't need to build a 50 GB array (which I assume you don't have the memory for!). This command concatenates the joined fields for each distinct index into a string, which is written to a file named after that index with some suffix.

On the basis of the OP's comment that the content may contain spaces, I would suggest using -F"|" instead of sub. The following answer is also designed to write to standard output.
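To illustrate why the field separator matters here: awk splits on whitespace by default, so a value containing a space would be broken across fields, whereas -F"|" keeps it whole. A small sketch, using a made-up sample line and illustrative variable names that are not part of the original answer:

```shell
# Made-up sample line whose value contains a space.
sample='b|lorem ipsum'

# Default splitting is on whitespace, so $1 is "b|lorem" and $2 is "ipsum".
default_split=$(printf '%s\n' "$sample" | awk '{print $2}')

# With -F"|" the pipe is the only separator, so $2 is the whole value.
pipe_split=$(printf '%s\n' "$sample" | awk -F"|" '{print $2}')

printf '%s\n' "$default_split" "$pipe_split"
```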

(New) code:

# Split each line on the pipe using -F"|".
# i holds the index ($1) of the previous line; a holds the string built so far.
# If $1 is still equal to i, append "|" and $2 to a.
# If $1 differs from i (or i is not set yet), print the current a,
#   then reset a to $2, the first data item of the new index.
#   (On the very first line a is still empty, so a single blank line is printed.)
# Set i to $1 on every line.
# The END block prints the last string a, which holds the data for the last index.
awk -F"|" '$1==i{a=a"|"$2}$1!=i{print a; a=$2}{i=$1}END{print a}'

This builds a string of "data" while within a given index, prints it when the index changes, then starts building the next string for the new index until that one ends, and so on.
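As written, the one-liner prints only the second field of each group, without the key, and it also prints keys that occur just once, so its output differs from the desired output shown in the question. A minimal sketch of a streaming variant that keeps the key, carries all remaining fields, and prints only merged groups might look like this. The function name merge_dups and the variables key, line, rest and n are illustrative and not from the original answer; the sketch assumes the input is sorted on the first field, as the question states, and that every line contains at least one pipe.

```shell
# Hypothetical helper (name is illustrative): reads sorted "key|value..." lines
# on stdin and prints one merged line per key that occurs more than once.
merge_dups() {
  awk -F'|' '
    $1 == key {                      # same key as the previous line
      rest = $0
      sub(/^[^|]*[|]/, "", rest)     # drop the leading "key|", keep the other fields
      line = line "|" rest
      n++
      next
    }
    {                                # key changed (or this is the first line)
      if (n > 1) print line          # flush the previous group only if it was merged
      key = $1; line = $0; n = 1
    }
    END { if (n > 1) print line }    # flush the final group
  '
}
```

It could be run as `merge_dups < input.txt > output.txt`, where both file names are placeholders. Only the current group is held in memory, so memory use does not grow with the size of the input.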
