How to get unique lines from a very large file in Linux?

Views: 137 | Posted: 2020/4/29 3:31:26 | Tags: linux, large-files, uniq


Problem description

I have a very large data file (255G; 3,192,563,934 lines). Unfortunately, I only have 204G of free space on the device, and there are no other devices I can use. I took a random sample and found that in a given 100K lines there are about 10K unique lines, but the file isn't sorted.
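A sampling estimate like the one described could be reproduced roughly as follows. This is only a sketch: the function name `estimate_unique` and the default sample size are illustrative assumptions, and it relies on GNU `shuf` and `awk` being available.

```shell
# Estimate how many lines in a random sample are unique.
# Usage: estimate_unique FILE [SAMPLE_SIZE]
# (estimate_unique is a hypothetical helper name, not from the question.)
estimate_unique() {
    src="$1"
    n="${2:-100000}"
    # shuf -n prints at most n randomly chosen lines of the file.
    # awk counts a line as unique the first time it is seen;
    # NR in the END block is the number of sampled lines.
    shuf -n "$n" -- "$src" |
        LC_ALL=C awk '!seen[$0]++ { u++ } END { printf "%d unique of %d sampled\n", u, NR }'
}
```

Note that a low unique ratio within a sample does not by itself say how many distinct lines the whole file contains, so the result is a rough guide at best.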

Normally I would just run:

pv myfile.data | sort | uniq > myfile.data.uniq

and let it run for a day or so. That won't work in this case because there isn't enough space left on the device for sort's temporary files.
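One possible way to shrink the temporary files, sketched here under assumptions, is to let GNU `sort` drop duplicates itself with `-u` and to compress its temporary files with `--compress-program`. The function name `dedupe_sorted`, the `1G` buffer size, and the choice of `gzip` are illustrative assumptions. Whether the temporary files then fit into the available 204G depends on how many distinct lines the file really has, which the sample alone cannot guarantee.

```shell
# Sort and deduplicate in one step, keeping temporary files small.
# Usage: dedupe_sorted INPUT OUTPUT [TMPDIR]
# (dedupe_sorted is a hypothetical helper name; 1G and gzip are assumed values.)
dedupe_sorted() {
    src="$1"
    dst="$2"
    tmpdir="${3:-.}"
    # LC_ALL=C        compare raw bytes, which avoids locale-aware collation
    # -u              keep only the first of each run of equal lines
    # -S 1G           upper bound on the in-memory buffer
    # -T DIR          directory for temporary files
    # --compress-program=gzip  compress temporary files on disk
    LC_ALL=C sort -u -S 1G -T "$tmpdir" --compress-program=gzip -o "$dst" -- "$src"
}
```

With the names from the question, this would be called as `dedupe_sorted myfile.data myfile.data.uniq`.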

I was thinking I could perhaps use split and run a streaming uniq on maybe 500K lines at a time, writing the results into a new file. Is there a way to do something like that?
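The split idea could be sketched as below, assuming GNU `split`, whose `--filter` option pipes each chunk through a command and exposes the chunk's file name as `$FILE`. Each chunk is deduplicated before it is written, and the already sorted chunks are then merged with `sort -m -u`. The function name `dedupe_chunked`, the `chunk.` prefix, and the suffix length are illustrative assumptions. The chunks still need disk space, roughly in proportion to the unique ratio within each chunk.

```shell
# Deduplicate a file chunk by chunk, then merge the sorted chunks.
# Usage: dedupe_chunked INPUT OUTPUT [LINES_PER_CHUNK] [WORKDIR]
# (dedupe_chunked and the "chunk." prefix are hypothetical names.)
dedupe_chunked() {
    src="$1"
    dst="$2"
    lines="${3:-500000}"
    work="${4:-.}"
    # split streams the input; every chunk of $lines lines is piped to
    # "sort -u", whose output is stored under the chunk name in $FILE.
    # -a 5 allows up to 26^5 chunk names.
    LC_ALL=C split -a 5 -l "$lines" --filter='sort -u > "$FILE"' -- "$src" "$work/chunk." || return 1
    # Every chunk is sorted, so a merge (-m) with -u removes the
    # duplicates that remain across chunks.
    LC_ALL=C sort -m -u -o "$dst" -- "$work"/chunk.* || return 1
    rm -f -- "$work"/chunk.*
}
```

With several thousand chunks, `sort -m` merges in batches and may create temporary files of its own, so this does not remove the space constraint entirely.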

I thought I might be able to do something like:

tail -100000 myfile.data | sort | uniq >> myfile.uniq && trunc --magicstuff myfile.data

but I couldn't figure out a way to truncate the file properly.
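For the truncation step, the `truncate` utility from GNU coreutils accepts a relative size such as `-s -N`, which shrinks a file by N bytes. A minimal sketch of the tail-and-truncate loop follows. The function name `dedupe_consuming` is an illustrative assumption, and the approach is destructive: the input is consumed as it is processed, so an interruption can lose data. The output holds lines that are unique only within each step and still needs a final `sort -u`.

```shell
# Repeatedly deduplicate the last lines of a file, then cut them off.
# WARNING: this destroys INPUT as it goes.
# Usage: dedupe_consuming INPUT OUTPUT [LINES_PER_STEP]
# (dedupe_consuming is a hypothetical helper name.)
dedupe_consuming() {
    src="$1"
    dst="$2"
    lines="${3:-100000}"
    while [ -s "$src" ]; do
        # Size in bytes of the last $lines lines.
        bytes=$(( $(tail -n "$lines" -- "$src" | wc -c) ))
        # Append the deduplicated tail to the output; stop on failure
        # so that nothing is truncated before it has been saved.
        tail -n "$lines" -- "$src" | LC_ALL=C sort -u >> "$dst" || return 1
        # Shrink the input by exactly the bytes just processed.
        truncate -s "-$bytes" -- "$src" || return 1
    done
}
```

Because the input shrinks while the output grows, the space freed by each truncation becomes available to the later `sort -u` pass over the output.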
