How to edit a 300 GB text file (genomics data)?


Question

I have a 300 GB text file that contains genomics data with over 250k records. Some records contain bad data, and our genomics program 'Popoolution' allows us to comment out the "bad" records with an asterisk. Our problem is that we cannot find a text editor that will load the data so that we can comment out the bad records. Any suggestions? We have both Windows and Linux boxes.

Update: more information

The program Popoolution (https://code.google.com/p/popoolation/) crashes when it reaches a "bad" record, giving us the line number, which we can then comment out. Specifically, we get a message from Perl that says "F#€%& Scaffolding". The manual suggests we can just use an asterisk to comment out the bad line. Sadly, we will have to repeat this process many times...

One more thought... Is there an approach that would allow us to add the asterisk to the line without opening the entire text file at once? This could be very useful given that we will have to repeat the process an unknown number of times.

Recommended answer

Based on your update:

One more thought... Is there an approach that would allow us to add the asterisk to the line without opening the entire text file at once? This could be very useful given that we will have to repeat the process an unknown number of times.

Here is one approach: if you know the line number, you can add an asterisk at the beginning of that line with:

sed 'LINE_NUMBER s/^/*/' file

See an example:

$ cat file
aa
bb
cc
dd
ee
$ sed '3 s/^/*/' file
aa
bb
*cc
dd
ee
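
Before changing anything, it may be worth printing just the target line to confirm it is the record the program complained about. The following is a minimal sketch using the same five-line sample file; the `q` command makes sed stop after the requested line, so it should not need to read the rest of a very large file.

```shell
# Recreate the five-line sample file used in the example above.
printf 'aa\nbb\ncc\ndd\nee\n' > file

# Print only line 3, then quit so the remaining lines are never read.
sed -n '3{p;q;}' file
```

Replace 3 with the line number reported in the crash message.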

If you add -i, the file will be updated in place:

$ sed -i '3 s/^/*/' file
$ cat file
aa
bb
*cc
dd
ee
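
If you do edit in place, a suffix can be attached to -i so that sed keeps a copy of the original. This is a sketch on the same sample file; the `.bak` suffix is just an illustrative choice. As far as I know, sed -i works by writing a new copy of the file and then renaming it, so with or without a backup you should expect to need free disk space for a second full copy while it runs.

```shell
# Recreate the five-line sample file.
printf 'aa\nbb\ncc\ndd\nee\n' > file

# Edit in place, keeping the untouched original as file.bak.
sed -i.bak '3 s/^/*/' file

# Line 3 now starts with an asterisk; file.bak still holds the original.
cat file
```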

That said, I always think it is better to redirect the output to another file:

sed '3 s/^/*/' file > new_file

so that your original file stays intact and the updated one is saved in new_file.
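
Since the process has to be repeated many times, it may help to mark several lines in one pass once more than one bad line number is known. Here is a minimal sketch, assuming lines 2 and 4 of the sample file are the bad ones (the line numbers are made up for illustration); each `-e` adds one more edit, and the file is read only once.

```shell
# Recreate the five-line sample file.
printf 'aa\nbb\ncc\ndd\nee\n' > file

# One -e expression per bad line number; the original file is left untouched.
sed -e '2 s/^/*/' -e '4 s/^/*/' file > new_file

cat new_file
```

On a 300 GB file every pass takes a long time, so batching the known line numbers this way should save a full read of the file for each extra line.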
