Remove duplicate lines but keep the one that does not have a string

Problem Description

I have been looking for a while for how to remove duplicates from my CSV files. I started with a file with multiple fields, but then I realized that I could just work with one file with 2 fields and then merge the files using the first field. Here is what I want to do: I have this CSV file, and as you can see there are genes with more than one description. Some of them have two descriptions: one is "hypothetical protein" and the other is "something else". In that case I want to remove the one with "hypothetical protein" and keep the line with "something else". However, if there is more than one description, I can just keep the first one. I have been trying it with awk. It would be great if I could use awk for it.

Example input:

AAEL018330  hypothetical protein
AAEL018330  tropomyosin, putative
AAEL018331  hypothetical protein
AAEL018332  
AAEL018333  hypothetical protein
AAEL018333  colmedin

Desired output:

AAEL018330  tropomyosin, putative
AAEL018331  hypothetical protein
AAEL018332  
AAEL018333  colmedin

Thank you.

Recommended Answer

In the general (unsorted) case, if you want to keep the last line seen for each key field, you can use something like:

awk '{seen[$1]=$0} END {for (i in seen) {print seen[i]}}' file

Though that isn't guaranteed to keep sort order.
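
If keeping the lines in their original input order matters, one way to do it (a sketch, not part of the original answer) is to also record the order in which each key is first seen:

awk '!($1 in seen) {order[++n] = $1} {seen[$1] = $0} END {for (i = 1; i <= n; i++) print seen[order[i]]}' file

Here order[] remembers the first-appearance position of each first-field value, while seen[] still holds the last line for that value, so the END loop prints one line per key in input order.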

In this case, with sorted input something like this should work:

awk 'f!=$1 && line{print line} {f=$1; line=$0} END {print line}' file
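
Both one-liners keep the last line per gene ID, which matches the desired output here only because "hypothetical protein" happens to come first. If you would rather encode the rule from the question directly (prefer a description that is not "hypothetical protein", keep the first such description, and fall back to "hypothetical protein" only when nothing else exists), a sketch along these lines should work on sorted or unsorted input; file stands in for your CSV file:

awk '
  # first time this gene ID appears: remember its input position and keep the line
  !($1 in best) { ids[++n] = $1; best[$1] = $0; next }
  # replace the stored line only if it says "hypothetical protein" and the new one does not
  best[$1] ~ /hypothetical protein/ && $0 !~ /hypothetical protein/ { best[$1] = $0 }
  # print one line per gene ID, in the order the IDs first appeared
  END { for (i = 1; i <= n; i++) print best[ids[i]] }
' file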
