Python to remove duplicates using only some, not all, columns


Problem description

I have a tab-delimited input.txt file like this:

A    B    C
A    B    D
E    F    G
E    F    T
E    F    K

The columns are separated by tabs.

I want to remove duplicates only when multiple rows have the same 1st and 2nd columns.

So, even though the 1st and 2nd rows differ in the 3rd column, they have the same 1st and 2nd columns, so I want to remove "A B D", which appears later.

So output.txt would look like this:

A    B    C
E    F    G

If I were removing duplicates the usual way, I would just pass the list to set() and be done.

But now I am trying to remove duplicates using only some of the columns.

In Excel, this is easy:

Data -> Remove Duplicates -> Select columns

In MATLAB, it's easy, too:

Import input.txt -> use the "unique" function with respect to the 1st and 2nd columns -> remove the rows numbered "1"

But in Python I couldn't find a way to do this, because all I knew about removing duplicates was using set.

===

This is what I tried, following undefined_is_not_a_function's answer.

I am not sure how to write the result to output.txt, or how to change the code so that I can specify which columns to use for duplicate removal (like 3 and 5).

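import sys

input_path = sys.argv[1]  # path to the input file, e.g. input.txt

seen = set()
data = []
with open(input_path) as f:
    for line in f.read().splitlines():
        # Use the first two whitespace-separated columns as the key
        key = tuple(line.split()[:2])
        if key not in seen:
            data.append(line)
            seen.add(key)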


Recommended answer

You should use itertools.groupby for this. Here I group the data by the first two columns and then use next() to get the first item from each group.

>>> from itertools import groupby
>>> s = '''A    B    C
A    B    D
E    F    G
E    F    T
E    F    K'''
>>> for k, g in groupby(s.splitlines(), key=lambda x: x.split()[:2]):
...     print(next(g))
...
A    B    C
E    F    G

Simply replace s.splitlines() with a file object if the input comes from a file.
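As a rough sketch of that file-based variant: the helper name first_of_each_run and its ncols parameter are made up for illustration, and io.StringIO stands in for a real file so the snippet runs on its own. Lines read from a file keep their trailing newline, so it is stripped before printing.

```python
import io
from itertools import groupby


def first_of_each_run(lines, ncols=2):
    """Yield the first line of each run of consecutive lines that share
    the same first `ncols` whitespace-separated columns.

    Hypothetical helper, not part of the original answer.
    """
    for _, group in groupby(lines, key=lambda line: line.split()[:ncols]):
        yield next(group)


# io.StringIO stands in for a real file here; with a real file you would
# write `with open("input.txt") as fh:` instead.
fh = io.StringIO("A\tB\tC\nA\tB\tD\nE\tF\tG\nE\tF\tT\nE\tF\tK\n")

# Lines coming from a file object still end in "\n", so strip it.
kept = [line.rstrip("\n") for line in first_of_each_run(fh)]
print(kept)
```

Because the helper accepts any iterable of lines, the same function works with s.splitlines() or an open file.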

Note that the above solution works only if the data is sorted by the first two columns, because groupby only groups consecutive rows with equal keys. If that's not the case, you'll have to use a set instead:

>>> from operator import itemgetter
>>> ig = itemgetter(0, 1)  # Pass any column numbers you want; note that indexing starts at 0
>>> s = '''A    B    C
A    B    D
E    F    G
E    F    T
E    F    K
A    B    F'''     
>>> seen = set()
>>> data = []
>>> for line in s.splitlines():
...     key = ig(line.split())
...     if key not in seen:
...         data.append(line)
...         seen.add(key)
...         
>>> data
['A    B    C', 'E    F    G']
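To cover the follow-up about writing to output.txt and choosing the columns, here is a minimal sketch built on the same set-based idea. The function name dedupe_by_columns and its columns and sep parameters are assumptions for illustration, and it assumes every non-blank row has at least as many columns as the largest index requested. The demo writes to a temporary directory so it can run anywhere.

```python
import os
import tempfile


def dedupe_by_columns(src_path, dst_path, columns=(0, 1), sep="\t"):
    """Copy src_path to dst_path, keeping only the first row seen for each
    combination of values in `columns` (0-based indexes).

    Hypothetical helper; assumes each non-blank row has enough columns.
    """
    seen = set()
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            fields = line.rstrip("\n").split(sep)
            key = tuple(fields[i] for i in columns)
            if key not in seen:
                seen.add(key)
                dst.write(line)


# Demo with the sample data from the answer, in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "input.txt")
    dst = os.path.join(tmp, "output.txt")
    with open(src, "w") as f:
        f.write("A\tB\tC\nA\tB\tD\nE\tF\tG\nE\tF\tT\nE\tF\tK\nA\tB\tF\n")
    dedupe_by_columns(src, dst, columns=(0, 1))
    with open(dst) as f:
        result = f.read()

print(result)
```

To deduplicate on the 3rd and 5th columns instead, you would pass columns=(2, 4), since the indexes are 0-based. Unlike the groupby approach, this does not need the input to be sorted.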
