How to get a set of unique values from a list of repeating values

Problem description

I need to parse a large log file (a flat file) that contains two columns of values (column-A and column-B).

The values in both columns repeat. For each unique value in column-A, I need to find the set of column-B values that appear with it.

Can this be done with Unix shell commands, or do I need to write a Perl or Python script? What are the ways to do this?

xxxA 2
xxxA 1
xxxB 2
XXXC 3
XXXA 3
xxxD 4

Output:

xxxA - 2,1,3
xxxB - 2
xxxC - 3
xxxD - 4

Recommended answer

I would use a Python dictionary in which the keys are the column A values and each value is a Python built-in set holding the column B values seen for that key:

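def parse_the_file():
    # Bind the str methods to local names once, outside the loop.
    lower = str.lower
    split = str.split
    d = {}
    with open('f.txt') as f:
        for line in f:
            fields = split(line)
            if len(fields) != 2:
                continue  # skip blank or malformed lines
            A, B = fields
            try:
                d[lower(A)].add(B)
            except KeyError:
                # First occurrence of this key: start a one-element set.
                # set(B) would split a multi-character B into single characters.
                d[lower(A)] = {B}

    for a in d:
        print("%s - %s" % (a, ",".join(d[a])))

if __name__ == "__main__":
    parse_the_file()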

The advantage of using a dictionary is that there is exactly one key per column A value. The advantage of using a set is that the column B values stored under each key are unique.
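
As an illustration of the dictionary-of-sets idea, here is a minimal sketch that uses collections.defaultdict from the standard library, so a missing key starts out as an empty set. This is a variant, not the answer's code: the function name group_unique and the in-memory sample list are hypothetical, and skipping malformed lines is an assumption about how such input should be handled.

```python
from collections import defaultdict


def group_unique(lines):
    """Map each lowercased column A value to the set of its column B values."""
    groups = defaultdict(set)  # a missing key starts out as an empty set
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # assumption: silently skip blank or malformed lines
        a, b = fields
        groups[a.lower()].add(b)
    return dict(groups)


# Hypothetical in-memory copy of the question's sample input.
sample = ["xxxA 2", "xxxA 1", "xxxB 2", "XXXC 3", "XXXA 3", "xxxD 4"]
result = group_unique(sample)
for key in sorted(result):
    # Sorting keys and values makes the printed order reproducible.
    print("%s - %s" % (key, ",".join(sorted(result[key]))))
```

Because the function accepts any iterable of lines, an open file object can be passed in directly, so a large log does not have to be read into memory first.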

Efficiency notes:

  • Using try/except is typically cheaper than an if/else check for the initial case when most keys have already been seen, because the exception is raised only the first time each key appears.
  • Binding the str functions to local names outside the loop is more efficient than looking them up on every iteration inside the loop.
  • Depending on the proportion of new A values versus repeated A values in the file, consider computing a = lower(A) once before the try/except statement, so that lower is not called a second time in the except branch.
  • I used a function because accessing local variables is more efficient in Python than accessing global variables.
  • Some of these performance tips are from here
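
The notes above can be checked rather than taken on faith. The following is a rough sketch that times the try/except form against the if/else form using the standard timeit module. The function names group_try and group_if and the synthetic workload are hypothetical, and the relative timings will depend on the Python version and on how often new keys appear.

```python
import timeit


def group_try(pairs):
    lower = str.lower
    d = {}
    for A, B in pairs:
        a = lower(A)  # lowercase once, before the try/except
        try:
            d[a].add(B)
        except KeyError:
            d[a] = {B}
    return d


def group_if(pairs):
    lower = str.lower
    d = {}
    for A, B in pairs:
        a = lower(A)
        if a in d:
            d[a].add(B)
        else:
            d[a] = {B}
    return d


# Hypothetical workload: 20 distinct keys, each repeated many times.
pairs = [("key%d" % (i % 20), str(i % 7)) for i in range(20000)]

if __name__ == "__main__":
    for fn in (group_try, group_if):
        seconds = timeit.timeit(lambda: fn(pairs), number=20)
        print("%s: %.3f s" % (fn.__name__, seconds))
```

Both functions build the same dictionary, so only the timings differ. Changing the ratio of distinct keys to total rows in pairs shows how the balance shifts when new keys are common.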

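Testing the code above on your input example yields the following (the order of the lines and of the values within each line may differ, since sets are unordered and dictionary ordering depends on the Python version):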

xxxd - 4
xxxa - 1,3,2
xxxb - 2
xxxc - 3
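
The output requested in the question lists the B values in the order they first appear (2,1,3), which a set does not preserve. If that order matters, one possible sketch is to use a plain dict as an insertion-ordered set. This relies on dictionaries keeping insertion order, which is guaranteed from Python 3.7 onward. The function name group_in_order is hypothetical, and keys are lowercased as in the answer above.

```python
def group_in_order(lines):
    """Group column B values by lowercased column A value, keeping first-seen order."""
    groups = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # assumption: skip blank or malformed lines
        a, b = fields
        # A dict with dummy values acts as an insertion-ordered set (Python 3.7+).
        groups.setdefault(a.lower(), {})[b] = None
    return ["%s - %s" % (a, ",".join(bs)) for a, bs in groups.items()]


# Hypothetical in-memory copy of the question's sample input.
sample = ["xxxA 2", "xxxA 1", "xxxB 2", "XXXC 3", "XXXA 3", "xxxD 4"]
for row in group_in_order(sample):
    print(row)
# xxxa - 2,1,3
# xxxb - 2
# xxxc - 3
# xxxd - 4
```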
