Python lists, csv, duplication removal

Question

I'm very new to Python, and I can't work out the right approach to this problem. Any help, or a link to a helpful tutorial, would be highly appreciated, as I have to do this kind of thing from time to time.

I have a CSV file that I need to reformat / modify a bit.

I need to store the number of samples that each gene occurs in.

Input file:

AHCTF1: Sample1, Sample2, Sample4
AHCTF1: Sample2, Sample7, Sample12
AHCTF1: Sample5, Sample6, Sample7

Desired result:

AHCTF1 in 7 samples (Sample1, Sample2, Sample4, Sample5, Sample6, Sample7, Sample12)

Code:

f = open("/CSV-sorted.csv")
gene_prev = ""

hit_list = []

csv_f = csv.reader(f)

for lines in csv_f:

    #time.sleep(0.1)
    gene = lines[0]
    sample = lines[11].split(",")
    repeat = lines[8]

    for samples in sample:
        hit_list.append(samples)

    if gene == gene_prev:

        for samples in sample:

            hit_list.append(samples)

        print gene
        print hit_list
        print set(hit_list)
        print "samples:", len(set(hit_list))


    hit_list = []

    gene_prev = gene

So, in a nutshell, I'd like to combine the hits for every gene and make a set from them to remove the duplicates.

Maybe a dictionary would be the way to do it: save the gene as a key and add the samples as values?

I found this, which is similar and useful: How can I combine dictionaries with the same keys in python?

Recommended answer

The standard way to remove duplicates is to convert to a set.
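As a minimal illustration of that idea (the sample names are taken from the first two input lines in the question), building a set from a list drops the repeated entries:

```python
# Sample names from the first two input lines of the question.
hits = ["Sample1", "Sample2", "Sample4", "Sample2", "Sample7", "Sample12"]

unique_hits = set(hits)   # duplicates collapse; the original order is not kept
print(len(hits))          # 6
print(len(unique_hits))   # 5
```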

However, I think there are some problems with the way you're reading the file. First: it isn't really a CSV file (there is a colon between the first two fields). Second, what are these lines supposed to do?

gene = lines[0]
sample = lines[11].split(",")
repeat = lines[8]
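To see why the colon matters, here is a small sketch of what `csv.reader` does with one line of the input shown in the question (assuming that really is what the file looks like). It splits on commas only, so the gene name stays glued to the first sample:

```python
import csv
import io

# One line from the question's input, read from an in-memory file.
line = io.StringIO("AHCTF1: Sample1, Sample2, Sample4\n")

row = next(csv.reader(line))
print(row)  # ['AHCTF1: Sample1', ' Sample2', ' Sample4']
```

With only three fields in such a row, indexing `lines[8]` or `lines[11]` would raise an `IndexError`.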

If I were writing this, I would replace the ":" with another ",". With that modification, and using a dictionary of sets, your code would look something like:

# Read in csv file and convert to list of list of entries. Use with so that 
# the file is automatically closed when we are done with it
csvlines = []
with open("CSV-sorted.csv") as f:
    for line in f:
        # Use strip() to clean up trailing whitespace, use split() to split
        # on commas.
        a = [entry.strip() for entry in line.split(',')]
        csvlines.append(a)

# I'll print it here so you can see what it looks like:
print(csvlines)



# Next up: converting our list of lists to a dict of sets.

# Create empty dict
sample_dict = {}

# Fill in the dict
for line in csvlines:
    gene = line[0] # gene is first entry
    samples = set(line[1:]) # rest of the entries are samples

    # If this gene is in the dict already then join the two sets of samples
    if gene in sample_dict:
        sample_dict[gene] = sample_dict[gene].union(samples)

    # otherwise just put it in
    else:
        sample_dict[gene] = samples


# Now you can print the dictionary:
print(sample_dict)

The output is:

[['AHCTF1', 'Sample1', 'Sample2', 'Sample4'], ['AHCTF1', 'Sample2', 'Sample7', 'Sample12'], ['AHCTF1', 'Sample5', 'Sample6', 'Sample7']]
{'AHCTF1': {'Sample12', 'Sample1', 'Sample2', 'Sample5', 'Sample4', 'Sample7', 'Sample6'}}

where the second line is your dictionary.
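If you would rather not edit the file, a variation on the same idea is to split each line on the first colon and collect the samples with `collections.defaultdict(set)`. This is only a sketch under assumptions: the helper names `count_samples` and `sample_number` are made up for illustration, the lines are assumed to look exactly like the input in the question, and sample names are assumed to end in a number (used only to sort the printout).

```python
from collections import defaultdict
import re


def count_samples(lines):
    """Map each gene to the set of samples it appears in (hypothetical helper)."""
    samples_by_gene = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first colon only: gene on the left, samples on the right.
        gene, _, rest = line.partition(":")
        # update() adds every sample; the set silently ignores duplicates.
        samples_by_gene[gene.strip()].update(
            s.strip() for s in rest.split(",") if s.strip()
        )
    return samples_by_gene


def sample_number(name):
    """Sort key (assumption): the trailing number of a name like 'Sample12'."""
    match = re.search(r"(\d+)$", name)
    return int(match.group(1)) if match else 0


# The input from the question, inlined so the example runs without a file.
data = [
    "AHCTF1: Sample1, Sample2, Sample4",
    "AHCTF1: Sample2, Sample7, Sample12",
    "AHCTF1: Sample5, Sample6, Sample7",
]

for gene, samples in count_samples(data).items():
    ordered = sorted(samples, key=sample_number)
    print("{} in {} samples ({})".format(gene, len(ordered), ", ".join(ordered)))
# AHCTF1 in 7 samples (Sample1, Sample2, Sample4, Sample5, Sample6, Sample7, Sample12)
```

To run this on the real file, pass the open file object instead of `data`, since iterating over a file yields its lines.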
