Partial Intersection of Specific Columns in Large CSV Files

Problem description

I'm working on a script to find the intersection between large CSV files based on the contents of only two specific columns in each file: Query ID and Subject ID.

The files come in Left and Right pairs, one pair per species. Each file looks something like this:

Similarity (%)  Query ID    Subject ID
100.000000  BRADI5G01462.1_1    BRADI5G16060.1_36
90.000000   BRADI5G02480.1_5    NCRNA_11838_6689
100.000000  BRADI5G06067.1_8    NCRNA_32597_1525
90.000000   BRADI5G08380.1_12   NCRNA_32405_1776
100.000000  BRADI5G09460.2_17   BRADI5G16060.1_36
90.909091   BRADI5G10680.1_20   NCRNA_2505_6156

The Right files are always longer and larger than the Left ones.

Here's the code snippet I have so far:

import csv
with open('#Left(Brachypodium_Japonica).csv', 'r',newline='') as Afile, open('#Right(Brachypodium_Japonica).csv', 'r',newline='') as Bfile, open('Intrsc-(Brachypodium_Japonica).csv','w',newline='') as Intrsct:
    reader1=csv.reader(Afile,delimiter="\t",skipinitialspace=True)
    next(reader1,None)
    reader2=csv.reader(Bfile,delimiter="\t",skipinitialspace=True)
    next(reader2,None)
    Intrsct = csv.writer(Intrsct, delimiter="\t",skipinitialspace=True)
    Intrsct.writerow(["Query ID","Subject ID","Left Similarity (%)","Right Similarity (%)"])
    for row1 ,row2 in zip(Afile,Bfile):
            if ((row1[1] in row2[1] and row1[2] in row2[2])):
                Intrsct.writerow([row1.strip().split('\t')[1],row1.strip().split('\t')[2],row1.strip().split('\t')[0],row2.strip().split('\t')[0]])

The code above iterates over the records of the two files simultaneously and searches for the fields at index 1 and 2 of each row of the first file in the same fields of the second file. That is, it compares column-wise (Query ID against Query ID, Subject ID against Subject ID) and writes the matches to a new file in a certain order.

The results are not exactly what I was expecting; apparently it finds matches for the first wanted column only. I tried to trace the procedure manually and found that BRADI5G02480.1_5, for instance, exists in both files, but NCRNA_11838_6689 exists only on the Left side, not the Right!

Aren't they supposed to be mirror images of each other, aside from the numerical values?

I used this thread to write the script, but it compares line by line and doesn't check the rest of the column contents for matches.

Also, I found this, but it uses dictionaries and lists, which isn't suitable for files of this size.

To handle the simultaneous iteration I used this thread, but what it says about handling files of different sizes wasn't really clear to me, so I haven't tried it yet.

I would really appreciate it if someone could tell me what I'm missing here. Is the code correct, or am I using the in condition wrong?

Please, I really need help with this. Thanks in advance :)

Solution

The following solution is a copy of my answer given to your other question, and should hopefully give you an idea how to integrate it with your current solution.

The script reads two (or more) CSV files in and writes the intersection of row entries to a new CSV file. By that I mean if row1 in input1.csv is found anywhere in input2.csv, the row is written to the output, and so on.

import csv

files = ["input1.csv", "input2.csv"]
ldata = []

for file in files:
    with open(file, "r") as f_input:
        csv_input = csv.reader(f_input, delimiter="\t", skipinitialspace=True)
        set_rows = set()
        for row in csv_input:
            set_rows.add(tuple(row))
        ldata.append(set_rows)

with open("Intersection(Brachypodium_Japonica).csv", "wb") as f_output:
    csv_output = csv.writer(f_output, delimiter="\t", skipinitialspace=True)
    csv_output.writerows(set.intersection(*ldata))

You will need to add your own file-name handling; the fixed names made the script easier to test. Tested using Python 2.7. On Python 3, open the output file with "w" and newline="" instead of "wb".
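Note that the script above intersects whole rows, so two rows whose similarity values differ will never match. In the question's loop, zip(Afile, Bfile) also yields raw text lines in lockstep, so row1[1] is a single character of the line rather than the Query ID field, and each Left line is only ever compared with the Right line at the same position. As a rough sketch for Python 3, not a drop-in replacement, the comparison could instead be keyed on the (Query ID, Subject ID) pair. The function name intersect_on_ids and its parameters are made up for illustration, and the sketch assumes tab-delimited files with a header row and columns in the order shown in the question.

```python
import csv


def intersect_on_ids(left_path, right_path, out_path):
    """Write rows whose (Query ID, Subject ID) pair occurs in both files.

    Hypothetical helper: assumes tab-delimited input with one header row
    and columns ordered as Similarity (%), Query ID, Subject ID.
    Returns the number of matching rows written.
    """
    # Index only the smaller (Left) file: (query, subject) -> similarity.
    # If a pair repeats in the Left file, the last occurrence wins.
    left = {}
    with open(left_path, "r", newline="") as f_left:
        reader = csv.reader(f_left, delimiter="\t")
        next(reader, None)  # skip the header row
        for row in reader:
            if len(row) >= 3:
                left[(row[1].strip(), row[2].strip())] = row[0].strip()

    matches = 0
    # Stream the larger (Right) file row by row instead of loading it.
    with open(right_path, "r", newline="") as f_right, \
            open(out_path, "w", newline="") as f_out:
        reader = csv.reader(f_right, delimiter="\t")
        next(reader, None)  # skip the header row
        writer = csv.writer(f_out, delimiter="\t")
        writer.writerow(["Query ID", "Subject ID",
                         "Left Similarity (%)", "Right Similarity (%)"])
        for row in reader:
            if len(row) < 3:
                continue  # ignore blank or malformed lines
            key = (row[1].strip(), row[2].strip())
            if key in left:
                writer.writerow([key[0], key[1], left[key], row[0].strip()])
                matches += 1
    return matches
```

Only the Left file is held in memory, which is the smaller of the two according to the question; the Right file is read one row at a time. Whether a dictionary of the Left file fits in memory depends on the actual file sizes, which the question does not state.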
