How to filter overlapping rows in a big file in Python

Problem description

I am trying to filter overlapping rows in a big file in Python. The overlap threshold is 25%. In other words, for any two rows that remain, the number of elements in their intersection must be no more than 0.25 times the number of elements in their union. If the ratio is more than 0.25, one of the two rows is deleted. Suppose I have a big file with 1,000,000 rows in total, and the first 5 rows are as follows:

c6 c24 c32 c54 c67
c6 c24 c32 c51 c68 c78
c6 c32 c54 c67
c6 c32 c55 c63 c85 c94 c75
c6 c32 c53 c67

The intersection of the 1st and 2nd rows has 3 elements (c6, c24, c32), and their union has 8 elements (c6, c24, c32, c54, c67, c51, c68, c78). The overlap degree is 3/8 = 0.375 > 0.25, so the 2nd row is deleted. The same applies to the 3rd and 5th rows. The final answer is the 1st and 4th rows:

c6 c24 c32 c54 c67
c6 c32 c55 c63 c85 c94 c75
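
The overlap degree used above is the size of the intersection divided by the size of the union. As a minimal sketch, assuming Python 3 and whitespace-separated rows, the arithmetic for the five sample rows can be checked like this (the helper name `overlap` is only illustrative, not part of the question):

```python
def overlap(row_a, row_b):
    """Return |A & B| / |A | B| for two whitespace-separated rows."""
    a, b = set(row_a.split()), set(row_b.split())
    union = a | b
    # Two empty rows have an empty union; treat their overlap as 0.
    return len(a & b) / len(union) if union else 0.0


rows = [
    "c6 c24 c32 c54 c67",
    "c6 c24 c32 c51 c68 c78",
    "c6 c32 c54 c67",
    "c6 c32 c55 c63 c85 c94 c75",
    "c6 c32 c53 c67",
]

# Compare the 1st row with every later row.
for number, row in enumerate(rows[1:], start=2):
    print(number, overlap(rows[0], row))
# 2 0.375
# 3 0.8
# 4 0.2
# 5 0.5
```

Only the 4th row stays at or below 0.25 against the 1st row, which matches the expected result.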

The pseudocode is as follows:

for i = 1:(n-1)    # n is the number of rows in the big file
    for j = (i+1):n
        if the overlap degree of the ith row and the jth row is more than 0.25
            delete the jth row from the big file
        end
    end
end

How can I solve this problem in Python? Thank you!

Solution

The tricky part is that you have to modify the list you're iterating over and still keep track of two indices. One way to do that is to go backwards, since deleting an item with index equal to or larger than the indices you keep track of will not influence them.

This code is untested, but you get the idea:

# Read every row as a set of elements (written for Python 3).
with open("file.txt") as fileobj:
    sets = [set(line.split()) for line in fileobj]

# Walk both indices backwards so that deleting sets[second_index]
# never shifts an index that is still to be visited.
for first_index in range(len(sets) - 2, -1, -1):
    for second_index in range(len(sets) - 1, first_index, -1):
        union = sets[first_index] | sets[second_index]
        intersection = sets[first_index] & sets[second_index]
        # Same test as intersection / union > 0.25, but safe for empty rows.
        if len(intersection) > 0.25 * len(union):
            del sets[second_index]

with open("output.txt", "w") as fileobj:
    for set_ in sets:
        # Sets are unordered, so sort each row by the number after "c".
        output = " ".join(sorted(set_, key=lambda x: int(x[1:])))
        fileobj.write("{0}\n".format(output))

Since it is obvious how to sort the elements of each line, we can regenerate each line from its set like this. If the order were somehow custom, we would have to pair each set with the line it was read from, so that at the end we could write back exactly the line that was read instead of regenerating it.
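
The pairing idea can be sketched as follows. This is an illustrative variant, not the answer's code: the function name `filter_overlapping` and its `threshold` parameter are assumptions, and it assumes Python 3. It keeps each original line next to its set, so the 4th sample row is written back as `c6 c32 c55 c63 c85 c94 c75` rather than re-sorted. It also walks forward and compares each row only against rows already kept, which follows the question's pseudocode directly. The backward loops give the same result on the sample, but they can remove more rows in a chain where row A overlaps B and B overlaps C while A and C do not overlap: the backward version deletes C because of B and then deletes B, whereas the pseudocode keeps A and C.

```python
def filter_overlapping(lines, threshold=0.25):
    """Keep each line unless it overlaps an already kept line too much."""
    kept = []  # (original text, set of elements) pairs, in input order
    for line in lines:
        elements = set(line.split())
        if not elements:
            continue  # assumption: blank lines are simply dropped
        too_similar = any(
            len(elements & other) > threshold * len(elements | other)
            for _, other in kept
        )
        if not too_similar:
            kept.append((line.rstrip("\n"), elements))
    return [text for text, _ in kept]


rows = [
    "c6 c24 c32 c54 c67",
    "c6 c24 c32 c51 c68 c78",
    "c6 c32 c54 c67",
    "c6 c32 c55 c63 c85 c94 c75",
    "c6 c32 c53 c67",
]
for text in filter_overlapping(rows):
    print(text)
# c6 c24 c32 c54 c67
# c6 c32 c55 c63 c85 c94 c75
```

Because the function only iterates over `lines`, an open file object can be passed in place of the list, and the returned lines can be written to `output.txt` with a trailing newline each. The comparison is still pairwise, so the running time grows with the number of input rows times the number of kept rows.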
