How to filter overlap of two rows in a big file in Python

Problem description

I am trying to filter overlapping rows in a big file in Python.

The overlap threshold between one pair of rows and another pair of rows is 25%. In other words, a pair is removed when a*b/(c+d-a*b) > 0.25, where a is the number of elements shared by the 1st and 3rd rows, b is the number of elements shared by the 2nd and 4th rows, c is the number of elements in the 1st row multiplied by the number of elements in the 2nd row, and d is the number of elements in the 3rd row multiplied by the number of elements in the 4th row. If the overlap degree is more than 0.25, the 3rd and 4th rows are deleted. For example, suppose I have a big file with 1,000,000 rows in total, whose first 6 rows are as follows:

c6 c24 c32 c54 c67
k6 k12 k33 k63 k62
c6 c24 c32 c51 c68 c78
k6 k12 k24 k63
c6 c32 c24 c63 c67 c54 c75
k6 k12 k33 k63

For the first and second pairs of rows, a=3 (c6, c24, c32), b=3 (k6, k12, k63), c=25 and d=24, so a*b/(c+d-a*b) = 9/40 < 0.25 and the 3rd and 4th rows are not deleted. Next, the overlap degree of the first and third pairs of rows is 5*4/(25+28-5*4) = 0.61 > 0.25, so the third pair of rows is deleted.
The final answer is the first and second pairs of rows:

c6 c24 c32 c54 c67
k6 k12 k33 k63 k62
c6 c24 c32 c51 c68 c78
k6 k12 k24 k63
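
The arithmetic in this example can be checked with a short sketch. The helper name `overlap_degree` and the hard-coded rows are illustrative assumptions; they are not part of the original question.

```python
def overlap_degree(pair_a, pair_b):
    """Return a*b / (c + d - a*b) for two pairs of rows (each row is a set)."""
    a = len(pair_a[0] & pair_b[0])       # elements shared by the first rows
    b = len(pair_a[1] & pair_b[1])       # elements shared by the second rows
    c = len(pair_a[0]) * len(pair_a[1])  # size product of the first pair
    d = len(pair_b[0]) * len(pair_b[1])  # size product of the second pair
    return (a * b) / float(c + d - a * b)


# The six sample rows from the question (hard-coded for illustration).
rows = [
    "c6 c24 c32 c54 c67",
    "k6 k12 k33 k63 k62",
    "c6 c24 c32 c51 c68 c78",
    "k6 k12 k24 k63",
    "c6 c32 c24 c63 c67 c54 c75",
    "k6 k12 k33 k63",
]

# Group consecutive rows into (first row, second row) pairs of sets.
pairs = [
    (set(rows[i].split()), set(rows[i + 1].split()))
    for i in range(0, len(rows), 2)
]

first_vs_second = overlap_degree(pairs[0], pairs[1])  # 9 / 40
first_vs_third = overlap_degree(pairs[0], pairs[2])   # 20 / 33

print(first_vs_second, first_vs_third)
```

Only the second value exceeds 0.25, which matches the conclusion that the third pair is the one to delete.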

The pseudocode is as follows:

for i=1:(n-1)    # n is a half of the number of rows of the big file
    for j=(i+1):n  
        if  overlap degrees of the ith two rows and jth two rows is more than 0.25
          delete the jth two rows from the big file
        end
    end
end

My Python code is as follows, but it is wrong. How can I fix it?

with open("iuputfile.txt") as fileobj: 
    sets = [set(line.split()) for line in fileobj]
    for first_index in range(len(sets) - 4, -2, -2):
        c=len(sets[first_index])*len(sets[first_index+1])
        for second_index in range(len(sets)-2 , first_index, -2):
            d=len(sets[second_index])*len(sets[second_index+1])
            ab = len(sets[first_index] | sets[second_index])*len(sets[first_index+1] | sets[second_index+1])
            if (ab/(c+d-ab))>0.25:
                del sets[second_index]
                del sets[second_index+1]
with open("outputfile.txt", "w") as fileobj:
    for set_ in sets:
        # order of the set is undefined, so we need to sort each set
        output = " ".join(set_)
        fileobj.write("{0}\n".format(output))

A similar problem can be found at https://stackoverflow.com/questions/17321275/

How can I modify that code to solve this problem in Python? Thank you!

Solution

I've been thinking about how to solve this problem in a better way, without all the reversing and indexing and stuff, and I've come up with a solution that's longer and more involved, but easier to read, prettier, more maintainable and extendable, IMHO.

First we need a special kind of list that we can iterate over "correctly", even if an item in it is deleted. Here is a blog post going into more detail about how lists and iterators work, and reading it will help you understand what's going on here:

class SmartList(list):
    def __init__(self, *args, **kwargs):
        super(SmartList, self).__init__(*args, **kwargs)
        self.iterators = []

    def __iter__(self):
        return SmartListIter(self)

    def __delitem__(self, index):
        super(SmartList, self).__delitem__(index)
        for iterator in self.iterators:
            iterator.item_deleted(index)

We extend the built-in list and make it return a custom iterator instead of the default. Whenever an item in the list is deleted, we call the item_deleted method of every item in the self.iterators list. Here's the code for SmartListIter:

class SmartListIter(object):
    def __init__(self, smartlist, index=0):
        self.smartlist = smartlist
        smartlist.iterators.append(self)
        self.index = index

    def __iter__(self):
        return self

    def next(self):
        try:
            item = self.smartlist[self.index]
        except IndexError:
            self.smartlist.iterators.remove(self)
            raise StopIteration
        index = self.index
        self.index += 1
        return (index, item)

    __next__ = next  # Python 3 looks up __next__; Python 2 uses next

    def item_deleted(self, index):
        if index >= self.index:
            return
        self.index -= 1

So the iterator adds itself to the list of iterators, and removes itself from the same list when it is done. If an item with an index less than the current index is deleted, we decrement the current index by one so that we won't skip an item like a normal list iterator would do.
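
To see the skipping behaviour that the adjusted index avoids, here is a minimal sketch using a plain built-in list. The variable names are illustrative and not taken from the code above.

```python
items = ["a", "b", "c", "d"]
seen = []

for item in items:
    seen.append(item)
    if item == "b":
        # Delete an element at an index below the iterator's current position.
        # The built-in list iterator does not adjust its internal index,
        # so the element that slides into the current slot is never visited.
        del items[0]

print(seen)   # "c" is missing from the visited items
print(items)  # yet "c" is still in the list
```

After the deletion, "c" moves into the slot the iterator has already passed, so the loop goes straight on to "d".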

The next method returns a tuple (index, item) instead of just the item, because that makes things easier when it's time to use these classes -- we won't have to mess around with enumerate.

So that should take care of having to go backwards, but we still have to use a lot of indexes to juggle four different lines in every loop. Since the lines go together in pairs, let's make a class for that:

class LinePair(object):
    def __init__(self, pair):
        self.pair = pair
        self.sets = [set(line.split()) for line in pair]
        self.c = len(self.sets[0]) * len(self.sets[1])

    def overlap(self, other):
        ab = float(len(self.sets[0] & other.sets[0]) * \
            len(self.sets[1] & other.sets[1]))
        overlap = ab / (self.c + other.c - ab)
        return overlap

    def __str__(self):
        return "".join(self.pair)

The pair attribute is a tuple of two lines read directly from the input file, complete with newlines. We use it later to write that pair back to a file. We also convert both lines to a set and calculate the c attribute, which is a property of every pair of lines. Finally we make a method that will compute the overlap between one LinePair and another. Notice that d is gone, since that is just the c attribute of the other pair.

Now for the grand finale:

try:
    from itertools import izip  # Python 2
except ImportError:
    izip = zip  # Python 3: the built-in zip is already lazy

with open("iuputfile.txt") as fileobj:
    pairs = SmartList([LinePair(pair) for pair in izip(fileobj, fileobj)])

for first_index, first_pair in pairs:
    for second_index, second_pair in SmartListIter(pairs, first_index + 1):
        if first_pair.overlap(second_pair) > 0.25:
            del pairs[second_index]

with open("outputfile.txt", "w") as fileobj:
    for index, pair in pairs:
        fileobj.write(str(pair))

Notice how easy it is to read the central loop here, and how short it is. If you need to change this algorithm in the future, it's likely much more easily done with this code than with my other code. The izip is used to group the lines of the input file in pairs, as explained here.
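
The pairing trick works because both arguments are the same iterator, so every tuple consumes two consecutive lines. Here is a minimal sketch of that idea, assuming Python 3's built-in `zip` in place of `izip` and an in-memory `io.StringIO` object standing in for the real input file (the sample lines are made up for illustration):

```python
import io

# An in-memory stand-in for the input file (illustrative data only).
fileobj = io.StringIO(u"c6 c24\nk6 k12\nc7 c25\nk7 k13\n")

# zip pulls one line for the first slot, then the next line for the second
# slot, from the same underlying iterator, so lines come out two at a time.
pairs = list(zip(fileobj, fileobj))

for first_line, second_line in pairs:
    print(repr(first_line), repr(second_line))
```

Each tuple keeps the trailing newlines, which is why the answer can write a pair back with a plain `"".join(self.pair)`.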
