Function to remove duplicates from a list of tuples in python

Problem description

In the function sqlPull() I pull the most recent 5 entries from a MySQL database every 5 seconds. In the second function dupCatch() I am attempting to remove duplicates that show up in SQL pull n+1 when compared to pull n. I want to save only the unique list of tuples, but right now the function is printing the same list of tuples 5 times every five seconds.

In English, what I am attempting to do with dupCatch() is take the data from sqlPull(), initialize an empty list, and say: for each tuple in the variable data, if that tuple is not in the empty list, add it to the newData variable; if not, set lastPull equal to the non-unique tuples.

Obviously, my function is wrong, but I'm not sure how to fix it.

import mysql.connector
import datetime
import requests
from operator import itemgetter
import time

run = True

def sqlPull():
    connection = mysql.connector.connect(user='XXX', password='XXX', host='XXXX', database='MeshliumDB')
    cursor = connection.cursor()
    cursor.execute("SELECT TimeStamp, MAC, RSSI FROM wifiscan ORDER BY TimeStamp DESC LIMIT 5;")
    data = cursor.fetchall()
    connection.close()
    time.sleep(5)
    return data

def dupCatch():
    data = sqlPull()
    lastPull = []
    for (TimeStamp, MAC, RSSI) in data:
        if (TimeStamp, MAC, RSSI) not in lastPull:
            newData = data
        else:
            lastPull = data
        print newData

while run == True:
    dupCatch()

This is what the output I am getting now looks like:

[(datetime.datetime(2013, 11, 14, 20, 28, 54), u'E0:CB:1D:36:EE:9D', u' 20'), (datetime.datetime(2013, 11, 14, 20, 28, 53), u'00:1E:8F:75:82:35', u' 21'), (datetime.datetime(2013, 11, 14, 20, 28, 52), u'78:E4:00:0C:50:DF', u' 33'), (datetime.datetime(2013, 11, 14, 20, 28, 52), u'00:1E:4C:03:C0:66', u' 26'), (datetime.datetime(2013, 11, 14, 20, 28, 52), u'78:E4:00:0C:50:DF', u' 33')]
[(datetime.datetime(2013, 11, 14, 20, 28, 54), u'E0:CB:1D:36:EE:9D', u' 20'), (datetime.datetime(2013, 11, 14, 20, 28, 53), u'00:1E:8F:75:82:35', u' 21'), (datetime.datetime(2013, 11, 14, 20, 28, 52), u'78:E4:00:0C:50:DF', u' 33'), (datetime.datetime(2013, 11, 14, 20, 28, 52), u'00:1E:4C:03:C0:66', u' 26'), (datetime.datetime(2013, 11, 14, 20, 28, 52), u'78:E4:00:0C:50:DF', u' 33')]
[(datetime.datetime(2013, 11, 14, 20, 28, 54), u'E0:CB:1D:36:EE:9D', u' 20'), (datetime.datetime(2013, 11, 14, 20, 28, 53), u'00:1E:8F:75:82:35', u' 21'), (datetime.datetime(2013, 11, 14, 20, 28, 52), u'78:E4:00:0C:50:DF', u' 33'), (datetime.datetime(2013, 11, 14, 20, 28, 52), u'00:1E:4C:03:C0:66', u' 26'), (datetime.datetime(2013, 11, 14, 20, 28, 52), u'78:E4:00:0C:50:DF', u' 33')]
[(datetime.datetime(2013, 11, 14, 20, 28, 54), u'E0:CB:1D:36:EE:9D', u' 20'), (datetime.datetime(2013, 11, 14, 20, 28, 53), u'00:1E:8F:75:82:35', u' 21'), (datetime.datetime(2013, 11, 14, 20, 28, 52), u'78:E4:00:0C:50:DF', u' 33'), (datetime.datetime(2013, 11, 14, 20, 28, 52), u'00:1E:4C:03:C0:66', u' 26'), (datetime.datetime(2013, 11, 14, 20, 28, 52), u'78:E4:00:0C:50:DF', u' 33')]
[(datetime.datetime(2013, 11, 14, 20, 28, 54), u'E0:CB:1D:36:EE:9D', u' 20'), (datetime.datetime(2013, 11, 14, 20, 28, 53), u'00:1E:8F:75:82:35', u' 21'), (datetime.datetime(2013, 11, 14, 20, 28, 52), u'78:E4:00:0C:50:DF', u' 33'), (datetime.datetime(2013, 11, 14, 20, 28, 52), u'00:1E:4C:03:C0:66', u' 26'), (datetime.datetime(2013, 11, 14, 20, 28, 52), u'78:E4:00:0C:50:DF', u' 33')]
[(datetime.datetime(2013, 11, 14, 20, 28, 54), u'E0:CB:1D:36:EE:9D', u' 20'), (datetime.datetime(2013, 11, 14, 20, 28, 53), u'00:1E:8F:75:82:35', u' 21'), (datetime.datetime(2013, 11, 14, 20, 28, 52), u'78:E4:00:0C:50:DF', u' 33'), (datetime.datetime(2013, 11, 14, 20, 28, 52), u'00:1E:4C:03:C0:66', u' 26'), (datetime.datetime(2013, 11, 14, 20, 28, 52), u'78:E4:00:0C:50:DF', u' 33')]


Recommended answer

Rather than try to figure out what your code is trying to do and fix it, let's go back to your English description:

In English, what I am attempting to do with dupCatch() is take the data from sqlPull(), initialize an empty list, and say: for each tuple in the variable data, if that tuple is not in the empty list, add it to the newData variable; if not, set lastPull equal to the non-unique tuples.

So:

seen = set()   # persists across calls: every row we have ever seen

def dupCatch():
    data = sqlPull()
    new_data = []
    for (TimeStamp, MAC, RSSI) in data:
        if (TimeStamp, MAC, RSSI) not in seen:
            seen.add((TimeStamp, MAC, RSSI))
            new_data.append((TimeStamp, MAC, RSSI))
    print new_data

Or, more succinctly:

seen = set()

def dupCatch():
    data = sqlPull()
    newData = [row for row in data if row not in seen]
    seen.update(newData)
    print newData

Either way, the trick here is that we have a set which keeps track of every row we've ever seen. So, for each new row, if it's in that set, we've already seen it and can ignore it; otherwise, it's new, so we keep it and add it to the set for later.

The second version just simplifies things by filtering all 5 rows at once, and then update-ing the set with all of the new ones at once, instead of doing it row by row.
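
To see the cross-pull behaviour concretely, here is a toy run of that second version, with sqlPull() replaced by an explicit data argument and made-up two-tuples standing in for the real rows:

seen = set()

def dupCatch(data):
    newData = [row for row in data if row not in seen]
    seen.update(newData)
    return newData

print dupCatch([(1, 'a'), (2, 'b')])  # [(1, 'a'), (2, 'b')] -- both rows are new
print dupCatch([(2, 'b'), (3, 'c')])  # [(3, 'c')] -- (2, 'b') was already seen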

The reason that seen has to be global is that a global lives forever, across all runs of the function, so we can use it to keep track of every row we've ever seen; if we made it local to the function, it would be new each time, so we'd only be keeping track of rows we've seen in the current batch, which isn't very useful.

In general, globals are bad. However, things like persistent caches are an exception to the "in general" rule. The whole point of them is that they're not local. If you had an object model in mind that made sense, seen would be much better as a member of whatever object dupCatch was a method on than as a global. If you had a good reason to define the function as a closure inside another function, seen would be better as part of that closure. And so on. But otherwise, a global is the best option.
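
For illustration, here is a minimal sketch of that object-based alternative; the DupCatcher class and its catch method are hypothetical names, not something from the original code:

class DupCatcher(object):
    def __init__(self):
        self.seen = set()   # instance state instead of a global

    def catch(self, data):
        # same filter-and-update logic as dupCatch above
        newData = [row for row in data if row not in self.seen]
        self.seen.update(newData)
        return newData

The seen set now lives exactly as long as the DupCatcher instance, which gives the same persistence as the global without putting anything in module scope.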

If you reorganized your code a bit, you could make this even simpler:

def pull():
    # yield rows from sqlPull() forever, one batch after another
    while True:
        for row in sqlPull():
            yield row

# unique_everseen is the recipe from the itertools documentation
for row in unique_everseen(pull()):
    print row

... or even:

from itertools import chain

# iter(sqlPull, None) calls sqlPull() repeatedly, yielding one 5-row batch at a time
for row in unique_everseen(chain.from_iterable(iter(sqlPull, None))):
    print row

See Iterators and the next few tutorial sections, the itertools documentation, and David M. Beazley's presentations to understand what this last version does. But for a novice, you might want to stick with the second version.
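
For reference, the core of the unique_everseen recipe from the itertools documentation looks roughly like this (shown without its optional key argument):

def unique_everseen(iterable):
    # yield each element the first time it appears, remembering
    # every element ever seen across the whole stream
    seen = set()
    for element in iterable:
        if element not in seen:
            seen.add(element)
            yield element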
