Is there a faster alternative to this approach to get the last update message from a list of dicts?
Question
I need to get the last update message from a data stream. Data comes like this:
test_data = [{u'category': u'3',
u'entity': u'entityA',
u'length': u'0',
u'timestamp': u'1562422690'},
{u'category': u'3',
u'entity': u'entityA',
u'length': u'1',
u'timestamp': u'1562422680'},
{u'category': u'3',
u'entity': u'entityB',
u'length': u'2',
u'timestamp': u'1562422691'},
{u'category': u'3',
u'entity': u'entityB',
u'length': u'3',
u'timestamp': u'1562422688'},
{u'category': u'3',
u'entity': u'entityC',
u'length': u'4',
u'timestamp': u'1562422630'},
{u'category': u'3',
u'entity': u'entityC',
u'length': u'5',
u'timestamp': u'1562422645'},
{u'category': u'3',
u'entity': u'entityD',
u'length': u'6',
u'timestamp': u'1562422645'}]
It was suggested here to use:
test_alexander = {entity: sorted([d for d in test_data if d.get('entity') == entity], key=lambda x: x['timestamp'])[-1]
for entity in set(d.get('entity') for d in test_data)}
which returns this (it works exactly as intended):
{u'entityA': {u'category': u'3',
u'entity': u'entityA',
u'length': u'0',
u'timestamp': u'1562422690'},
u'entityB': {u'category': u'3',
u'entity': u'entityB',
u'length': u'2',
u'timestamp': u'1562422691'},
u'entityC': {u'category': u'3',
u'entity': u'entityC',
u'length': u'5',
u'timestamp': u'1562422645'},
u'entityD': {u'category': u'3',
u'entity': u'entityD',
u'length': u'6',
u'timestamp': u'1562422645'}}
The problem is that I have 7k unique "entities" and as many as 7 million list items in "test_data". The above solution takes ages, and I am wondering if there is a faster approach.
Recommended answer
You should be able to do this as a single loop with a single comparison per item. Just keep track of the max timestamp seen so far for each entity as you proceed through the loop:
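from collections import defaultdict

def getMax(test_data):
    # Placeholder with timestamp 0, so any real record compares as newer
    d = defaultdict(lambda: {'timestamp': 0})
    for item in test_data:
        # Timestamps are strings, so compare them as integers
        if int(item['timestamp']) > int(d[item['entity']]['timestamp']):
            d[item['entity']] = item
    return d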
The return value will be a dictionary keyed by entity, holding the record with the largest timestamp for each. It should be significantly faster than sorting or building lists in the loop, because it makes only one pass over the data. Still, 7 million items will take a while.
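If the repeated int() conversions turn out to matter, one possible variation is to remember each entity's best timestamp as an integer, so every record is converted only once. This is a minimal sketch, not a benchmarked solution: the names latest_by_entity and sample are made up for illustration, and it assumes every record has 'entity' and 'timestamp' keys and that timestamps are integer strings.

```python
def latest_by_entity(records):
    """Return {entity: record} with the largest-timestamp record per entity."""
    latest = {}     # entity -> newest record seen so far
    latest_ts = {}  # entity -> that record's timestamp as an int
    for record in records:
        entity = record['entity']
        ts = int(record['timestamp'])  # convert once per record
        # Strict '>' keeps the first record seen when timestamps tie
        if entity not in latest_ts or ts > latest_ts[entity]:
            latest[entity] = record
            latest_ts[entity] = ts
    return latest


# Hypothetical sample data in the same shape as test_data
sample = [
    {'entity': 'entityA', 'length': '0', 'timestamp': '1562422690'},
    {'entity': 'entityA', 'length': '1', 'timestamp': '1562422680'},
    {'entity': 'entityB', 'length': '3', 'timestamp': '1562422688'},
    {'entity': 'entityB', 'length': '2', 'timestamp': '1562422691'},
]
result = latest_by_entity(sample)
print(result['entityA']['length'])  # 0
print(result['entityB']['length'])  # 2
```

Because it only iterates over records, this also works on a generator, so the full stream does not have to be held in a list first. It returns a plain dict, so looking up an entity that never appeared raises KeyError instead of silently creating a placeholder entry.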