Get the last updated dict message from a list of dicts


Problem description

I am trying to get the latest update message for each entity from a data stream. The data comes as a list of dicts, where each dict is an update message for an entity. I need only the latest update per entity. My input is a list of dicts, and the output needs to be a dict of dicts keyed by entity.

Notes: only length gets updated; category stays static. I know which update is the latest because, for a given entity, it has the latest timestamp.

The data looks like this:

[{u'length': u'1',
  u'category': u'3',
  u'entity': u'entityA',
  u'timestamp': u'1562422690'},
 {u'length': u'1.1',
  u'category': u'3',
  u'entity': u'entityA',
  u'timestamp': u'1562422691'},
 {u'length': u'1.2',
  u'category': u'3',
  u'entity': u'entityA',
  u'timestamp': u'1562422692'},
 {u'length': u'0.9',
  u'category': u'3',
  u'entity': u'entityB',
  u'timestamp': u'1562422689'},
 {u'length': u'0.9',
  u'category': u'3',
  u'entity': u'entityB',
  u'timestamp': u'1562422690'}]

I need to manipulate this so that I only get:

{u'entityA':{u'length': u'1.2', 
             u'category': u'3', 
             u'entity': u'entityA', 
             u'timestamp': u'1562422692'},
 u'entityB':{u'length': u'0.9', 
             u'category': u'3', 
             u'entity': u'entityB', 
             u'timestamp': u'1562422690'}}

I am new to Python. I know I could achieve this in SQL with:

select * from
(select
   length, 
   category, 
   entity, 
   timestamp, 
   row_number () over (partition by entity order by timestamp desc) as rnumb
from data
)foo
where rnumb = 1

But I am doing this in Python, and going through SQL from within Python seems like too much of a workaround. Unfortunately, my upstream SQL data source does not support row_number().

Update: I tried both Gillespie's and Alexander's approaches. Gillespie's approach does not seem to work. Alexander's does work but becomes very slow when dealing with a lot of data. Is there a faster alternative?

test_data = [
{u'length': u'0',
  u'category': u'3',
  u'entity': u'entityA',
  u'timestamp': u'1562422690'},
{u'length': u'1',
  u'category': u'3',
  u'entity': u'entityA',
  u'timestamp': u'1562422680'},
{u'length': u'2',
  u'category': u'3',
  u'entity': u'entityB',
  u'timestamp': u'1562422691'},
{u'length': u'3',
  u'category': u'3',
  u'entity': u'entityB',
  u'timestamp': u'1562422688'},
{u'length': u'4',
  u'category': u'3',
  u'entity': u'entityC',
  u'timestamp': u'1562422630'},
{u'length': u'5',
  u'category': u'3',
  u'entity': u'entityC',
  u'timestamp': u'1562422645'}
]

>>> test_gillespie = max(test_data, lambda x: x["timestamp"])
>>> test_gillespie

[{u'category': u'3',
  u'entity': u'entityA',
  u'length': u'0',
  u'timestamp': u'1562422690'},
 {u'category': u'3',
  u'entity': u'entityA',
  u'length': u'1',
  u'timestamp': u'1562422680'},
 {u'category': u'3',
  u'entity': u'entityB',
  u'length': u'2',
  u'timestamp': u'1562422691'},
 {u'category': u'3',
  u'entity': u'entityB',
  u'length': u'3',
  u'timestamp': u'1562422688'},
 {u'category': u'3',
  u'entity': u'entityC',
  u'length': u'4',
  u'timestamp': u'1562422630'},
 {u'category': u'3',
  u'entity': u'entityC',
  u'length': u'5',
  u'timestamp': u'1562422645'}]
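
A possible explanation for the output above, as far as I can tell: the lambda is passed as a second positional argument, so max treats it as another value to compare rather than as a key function, and under Python 2 that comparison hands back the whole list (Python 3 would raise a TypeError instead). The sketch below passes it as key= and compares timestamps as integers; latest_overall is just an illustrative name. Note that even then max returns the single newest message across all entities, not one per entity, so on its own it does not solve the problem.

```python
test_data = [
    {'length': '0', 'category': '3', 'entity': 'entityA', 'timestamp': '1562422690'},
    {'length': '1', 'category': '3', 'entity': 'entityA', 'timestamp': '1562422680'},
    {'length': '2', 'category': '3', 'entity': 'entityB', 'timestamp': '1562422691'},
    {'length': '3', 'category': '3', 'entity': 'entityB', 'timestamp': '1562422688'},
    {'length': '4', 'category': '3', 'entity': 'entityC', 'timestamp': '1562422630'},
    {'length': '5', 'category': '3', 'entity': 'entityC', 'timestamp': '1562422645'},
]

# key= tells max how to rank each dict; int() avoids comparing timestamps as text.
latest_overall = max(test_data, key=lambda d: int(d['timestamp']))

# latest_overall is one dict: the entityB message with timestamp 1562422691.
print(latest_overall)
```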

>>> test_alexander = {entity: sorted([d for d in test_data if d.get('entity') == entity], key=lambda x: x['timestamp'])[-1]
     for entity in set(d.get('entity') for d in test_data)}
>>> test_alexander

{u'entityA': {u'category': u'3',
  u'entity': u'entityA',
  u'length': u'0',
  u'timestamp': u'1562422690'},
 u'entityB': {u'category': u'3',
  u'entity': u'entityB',
  u'length': u'2',
  u'timestamp': u'1562422691'},
 u'entityC': {u'category': u'3',
  u'entity': u'entityC',
  u'length': u'5',
  u'timestamp': u'1562422645'}}

Recommended answer

Assuming your data is assigned to a variable called data, you can use a dictionary comprehension together with sorted. For each entity (set(d.get('entity') for d in data) builds the set of unique entities), sort that entity's messages by timestamp and take the last item (the most recent one) with the [-1] index. Note that this rescans the whole list once per entity, so it slows down as the number of entities grows.

>>> {entity: sorted([d for d in data if d.get('entity') == entity], key=lambda x: x['timestamp'])[-1]
     for entity in set(d.get('entity') for d in data)}
{'entityA': {'length': '1.2',
  'category': '3',
  'entity': 'entityA',
  'timestamp': '1562422692'},
 'entityB': {'length': '0.9',
  'category': '3',
  'entity': 'entityB',
  'timestamp': '1562422690'}}
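
If the comprehension above is too slow, one alternative that needs no extra libraries is a single pass that remembers the newest message seen so far for each entity. This is a minimal sketch, not part of the original answer: latest_by_entity is a hypothetical helper name, and it assumes every message has 'entity' and 'timestamp' keys and that the timestamps are integer strings.

```python
def latest_by_entity(messages):
    """Return {entity: message}, keeping the message with the largest timestamp."""
    latest = {}
    for msg in messages:
        entity = msg['entity']
        current = latest.get(entity)
        # Compare as integers: string comparison is only safe while every
        # timestamp has the same number of digits.
        if current is None or int(msg['timestamp']) > int(current['timestamp']):
            latest[entity] = msg
    return latest


test_data = [
    {'length': '0', 'category': '3', 'entity': 'entityA', 'timestamp': '1562422690'},
    {'length': '1', 'category': '3', 'entity': 'entityA', 'timestamp': '1562422680'},
    {'length': '2', 'category': '3', 'entity': 'entityB', 'timestamp': '1562422691'},
    {'length': '3', 'category': '3', 'entity': 'entityB', 'timestamp': '1562422688'},
    {'length': '4', 'category': '3', 'entity': 'entityC', 'timestamp': '1562422630'},
    {'length': '5', 'category': '3', 'entity': 'entityC', 'timestamp': '1562422645'},
]

result = latest_by_entity(test_data)
print(result)
```

Each message is visited once, so the work grows linearly with the length of the list instead of once per entity. When two messages for the same entity share a timestamp, this sketch keeps the one that appears first.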

A faster method would be to use pandas.

import pandas as pd

df = pd.DataFrame(data).sort_values('timestamp')
result = df.groupby('entity', as_index=False).last()
>>> result
    entity category length   timestamp
0  entityA        3    1.2  1562422692
1  entityB        3    0.9  1562422690

>>> result.to_dict('records')  # newer pandas versions no longer accept the 'r' abbreviation
[{'entity': 'entityA',
  'category': '3',
  'length': '1.2',
  'timestamp': '1562422692'},
 {'entity': 'entityB',
  'category': '3',
  'length': '0.9',
  'timestamp': '1562422690'}]
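
The question asks for a dict of dicts, while the list above is a list of records. A minimal sketch of re-keying it in plain Python, assuming records holds the list shown above and that each entity appears only once in it (records and by_entity are illustrative names):

```python
records = [
    {'entity': 'entityA', 'category': '3', 'length': '1.2', 'timestamp': '1562422692'},
    {'entity': 'entityB', 'category': '3', 'length': '0.9', 'timestamp': '1562422690'},
]

# Key each record by its entity name; the record itself is kept unchanged.
by_entity = {rec['entity']: rec for rec in records}
print(by_entity)
```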
