How to find duplicates based on multiple columns in a rolling window in pandas?
Question
Sample data
{"transaction": {"merchant": "merchantA", "amount": 20, "time": "2019-02-13T10:00:00.000Z"}}
{"transaction": {"merchant": "merchantB", "amount": 90, "time": "2019-02-13T11:00:01.000Z"}}
{"transaction": {"merchant": "merchantC", "amount": 90, "time": "2019-02-13T11:00:10.000Z"}}
{"transaction": {"merchant": "merchantD", "amount": 90, "time": "2019-02-13T11:00:20.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 90, "time": "2019-02-13T11:01:30.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 90, "time": "2019-02-13T11:02:30.000Z"}}
.
.
I have some code like this
import json
import sys

import pandas as pd

df = pd.DataFrame()
for line in sys.stdin:
    data = json.loads(line)
    # df1 = pd.DataFrame(data["transaction"], index=[len(df.index)])
    df1 = pd.DataFrame(data["transaction"], index=[data['transaction']['time']])
    df1['time'] = pd.to_datetime(df1['time'])
    df = pd.concat([df, df1])  # DataFrame.append was removed in pandas 2.0
    # df['count'] = df.rolling('2min', on='time', min_periods=1)['amount'].count()
print(df)
# Number of stored rows with the same merchant and amount as the last record read
print(len(df[df.merchant.eq(data['transaction']['merchant']) & df.amount.eq(data['transaction']['amount'])].index))
Current output
2019-02-13T10:00:00.000Z merchantA 20 2019-02-13 10:00:00
2019-02-13T11:00:01.000Z merchantB 90 2019-02-13 11:00:01
2019-02-13T11:00:10.000Z merchantC 90 2019-02-13 11:00:10
2019-02-13T11:00:20.000Z merchantD 90 2019-02-13 11:00:20
2019-02-13T11:01:30.000Z merchantE 90 2019-02-13 11:01:30
2019-02-13T11:02:30.000Z merchantE 90 2019-02-13 11:02:30
2
Expected output
2019-02-13T10:00:00.000Z merchantA 20 2019-02-13 10:00:00
2019-02-13T11:00:01.000Z merchantB 90 2019-02-13 11:00:01
2019-02-13T11:00:10.000Z merchantC 90 2019-02-13 11:00:10
2019-02-13T11:00:20.000Z merchantD 90 2019-02-13 11:00:20
2019-02-13T11:01:30.000Z merchantE 90 2019-02-13 11:01:30
The data is streaming. If a duplicate record (one whose merchant and amount values are the same as an earlier record) arrives within two minutes, I want to discard it, do no processing on it, and print it as a duplicate.
Do I have to do something with index zipping or groupby? But then how do I match on multiple columns? Or is there some rolling condition on two columns? I can't find any way to do it.
What am I missing here?
Thanks
Edit
# dup = df[df.duplicated(subset=['merchant', 'amount'], keep=False)]
res = df.loc[(df.merchant == data['transaction']['merchant']) & (df.amount == data['transaction']['amount'])].copy()
# res['timediff'] = pd.to_timedelta((data['transaction']['time'] - res['time']), unit='T')
# The raw JSON time is a string, so parse it before subtracting
res['timediff'] = pd.to_datetime(data['transaction']['time']) - res['time']
if len(res.index) > 1:
    print(res)
So I'm trying something like this, and if the result is less than 120 seconds I can process it. But the resulting df is currently in the form of
merchant amount time concat timediff
2019-02-13 11:03:00 merchantF 10 2019-02-13 11:03:00 merchantF10 -1 days +23:59:20
2019-02-13 11:02:20 merchantF 10 2019-02-13 11:02:20 merchantF10 00:00:00
2019-02-13 11:01:30 merchantE 10 2019-02-13 11:01:30 merchantE10 00:01:00
2019-02-13 11:02:00 merchantE 10 2019-02-13 11:02:00 merchantE10 00:00:30
2019-02-13 11:02:30 merchantE 10 2019-02-13 11:02:30 merchantE10 00:00:00
I think the "-1 days +23:59:20" format can be dealt with by taking the absolute value?
How can I convert the time difference into a format that I can compare with 120 seconds? pd.to_timedelta() didn't work for me, or maybe I'm using it wrong.
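For reference, a minimal sketch of that comparison. It assumes the incoming time has been parsed into a `pd.Timestamp` and the stored times are a datetime column; the names `incoming`, `seen`, `within_td` and `within_sec` are made up for illustration. A timedelta Series can be compared against a `pd.Timedelta` directly, or converted to float seconds first; `.abs()` handles the negative "-1 days +23:59:20" style values.

```python
import pandas as pd

# Hypothetical values borrowed from the sample records
incoming = pd.Timestamp("2019-02-13T11:02:20.000Z")
seen = pd.Series(pd.to_datetime([
    "2019-02-13T11:03:00.000Z",
    "2019-02-13T11:02:00.000Z",
    "2019-02-13T10:00:00.000Z",
]))

diff = incoming - seen  # timedelta Series; entries can be negative

# Option 1: compare timedeltas with a Timedelta
within_td = diff.abs() <= pd.Timedelta(seconds=120)

# Option 2: convert to float seconds, then compare with a plain number
within_sec = diff.dt.total_seconds().abs() <= 120

print(within_td.tolist())
print(within_sec.tolist())
```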
Recommended answer
So I made it work, but not with rolling windows, because rolling does not support string types. The feature has been reported and requested on the pandas repo as well.
My solution snippet for the problem:
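# Inside the `for line in sys.stdin:` loop, after df1 has been built:
    if len(df.index) > 0:
        # Rows already kept that share the incoming merchant and amount
        res = df.loc[(df.merchant == data['transaction']['merchant']) & (df.amount == data['transaction']['amount'])]
        # df1['time'] is already a datetime; the raw JSON value is still a string
        timediff = (df1['time'].iloc[0] - res['time']).dt.total_seconds().abs()
        if (timediff <= 120).any():
            continue  # duplicate within two minutes: skip it
    df = pd.concat([df, df1])  # DataFrame.append was removed in pandas 2.0
# After the loop:
print(df)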
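Put together as one self-contained piece, the snippet above might look like the sketch below. It is not the answerer's exact code: the function name `drop_recent_duplicates`, the constant `WINDOW_SECONDS`, and the use of a default integer index instead of the time string are assumptions made so the example runs on its own with any iterable of JSON lines.

```python
import json

import pandas as pd

WINDOW_SECONDS = 120  # the two-minute window from the question


def drop_recent_duplicates(lines):
    """Keep a record unless an already kept record with the same merchant
    and amount lies within WINDOW_SECONDS of it, in either direction."""
    kept = None
    for line in lines:
        tx = json.loads(line)["transaction"]
        row = pd.DataFrame({
            "merchant": [tx["merchant"]],
            "amount": [tx["amount"]],
            "time": [pd.to_datetime(tx["time"])],
        })
        if kept is None:
            kept = row  # first record: nothing to compare against
            continue
        # Kept rows that match on both columns
        same = kept[(kept["merchant"] == tx["merchant"]) & (kept["amount"] == tx["amount"])]
        gap = (row["time"].iloc[0] - same["time"]).dt.total_seconds().abs()
        if (gap <= WINDOW_SECONDS).any():
            continue  # duplicate inside the window: discard
        kept = pd.concat([kept, row], ignore_index=True)
    if kept is None:
        return pd.DataFrame(columns=["merchant", "amount", "time"])
    return kept


sample = [
    '{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:01:30.000Z"}}',
    '{"transaction": {"merchant": "merchantF", "amount": 10, "time": "2019-02-13T11:03:00.000Z"}}',
    '{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:02:00.000Z"}}',
    '{"transaction": {"merchant": "merchantF", "amount": 10, "time": "2019-02-13T11:05:20.000Z"}}',
]
print(drop_recent_duplicates(sample))
```

Passing `sys.stdin` as `lines` gives the streaming behaviour from the question. Because the filter rescans all kept rows for every record, this stays simple but becomes slow for long streams.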
Sample data:
{"transaction": {"merchant": "merchantA", "amount": 20, "time": "2019-02-13T10:00:00.000Z"}}
{"transaction": {"merchant": "merchantB", "amount": 90, "time": "2019-02-13T11:00:01.000Z"}}
{"transaction": {"merchant": "merchantC", "amount": 10, "time": "2019-02-13T11:00:10.000Z"}}
{"transaction": {"merchant": "merchantD", "amount": 10, "time": "2019-02-13T11:00:20.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:01:30.000Z"}}
{"transaction": {"merchant": "merchantF", "amount": 10, "time": "2019-02-13T11:03:00.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:02:00.000Z"}}
{"transaction": {"merchant": "merchantF", "amount": 10, "time": "2019-02-13T11:02:20.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:02:30.000Z"}}
{"transaction": {"merchant": "merchantF", "amount": 10, "time": "2019-02-13T11:05:20.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:00:30.000Z"}}
Output:
merchant amount time
2019-02-13 10:00:00 merchantA 20 2019-02-13 10:00:00
2019-02-13 11:00:01 merchantB 90 2019-02-13 11:00:01
2019-02-13 11:00:10 merchantC 10 2019-02-13 11:00:10
2019-02-13 11:00:20 merchantD 10 2019-02-13 11:00:20
2019-02-13 11:01:30 merchantE 10 2019-02-13 11:01:30
2019-02-13 11:03:00 merchantF 10 2019-02-13 11:03:00
2019-02-13 11:05:20 merchantF 10 2019-02-13 11:05:20
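If the records are already fully loaded rather than streamed, a groupby on both columns is one possible way to sidestep the string limitation of rolling windows. The sketch below is a different technique from the answer above and rests on stated assumptions: the frame is sorted by time (not arrival order), and each record is compared with the previous record of the same merchant and amount, whether or not that one was itself flagged. The column name `is_duplicate` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "merchant": ["merchantE", "merchantF", "merchantE", "merchantF"],
    "amount": [10, 10, 10, 10],
    "time": pd.to_datetime([
        "2019-02-13 11:01:30",
        "2019-02-13 11:03:00",
        "2019-02-13 11:02:00",
        "2019-02-13 11:05:20",
    ]),
})

# Order by time so that diff() measures the gap to the previous matching record
df = df.sort_values("time", kind="stable").reset_index(drop=True)

# Time since the previous record with the same (merchant, amount); NaT for the first
gap = df.groupby(["merchant", "amount"])["time"].diff()

# NaT compares as False, so the first record of each group is never flagged
df["is_duplicate"] = gap <= pd.Timedelta(minutes=2)
print(df)
```

Because it works in time order, this can flag different rows than the streaming loop when records arrive out of order, as some do in the sample data.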