How to find duplicates based on multiple columns in a rolling window in pandas?

Views: 70  Published: 2020/6/12 19:37:11  Tags: python sql pandas duplicates rolling-computation


Question

Sample data

{"transaction": {"merchant": "merchantA", "amount": 20, "time": "2019-02-13T10:00:00.000Z"}}
{"transaction": {"merchant": "merchantB", "amount": 90, "time": "2019-02-13T11:00:01.000Z"}}
{"transaction": {"merchant": "merchantC", "amount": 90, "time": "2019-02-13T11:00:10.000Z"}}
{"transaction": {"merchant": "merchantD", "amount": 90, "time": "2019-02-13T11:00:20.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 90, "time": "2019-02-13T11:01:30.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 90, "time": "2019-02-13T11:02:30.000Z"}}
.
.

I have some code like this:

import json
import sys

import pandas as pd

df = pd.DataFrame()
for line in sys.stdin:
    data = json.loads(line)
    # df1 = pd.DataFrame(data["transaction"], index=[len(df.index)])
    # One-row frame per transaction, indexed by its raw time string
    df1 = pd.DataFrame(data["transaction"], index=[data['transaction']['time']])
    df1['time'] = pd.to_datetime(df1['time'])
    df = pd.concat([df, df1])  # DataFrame.append was removed in pandas 2.0
    # df['count'] = df.rolling('2min', on='time', min_periods=1)['amount'].count()

print(df)
# Number of stored rows that match the last record's merchant and amount
print(len(df[df.merchant.eq(data['transaction']['merchant']) & df.amount.eq(data['transaction']['amount'])].index))

Current output

2019-02-13T10:00:00.000Z  merchantA      20 2019-02-13 10:00:00
2019-02-13T11:00:01.000Z  merchantB      90 2019-02-13 11:00:01
2019-02-13T11:00:10.000Z  merchantC      90 2019-02-13 11:00:10
2019-02-13T11:00:20.000Z  merchantD      90 2019-02-13 11:00:20
2019-02-13T11:01:30.000Z  merchantE      90 2019-02-13 11:01:30
2019-02-13T11:02:30.000Z  merchantE      90 2019-02-13 11:02:30

2

Expected output

2019-02-13T10:00:00.000Z  merchantA      20 2019-02-13 10:00:00
2019-02-13T11:00:01.000Z  merchantB      90 2019-02-13 11:00:01
2019-02-13T11:00:10.000Z  merchantC      90 2019-02-13 11:00:10
2019-02-13T11:00:20.000Z  merchantD      90 2019-02-13 11:00:20
2019-02-13T11:01:30.000Z  merchantE      90 2019-02-13 11:01:30

The data is streaming in. If a duplicate record (one whose merchant and amount are both the same as an earlier record) arrives within two minutes, I want to discard it without any processing and just print it as a duplicate.

Do I have to do something with index zipping or groupby? But then how do I compare on multiple columns? Or is there some rolling condition on two columns? I can't find any way to do it.

What am I missing here?

Thanks

Edit

    # Runs inside the loop, once per incoming record.
    # dup = df[df.duplicated(subset=['merchant', 'amount'], keep=False)]
    res = df.loc[(df.merchant == data['transaction']['merchant']) & (df.amount == data['transaction']['amount'])].copy()
    # res['timediff'] = pd.to_timedelta((data['transaction']['time'] - res['time']), unit='T')
    # Parse the incoming time first: a raw string cannot be subtracted from a datetime column
    res['timediff'] = pd.to_datetime(data['transaction']['time']) - res['time']
    if len(res.index) > 1:
        print(res)

So I'm trying something like this; if the result is less than 120 seconds, I can act on it. But the resulting df is currently in this form:

                      merchant  amount                time       concat          timediff
2019-02-13 11:03:00  merchantF      10 2019-02-13 11:03:00  merchantF10 -1 days +23:59:20
2019-02-13 11:02:20  merchantF      10 2019-02-13 11:02:20  merchantF10          00:00:00

2019-02-13 11:01:30  merchantE      10 2019-02-13 11:01:30  merchantE10 00:01:00
2019-02-13 11:02:00  merchantE      10 2019-02-13 11:02:00  merchantE10 00:00:30
2019-02-13 11:02:30  merchantE      10 2019-02-13 11:02:30  merchantE10 00:00:00

I think the "-1 days +23:59:20" format can be dealt with by taking the absolute value?

How can I convert the time difference into a form that I can compare with 120 seconds? pd.to_timedelta() didn't work for me, or maybe I'm using it wrong.

Recommended answer

So I made it work, but not with rolling windows, because rolling does not support string columns. The feature has been reported and requested on the pandas repo as well.
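For readers who already have all records loaded (an assumption; this is not the streaming case in the question), one possible way around the string limitation of rolling is to sort by time and take the gap to the previous row within each (merchant, amount) group. This is only a sketch with made-up rows and a hypothetical `is_duplicate` column. It compares each row with the immediately preceding row of its group, so a chain of closely spaced records is flagged in full, which may differ from the streaming logic in the solution.

```python
import pandas as pd

# Hypothetical batch of records; in the question they arrive one at a time.
df = pd.DataFrame({
    "merchant": ["merchantE", "merchantF", "merchantE", "merchantF"],
    "amount": [10, 10, 10, 10],
    "time": pd.to_datetime([
        "2019-02-13 11:01:30", "2019-02-13 11:03:00",
        "2019-02-13 11:02:00", "2019-02-13 11:05:20",
    ]),
})

df = df.sort_values("time")
# Gap to the previous row with the same merchant and amount (NaT for the first row of a group)
gap = df.groupby(["merchant", "amount"])["time"].diff()
# NaT compares as False, so the first row of each group is never flagged
df["is_duplicate"] = gap <= pd.Timedelta(minutes=2)

print(df)
```

Here only the second merchantE row (30 seconds after the first) is flagged; the second merchantF row arrives 140 seconds after the first and is kept.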

My solution snippet for the problem:

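    # Runs inside the `for line in sys.stdin:` loop, after df1 has been built.
    if len(df.index) > 0:
        # Stored rows with the same merchant and amount as the incoming record
        res = df.loc[(df.merchant == data['transaction']['merchant']) & (df.amount == data['transaction']['amount'])]
        # Absolute gap in seconds between the incoming record and each match
        gap = (df1['time'].iloc[0] - res['time']).dt.total_seconds().abs()
        if (gap <= 120).any():
            continue  # duplicate within two minutes: skip it
    df = pd.concat([df, df1])  # DataFrame.append was removed in pandas 2.0
print(df)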
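
The conversion used in the snippet can be looked at in isolation. The sketch below, using the two merchantF timestamps from the question as illustrative values, shows that "-1 days +23:59:20" is simply how pandas prints a negative 40-second difference, and that `.dt.total_seconds().abs()` turns it into a plain number that can be compared with 120. The variable names are made up for the example.

```python
import pandas as pd

# Illustrative values: the two merchantF timestamps from the question.
stored = pd.Series(pd.to_datetime(["2019-02-13 11:03:00", "2019-02-13 11:02:20"]))
incoming = pd.Timestamp("2019-02-13 11:02:20")

diff = incoming - stored                  # Series of Timedelta values
seconds = diff.dt.total_seconds().abs()   # plain floats, sign removed
within_window = seconds <= 120            # boolean Series

print(diff.iloc[0])            # -1 days +23:59:20  (that is, minus 40 seconds)
print(seconds.tolist())        # [40.0, 0.0]
print(within_window.tolist())  # [True, True]
```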

Sample data:

{"transaction": {"merchant": "merchantA", "amount": 20, "time": "2019-02-13T10:00:00.000Z"}}
{"transaction": {"merchant": "merchantB", "amount": 90, "time": "2019-02-13T11:00:01.000Z"}}
{"transaction": {"merchant": "merchantC", "amount": 10, "time": "2019-02-13T11:00:10.000Z"}}
{"transaction": {"merchant": "merchantD", "amount": 10, "time": "2019-02-13T11:00:20.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:01:30.000Z"}}
{"transaction": {"merchant": "merchantF", "amount": 10, "time": "2019-02-13T11:03:00.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:02:00.000Z"}}
{"transaction": {"merchant": "merchantF", "amount": 10, "time": "2019-02-13T11:02:20.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:02:30.000Z"}}
{"transaction": {"merchant": "merchantF", "amount": 10, "time": "2019-02-13T11:05:20.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:00:30.000Z"}}

Output:

                      merchant  amount                time
2019-02-13 10:00:00  merchantA      20 2019-02-13 10:00:00
2019-02-13 11:00:01  merchantB      90 2019-02-13 11:00:01
2019-02-13 11:00:10  merchantC      10 2019-02-13 11:00:10
2019-02-13 11:00:20  merchantD      10 2019-02-13 11:00:20
2019-02-13 11:01:30  merchantE      10 2019-02-13 11:01:30
2019-02-13 11:03:00  merchantF      10 2019-02-13 11:03:00
2019-02-13 11:05:20  merchantF      10 2019-02-13 11:05:20
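
For anyone who wants to run the approach end to end, here is a self-contained sketch of the same check. It is a variant, not the code from the answer: instead of growing a DataFrame row by row, it keeps the timestamps of accepted records in a dictionary keyed by (merchant, amount) and builds the DataFrame once at the end. The function name `drop_recent_duplicates` and the `window_seconds` parameter are hypothetical, and the sample lines are a subset of the data above.

```python
import json
from collections import defaultdict

import pandas as pd


def drop_recent_duplicates(lines, window_seconds=120):
    """Keep a record unless a kept record with the same merchant and
    amount lies within window_seconds of it (earlier or later)."""
    window = pd.Timedelta(seconds=window_seconds)
    seen = defaultdict(list)  # (merchant, amount) -> timestamps of kept records
    kept = []
    for line in lines:
        tx = json.loads(line)["transaction"]
        ts = pd.to_datetime(tx["time"])
        key = (tx["merchant"], tx["amount"])
        if any(abs(ts - earlier) <= window for earlier in seen[key]):
            continue  # duplicate inside the window: discard
        seen[key].append(ts)
        kept.append({"merchant": tx["merchant"], "amount": tx["amount"], "time": ts})
    return pd.DataFrame(kept, columns=["merchant", "amount", "time"])


sample = [
    '{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:01:30.000Z"}}',
    '{"transaction": {"merchant": "merchantF", "amount": 10, "time": "2019-02-13T11:03:00.000Z"}}',
    '{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:02:00.000Z"}}',
    '{"transaction": {"merchant": "merchantF", "amount": 10, "time": "2019-02-13T11:02:20.000Z"}}',
    '{"transaction": {"merchant": "merchantF", "amount": 10, "time": "2019-02-13T11:05:20.000Z"}}',
]
result = drop_recent_duplicates(sample)
print(result)
```

To read from standard input as in the question, `sys.stdin` can be passed in place of `sample`. The dictionary lookup means each incoming record is compared only with earlier records that share its merchant and amount, rather than filtering the whole DataFrame every time.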
