python pandas .filter() method using boolean mask

Views: 416  Posted: 2020/5/24 3:51:43  Tags: python pandas filter time-series


Question

I have a dataframe (z) that looks like this:

timestamp                   source  price
2004-01-05 14:55:09+00:00   Bank1   420.975
2004-01-05 14:55:10+00:00   Bank2   421.0
2004-01-05 14:55:22+00:00   Bank1   421.075
2004-01-05 14:55:34+00:00   Bank1   420.975
2004-01-05 14:55:39+00:00   Bank1   421.175
2004-01-05 14:55:45+00:00   Bank1   421.075
2004-01-05 14:55:52+00:00   Bank1   421.175
2004-01-05 14:56:12+00:00   Bank2   421.1
2004-01-05 14:56:33+00:00   Bank1   421.275

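Sometimes there are time windows in which Bank2 submits only 1 quote. I need to throw out every day on which this happens, because I need 2 or more quotes from each bank: if Bank2 appears 1 or fewer times, the whole day is discarded.

I've accomplished this by creating a boolean mask, which I plan to use to keep only the days that satisfy the criterion:

# True for each day/source pair that has more than one quote
r = z.groupby([z.index.date, z['source']]).size() > 1
# True for a day only if every source passed; False means at least one source failed the criterion
r = r.groupby(level=0).all()

This yields: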

2004-01-05  True
2004-01-06  True
2004-01-07  True
2004-01-08  False
2004-01-09  True
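
To make the mask concrete, here is a minimal self-contained sketch. The two-day sample is invented for illustration (it is not the data from the question), and the intermediate name counts is introduced here only for readability. Note that a source with no quotes at all on a given day does not appear in counts, so it cannot make that day fail.

```python
import pandas as pd

# Invented sample: Bank2 quotes twice on 2004-01-05 but only once on 2004-01-08.
idx = pd.to_datetime(
    [
        "2004-01-05 14:55:09", "2004-01-05 14:55:10",
        "2004-01-05 14:55:22", "2004-01-05 14:56:12",
        "2004-01-08 14:55:09", "2004-01-08 14:55:10",
        "2004-01-08 14:55:22",
    ],
    utc=True,
)
z = pd.DataFrame(
    {
        "source": ["Bank1", "Bank2", "Bank1", "Bank2", "Bank1", "Bank2", "Bank1"],
        "price": [420.975, 421.0, 421.075, 421.1, 420.975, 421.0, 421.075],
    },
    index=idx,
)

# Number of quotes for each (day, source) pair that actually occurs.
counts = z.groupby([z.index.date, z["source"]]).size()

# A day passes only if every source seen on that day has more than one quote.
r = (counts > 1).groupby(level=0).all()

print(counts)
print(r)
```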

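Perfect. Now I just need to apply it to the original dataframe z while keeping the original structure (i.e. second-level frequency, not day-by-day). That seems to mean using the groupby .filter() method.

My original dataframe, grouped by day, has the same structure as r (their .shape[0] values are the same):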

2004-01-05  94
2004-01-06  24
2004-01-07  62
2004-01-08  30
2004-01-09  36

Great.

Here's where I get confused. I run:

t = y.groupby(y.index.date).filter(lambda x: [x for x in r])

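and receive TypeError: filter function returned a list, but expected a scalar bool.

Basically, I need the lambda function to simply return the boolean in r that belongs to each group x.

I solved this in a really convoluted way instead (take everything I did before and, rather than storing it in the variable r, make it part of the lambda function):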

t = y.groupby(y.index.date).filter(lambda x: (x.groupby([x.index.date, x['source']]).size() > 1).groupby(level=0).all() == True) # ie. the datetime 0th-level index

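This is super messy, and there must be a basic way to say: here is my dataframe z, then groupby(z.index.date), then .filter() based on the boolean mask r.

Edit: this is what I found in the pandas tutorial, but for some reason the .between_time() part doesn't work. It filters out everything with length <= 1, not only when the .between_time() condition is true.

t = y.groupby([y.index.date, y['source']]).filter(lambda x: len(x.between_time('14:00', '15:00')) > 1)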
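
The TypeError quoted earlier arises because the function passed to .filter() is called once per group and has to return a single True or False for that group, whereas the list comprehension returns the whole of r every time. One possible sketch of "filter by the mask" is to look up each group's own day in r. The sample data is invented for illustration, and reading the day from the group's first timestamp is just one way to identify the group, not necessarily the most idiomatic one.

```python
import pandas as pd

# Invented sample: Bank2 quotes twice on 2004-01-05 but only once on 2004-01-08.
idx = pd.to_datetime(
    [
        "2004-01-05 14:55:09", "2004-01-05 14:55:10",
        "2004-01-05 14:55:22", "2004-01-05 14:56:12",
        "2004-01-08 14:55:09", "2004-01-08 14:55:10",
        "2004-01-08 14:55:22",
    ],
    utc=True,
)
y = pd.DataFrame(
    {
        "source": ["Bank1", "Bank2", "Bank1", "Bank2", "Bank1", "Bank2", "Bank1"],
        "price": [420.975, 421.0, 421.075, 421.1, 420.975, 421.0, 421.075],
    },
    index=idx,
)

# Per-day mask, built the same way as in the question.
r = (y.groupby([y.index.date, y["source"]]).size() > 1).groupby(level=0).all()

# Each group g holds one day's rows, so return that day's single boolean from r.
t = y.groupby(y.index.date).filter(lambda g: bool(r.loc[g.index[0].date()]))

print(t)
```

The result keeps the original per-quote rows and drops only the days whose mask value is False.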

Solution

I think I figured this out for dates:

Create a new column in dataframe z that holds only the date:

z['date'] = z.index.date

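Then keep the rows whose date is marked True in the boolean series r (r.index alone would contain every date, including the False ones, so select with r[r] first):

z[z['date'].isin(r[r].index)]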
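
The same selection can be sketched end to end without adding a column. The sample data is invented for illustration, and the names good_days and keep are introduced here, not taken from the original answer.

```python
import pandas as pd

# Invented sample: Bank2 quotes twice on 2004-01-05 but only once on 2004-01-08.
idx = pd.to_datetime(
    [
        "2004-01-05 14:55:09", "2004-01-05 14:55:10",
        "2004-01-05 14:55:22", "2004-01-05 14:56:12",
        "2004-01-08 14:55:09", "2004-01-08 14:55:10",
        "2004-01-08 14:55:22",
    ],
    utc=True,
)
z = pd.DataFrame(
    {
        "source": ["Bank1", "Bank2", "Bank1", "Bank2", "Bank1", "Bank2", "Bank1"],
        "price": [420.975, 421.0, 421.075, 421.1, 420.975, 421.0, 421.075],
    },
    index=idx,
)

# Per-day mask, built the same way as in the question.
r = (z.groupby([z.index.date, z["source"]]).size() > 1).groupby(level=0).all()

# Index labels (days) where the mask is True.
good_days = r[r].index

# Row-level boolean array: True where the row's day is one of the good days.
keep = pd.Index(z.index.date).isin(good_days)

filtered = z[keep]
print(filtered)
```

Because keep has one entry per row of z, the filtered frame retains the original second-level timestamps.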

