Filtering with dask read_parquet method gives unwanted results


Problem description

I am trying to read parquet files using the dask read_parquet method and the filters kwarg. However, it sometimes doesn't filter according to the given condition.

Example: creating and saving a data frame with a dates column

import pandas as pd
import numpy as np
import dask.dataframe as dd

nums = range(1, 6)
dates = pd.date_range('2018-07-01', periods=5, freq='1d')
df = pd.DataFrame({'dates': dates, 'nums': nums})

# to_parquet writes to disk and returns None, so there is no need to assign it
dd.from_pandas(df, npartitions=3).to_parquet('test_par', engine='fastparquet')

When I read and filter on the dates column from the 'test_par' folder, it doesn't seem to work:

filters=[('dates', '>', np.datetime64('2018-07-04'))]
df  = dd.read_parquet('test_par', engine='fastparquet', filters=filters).compute()

As you can see in the output, 2018-07-03 and 2018-07-04 are present.

+-------+------------+------+
|       | dates      | nums |
+-------+------------+------+
| index |            |      |
+-------+------------+------+
| 2     | 2018-07-03 | 3    |
+-------+------------+------+
| 3     | 2018-07-04 | 4    |
+-------+------------+------+
| 4     | 2018-07-05 | 5    |
+-------+------------+------+

Am I doing something wrong? Or should I report this on GitHub?

Answer

The filters keyword is a row-group-wise action (row-group is the parquet term for a set of data rows, like a partition of a data-frame). It does not do any filtering within partitions.

When you use filters, you will be excluding partitions in which, according to the max/min statistics in the file, there are no rows in a given partition that can match the given filter. For example, if you specify x > 5, a partition with min=2, max=4 will be excluded, but one with min=2, max=6 will not, even though only some of the latter's rows meet the filter.
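The exclusion rule above can be sketched in a few lines. This is a simplified illustration, not dask's or fastparquet's actual code, and exact boundary handling (e.g. whether max == value counts) depends on the engine; the row-group stats here are made up to mirror the x > 5 example:

```python
def row_group_may_match(col_min, col_max, value):
    # For a "col > value" filter, a row group can be skipped only when
    # its max statistic proves that NO row inside it can satisfy the filter.
    return col_max > value

# Hypothetical (min, max) statistics for two row groups:
groups = {'a': (2, 4), 'b': (2, 6)}

kept = {name for name, (lo, hi) in groups.items()
        if row_group_may_match(lo, hi, 5)}
# group 'a' is dropped (no row can exceed 5); group 'b' is kept,
# even though only some of its rows satisfy x > 5
```

This is why rows such as 2018-07-03 can still appear in the output: their row group's max was high enough to survive the statistics check, and no per-row filtering happens afterwards.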

To filter the data exactly, you should still use the usual syntax

df[df.dates > np.datetime64('2018-07-04')]

in addition to the filters, and view the use of filters as an optional optimisation. Without it, Dask would have to read even the partitions with no good data and then apply the condition, resulting in no rows from those partitions. Better not to load them, if possible.
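A sketch of the combined pattern, using plain pandas for brevity (the same boolean mask works unchanged on the dask dataframe returned by read_parquet, with filters= kept as the optional row-group-skipping optimisation):

```python
import numpy as np
import pandas as pd

# Rebuild the question's example data
df = pd.DataFrame({
    'dates': pd.date_range('2018-07-01', periods=5, freq='1d'),
    'nums': range(1, 6),
})

cutoff = np.datetime64('2018-07-04')
# filters=[('dates', '>', cutoff)] passed to dd.read_parquet may skip whole
# row groups, but the exact per-row filtering is still done by this mask:
exact = df[df.dates > cutoff]
# only the 2018-07-05 row remains
```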
