Using predicates to filter rows from pyarrow.parquet.ParquetDataset
Problem description
I have a parquet dataset stored on s3, and I would like to query specific rows from the dataset. I was able to do that using petastorm, but now I want to do that using only pyarrow.
Here's my attempt:
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()

dataset = pq.ParquetDataset(
    'analytics.xxx',
    filesystem=fs,
    validate_schema=False,
    filters=[('event_name', '=', 'SomeEvent')]
)

df = dataset.read_pandas().to_pandas()
But that returns a pandas DataFrame as if the filter didn't work, i.e. I still get rows with various values of event_name. Is there something I am missing or something I misunderstood? I could filter after getting the pandas DataFrame, but that would use much more memory than needed.
Recommended answer
Currently, the filters functionality is only implemented at the file level, not yet at the row level.
So if you have a dataset as a collection of multiple, partitioned parquet files in a nested hierarchy (the type of partitioned datasets described here: https://arrow.apache.org/docs/python/parquet.html#partitioned-datasets-multiple-files), you can use the filters argument to only read a subset of the files.
But, you can't yet use it for reading only a subset of the row groups of a single file (see https://issues.apache.org/jira/browse/ARROW-1796).
It would be nice, though, to get an error message when specifying such an invalid filter. I opened an issue for that: https://issues.apache.org/jira/browse/ARROW-5572