Using predicates to filter rows from pyarrow.parquet.ParquetDataset

Views: 304 Posted: 2020/5/24 2:19:30 Tags: python pandas amazon-s3 parquet pyarrow

Problem description

I have a Parquet dataset stored on S3, and I would like to query specific rows from it. I was able to do that with petastorm, but now I want to do it using only pyarrow.

Here is my attempt:

import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()

dataset = pq.ParquetDataset(
    'analytics.xxx', 
    filesystem=fs, 
    validate_schema=False, 
    filters=[('event_name', '=', 'SomeEvent')]
)

df = dataset.read_pandas().to_pandas()

But that returns a pandas DataFrame as if the filter had not been applied, i.e., the rows have various values of event_name. Is there something I am missing or have misunderstood? I could filter after building the pandas DataFrame, but that would use much more memory than needed.
