羽毛和镶木地板有什么区别? [英] What are the differences between feather and parquet?

查看:92
本文介绍了羽毛和镶木地板有什么区别?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这两种都是列(磁盘)存储格式,用于数据分析系统. 两者都集成在 Apache Arrow (箭头对应,作为列式内存分析层.

两种格式有何不同?

在可能的情况下,与熊猫一起工作时,您总是喜欢羽毛吗?

在什么情况下羽毛更合适比 parquet 和 反过来?


附录

我在这里找到了一些提示 https://github.com/wesm/feather/issues/188 , 但是考虑到这个项目的年龄,它可能有点过时了.

这不是一个严格的速度测试,因为我只是在转储整个负载 数据框,但如果您从不给您一些印象 听说过以下格式:

 # IPython    
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq
import fastparquet as fp


df = pd.DataFrame({'one': [-1, np.nan, 2.5],
                   'two': ['foo', 'bar', 'baz'],
                   'three': [True, False, True]})

print("pandas df to disk ####################################################")
print('example_feather:')
%timeit feather.write_feather(df, 'example_feather')
# 2.62 ms ± 35.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
print('example_parquet:')
%timeit pq.write_table(pa.Table.from_pandas(df), 'example.parquet')
# 3.19 ms ± 51 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
print()

print("for comparison:")
print('example_pickle:')
%timeit df.to_pickle('example_pickle')
# 2.75 ms ± 18.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
print('example_fp_parquet:')
%timeit fp.write('example_fp_parquet', df)
# 7.06 ms ± 205 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
print('example_hdf:')
%timeit df.to_hdf('example_hdf', 'key_to_store', mode='w', table=True)
# 24.6 ms ± 4.45 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
print()

print("pandas df from disk ##################################################")
print('example_feather:')
%timeit feather.read_feather('example_feather')
# 969 µs ± 1.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
print('example_parquet:')
%timeit pq.read_table('example.parquet').to_pandas()
# 1.9 ms ± 5.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

print("for comparison:")
print('example_pickle:')
%timeit pd.read_pickle('example_pickle')
# 1.07 ms ± 6.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
print('example_fp_parquet:')
%timeit fp.ParquetFile('example_fp_parquet').to_pandas()
# 4.53 ms ± 260 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
print('example_hdf:')
%timeit pd.read_hdf('example_hdf')
# 10 ms ± 43.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# pandas version: 0.22.0
# fastparquet version: 0.1.3
# numpy version: 1.13.3
# pandas version: 0.22.0
# pyarrow version: 0.8.0
# sys.version: 3.6.3
# example Dataframe taken from https://arrow.apache.org/docs/python/parquet.html

解决方案

  • Parquet格式是为长期存储而设计的,其中Arrow更适合于短期或临时存储(在1.0.0版本发布后,Arrow可能更适合于长期存储,因为二进制格式然后就会稳定)

  • 与Feather相比,Parquet的编写成本更高,因为它具有更多的编码和压缩层.羽毛是未修饰的原始柱状箭头记忆.将来我们可能会在Feather中添加简单的压缩.

  • 由于词典编码,RLE编码和数据页压缩,Parquet文件通常比Feather文件小

  • Parquet是许多分析系统支持的标准分析存储格式:Spark,Hive,Impala,各种AWS服务,将来由BigQuery等支持.因此,如果您要进行分析,Parquet是一个不错的选择作为供多个系统查询的参考存储格式

您显示的基准将非常嘈杂,因为您读取和写入的数据非常小.您应该尝试压缩至少100MB或1GB以上的数据,以获取更多有用的基准信息,例如 http://wesmckinney.com/blog/python-parquet-multithreading/

希望这会有所帮助

Both are columnar (disk-)storage formats for use in data analysis systems. Both are integrated within Apache Arrow (pyarrow package for python) and are designed to correspond with Arrow as a columnar in-memory analytics layer.

How do both formats differ?

Should you always prefer feather when working with pandas when possible?

What are the use cases where feather is more suitable than parquet and the other way round?


Appendix

I found some hints here https://github.com/wesm/feather/issues/188, but given the young age of this project, it's possibly a bit out of date.

Not a serious speed test because I'm just dumping and loading a whole Dataframe but to give you some impression if you never heard of the formats before:

 # IPython    
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq
import fastparquet as fp


df = pd.DataFrame({'one': [-1, np.nan, 2.5],
                   'two': ['foo', 'bar', 'baz'],
                   'three': [True, False, True]})

print("pandas df to disk ####################################################")
print('example_feather:')
%timeit feather.write_feather(df, 'example_feather')
# 2.62 ms ± 35.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
print('example_parquet:')
%timeit pq.write_table(pa.Table.from_pandas(df), 'example.parquet')
# 3.19 ms ± 51 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
print()

print("for comparison:")
print('example_pickle:')
%timeit df.to_pickle('example_pickle')
# 2.75 ms ± 18.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
print('example_fp_parquet:')
%timeit fp.write('example_fp_parquet', df)
# 7.06 ms ± 205 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
print('example_hdf:')
%timeit df.to_hdf('example_hdf', 'key_to_store', mode='w', table=True)
# 24.6 ms ± 4.45 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
print()

print("pandas df from disk ##################################################")
print('example_feather:')
%timeit feather.read_feather('example_feather')
# 969 µs ± 1.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
print('example_parquet:')
%timeit pq.read_table('example.parquet').to_pandas()
# 1.9 ms ± 5.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

print("for comparison:")
print('example_pickle:')
%timeit pd.read_pickle('example_pickle')
# 1.07 ms ± 6.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
print('example_fp_parquet:')
%timeit fp.ParquetFile('example_fp_parquet').to_pandas()
# 4.53 ms ± 260 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
print('example_hdf:')
%timeit pd.read_hdf('example_hdf')
# 10 ms ± 43.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# pandas version: 0.22.0
# fastparquet version: 0.1.3
# numpy version: 1.13.3
# pandas version: 0.22.0
# pyarrow version: 0.8.0
# sys.version: 3.6.3
# example Dataframe taken from https://arrow.apache.org/docs/python/parquet.html

解决方案

  • Parquet format is designed for long-term storage, where Arrow is more intended for short term or ephemeral storage (Arrow may be more suitable for long-term storage after the 1.0.0 release happens, since the binary format will be stable then)

  • Parquet is more expensive to write than Feather as it features more layers of encoding and compression. Feather is unmodified raw columnar Arrow memory. We will probably add simple compression to Feather in the future.

  • Due to dictionary encoding, RLE encoding, and data page compression, Parquet files will often be much smaller than Feather files

  • Parquet is a standard storage format for analytics that's supported by many different systems: Spark, Hive, Impala, various AWS services, in future by BigQuery, etc. So if you are doing analytics, Parquet is a good option as a reference storage format for query by multiple systems

The benchmarks you showed are going to be very noisy since the data you read and wrote is very small. You should try compressing at least 100MB or upwards 1GB of data to get some more informative benchmarks, see e.g. http://wesmckinney.com/blog/python-parquet-multithreading/

Hope this helps

这篇关于羽毛和镶木地板有什么区别?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆