Spark HBase/BigTable - Wide/sparse dataframe persistence

Question

I want to persist to BigTable a very wide Spark DataFrame (>100,000 columns) that is sparsely populated (>99% of values are null), while keeping only the non-null values (to avoid storage cost).

Is there a way to tell Spark to ignore null values when writing?

Thanks!

Answer

Probably (I haven't tested it), before writing a Spark DataFrame to HBase/BigTable you can transform it by filtering out the columns that hold null values in each row, using a custom function as suggested in this pandas example: https://stackoverflow.com/a/59641595/3227693. To the best of my knowledge, however, no built-in connector supports this out of the box.
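For illustration, here is a minimal, untested PySpark sketch of that per-row filtering idea. It reshapes the wide DataFrame into (row key, column, value) records and drops the null cells; the `row_key` column name, the cast of values to string, and the toy data are assumptions for the example, and the actual HBase/BigTable write call (which depends on whichever connector you use) is not shown.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sparse-write-sketch").getOrCreate()

# Toy stand-in for the real wide DataFrame: a few columns, mostly nulls,
# plus an explicit row key column (assumed to exist in your data).
wide_df = spark.createDataFrame(
    [("r1", 1, None, None), ("r2", None, None, 7)],
    schema="row_key string, c1 int, c2 int, c3 int",
)

value_cols = [c for c in wide_df.columns if c != "row_key"]

# Turn every row into an array of (column, value) structs; values are cast to
# string here so all structs share one type (a simplifying assumption).
cells = F.array(*[
    F.struct(F.lit(c).alias("column"), F.col(c).cast("string").alias("value"))
    for c in value_cols
])

# One record per cell, keeping only the non-null ones.
sparse_df = (
    wide_df
    .select("row_key", F.explode(cells).alias("cell"))
    .select("row_key", "cell.column", "cell.value")
    .where(F.col("value").isNotNull())
)

# sparse_df maps naturally onto HBase/BigTable's (row key, column qualifier,
# value) model; feed it to your connector's write path of choice.
sparse_df.show()
```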

Alternatively, you can try storing the data in a columnar file format such as Parquet, since these formats handle persistence of sparse columnar data efficiently (at least in terms of output size in bytes). However, to avoid writing many small files (due to the sparse nature of the data), which can hurt write throughput, you will probably need to reduce the number of output partitions before performing the write, i.e. write more rows per Parquet file (see Spark parquet partitioning : Large number of files).
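A hedged sketch of that second option, reusing `wide_df` from the previous example; the partition count and the output path are placeholders you would tune for your data volume and storage:

```python
# Reduce the number of output partitions so each Parquet file holds more rows,
# then write the (still wide) DataFrame as Parquet.
(
    wide_df
    .coalesce(16)  # placeholder: fewer, larger Parquet files instead of many tiny ones
    .write
    .mode("overwrite")
    .parquet("gs://your-bucket/wide_sparse_df/")  # hypothetical output path
)
```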
