Design of Spark + Parquet "database"

Views: 73 Published: 2020/9/4 8:17:04 Tags: apache-spark apache-spark-sql parquet


Problem description

I receive about 100 GB of text files every day, and I want to build an efficient "database" that is accessible from Spark. By "database" I mean the ability to run fast queries over the data (going back about a year) and to add new data incrementally each day, preferably without read locks.

Assuming I want to use Spark SQL and Parquet, what is the best way to achieve this?

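1. Give up on concurrent reads and writes, and append new data to an existing Parquet file.
2. Create a new Parquet file for each day of data, and rely on the fact that Spark can load multiple Parquet files at once, so that I can load, for example, an entire year. This effectively gives me "concurrency".
3. Something else?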
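
For illustration, the first two layouts above might look roughly like this in PySpark. This is only a sketch under stated assumptions, not a recommendation: the `day=YYYY-MM-DD` directory naming, the base path, and all helper names (`daily_path`, `paths_between`, `append_day`, `write_day`, `read_range`) are hypothetical and do not come from the question. `df` is assumed to be a Spark DataFrame and `spark` a SparkSession supplied by the caller.

```python
from datetime import date, timedelta
from typing import List


def daily_path(base: str, day: date) -> str:
    """Directory for one day's Parquet output (assumed naming scheme)."""
    return f"{base}/day={day.isoformat()}"


def paths_between(base: str, start: date, end: date) -> List[str]:
    """Daily directories from start to end, inclusive."""
    n_days = (end - start).days + 1
    return [daily_path(base, start + timedelta(days=i)) for i in range(n_days)]


def append_day(df, base: str) -> None:
    """Option 1: append the new day's rows to a single existing dataset."""
    df.write.mode("append").parquet(base)


def write_day(df, base: str, day: date) -> None:
    """Option 2: write each day to its own directory.

    'overwrite' replaces only that day's directory, so re-running a day
    does not touch the other days.
    """
    df.write.mode("overwrite").parquet(daily_path(base, day))


def read_range(spark, base: str, start: date, end: date):
    """Load several daily directories as one DataFrame.

    Setting basePath lets Spark's partition discovery expose 'day' as a
    column. Every listed directory is assumed to exist.
    """
    paths = paths_between(base, start, end)
    return spark.read.option("basePath", base).parquet(*paths)
```

With this layout, loading a full year would be a call such as `read_range(spark, "/data/events", date(2019, 9, 5), date(2020, 9, 4))`, where `/data/events` is a placeholder path.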

Feel free to suggest other options, but let's assume I'm using Parquet for now; from what I've read, this will be helpful to many others.
