Design of Spark + Parquet "database"

Published: 2021/11/14 22:42:06. Tags: apache-spark, apache-spark-sql, parquet


Problem description

I've got 100G of text files coming in daily, and I wish to create an efficient "database" accessible from Spark. By "database" I mean the ability to execute fast queries on the data (going back about a year) and to incrementally add data each day, preferably without read locks.

Assuming I want to use Spark SQL and parquet, what's the best way to achieve this?

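1. Give up on concurrent reads/writes and append new data to the existing parquet file.
2. Create a new parquet file for each day of data, and use the fact that Spark can load multiple parquet files to allow me to load e.g. an entire year. This effectively gives me "concurrency".
3. Something else?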

Feel free to suggest other options, but let's assume I'm using parquet for now, as from what I've read this will be helpful to many others.
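To make the "one parquet file per day" option concrete, here is a minimal sketch of how the daily locations could be enumerated for a date range. The directory layout (`date=YYYY-MM-DD` under a base path), the base path `/data/events`, and the helper name `daily_partition_paths` are assumptions for illustration only; they are not part of the question.

```python
from datetime import date, timedelta


def daily_partition_paths(base_path, start, end):
    """Return one directory path per calendar day in [start, end], inclusive.

    Hypothetical layout: <base_path>/date=YYYY-MM-DD, one directory per day.
    """
    if end < start:
        raise ValueError("end must not be before start")
    root = base_path.rstrip("/")
    num_days = (end - start).days + 1
    return [
        f"{root}/date={(start + timedelta(days=i)).isoformat()}"
        for i in range(num_days)
    ]


# Example: the three daily directories covering 1-3 January 2021.
paths = daily_partition_paths("/data/events", date(2021, 1, 1), date(2021, 1, 3))
for p in paths:
    print(p)
```

With PySpark, a list like this can be passed as `spark.read.parquet(*paths)`, since the reader accepts several paths at once. Each day's load would then write only its own new directory and leave the existing ones untouched, which is the sense in which the second option avoids readers and the daily writer touching the same files.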
