Why does Spark SQL consider the support of indexes unimportant?


Question

Quoting the Spark DataFrames, Datasets and SQL manual:

A handful of Hive optimizations are not yet included in Spark. Some of these (such as indexes) are less important due to Spark SQL's in-memory computational model. Others are slotted for future releases of Spark SQL.

Being new to Spark, I'm a bit baffled by this for two reasons:

  1. Spark SQL is designed to process Big Data, and at least in my use case the data size far exceeds the size of available memory. Assuming this is not uncommon, what is meant by "Spark SQL's in-memory computational model"? Is Spark SQL recommended only for cases where the data fits in memory?

  2. Even assuming the data fits in memory, a full scan over a very large dataset can take a long time. I read this argument against indexing in in-memory databases, but I was not convinced. The example there discusses a scan of a 10,000,000 record table, but that's not really Big Data. Scanning a table with billions of records can cause simple queries of the "SELECT x WHERE y=z" type to take forever instead of returning immediately.
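To make the scan-versus-lookup gap concrete, here is a toy, stdlib-only Python sketch (synthetic in-process data, not Spark; the row count and names are made up for illustration):

```python
import timeit

# One million synthetic (key, value) rows.
rows = [(i, f"name-{i}") for i in range(1_000_000)]

# An "index": a one-time hash map from key to row.
index = {r[0]: r for r in rows}

def full_scan(target):
    # Analogous to SELECT x WHERE y = z without an index: touch every row.
    return [r for r in rows if r[0] == target]

def indexed_lookup(target):
    # With an index, the same query is a single hash probe.
    return index.get(target)

scan_t = timeit.timeit(lambda: full_scan(999_999), number=5)
probe_t = timeit.timeit(lambda: indexed_lookup(999_999), number=5)
print(f"scan: {scan_t:.4f}s  probe: {probe_t:.6f}s")
```

On this toy data the hash probe is orders of magnitude faster than the linear scan, which is exactly the gap an index closes; the question is why Spark SQL declines to close it.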

I understand that indexes have disadvantages like slower INSERT/UPDATE, space requirements, and so on. But in my use case, I first process and load a large batch of data into Spark SQL, and then explore this data as a whole, without further modifications. Spark SQL is useful for the initial distributed processing and loading of the data, but the lack of indexing makes interactive exploration slower and more cumbersome than I expected it to be.

I'm wondering, then, why the Spark SQL team considers indexes so unimportant that they are off its road map. Is there a different usage pattern that can provide the benefits of indexing without resorting to implementing something equivalent independently?

Answer

Indexing input data:

  • The fundamental reason why indexing external data sources is out of scope for Spark is that Spark is not a data management system but a batch data processing engine. Since it does not own the data it uses, it cannot reliably monitor changes, and as a consequence it cannot maintain indices.
  • If the data source supports indexing, Spark can use it indirectly through mechanisms like predicate pushdown.

Indexing distributed data structures:

  • Standard indexing techniques require a persistent and well-defined data distribution, but data in Spark is typically ephemeral and its exact distribution is nondeterministic.
  • A high-level data layout, achieved by proper partitioning combined with columnar storage and compression, can provide very efficient distributed access without the overhead of creating, storing and maintaining indices. This is a common pattern used by different in-memory columnar systems.

That being said, some forms of indexed structures do exist in the Spark ecosystem. Most notably, Databricks provides a Data Skipping Index on its platform.

Other projects, like Succinct (mostly inactive today), take a different approach and use advanced compression techniques with random access support.

Of course, this raises the question: if you require efficient random access, why not use a system designed as a database from the beginning? There are many choices out there, including at least a few maintained by the Apache Foundation. At the same time, Spark evolves as a project, and the quote you used might not fully reflect future Spark directions.

