Why does Spark SQL consider index support unimportant?

Published: 2021/11/14 22:15:43 | Tags: sql, apache-spark, apache-spark-sql, in-memory-database

Question

Quoting the Spark DataFrames, Datasets and SQL manual:

A handful of Hive optimizations are not yet included in Spark. Some of these (such as indexes) are less important due to Spark SQL’s in-memory computational model. Others are slotted for future releases of Spark SQL.

Being new to Spark, I'm a bit baffled by this for two reasons:

Spark SQL is designed to process Big Data, and at least in my use case the data size far exceeds the size of available memory. Assuming this is not uncommon, what is meant by "Spark SQL’s in-memory computational model"? Is Spark SQL recommended only for cases where the data fits in memory?

Even assuming the data fits in memory, a full scan over a very large dataset can take a long time. I read this argument against indexing in in-memory databases, but I was not convinced. The example there discusses a scan of a 10,000,000-record table, but that's not really big data. Scanning a table with billions of records can cause simple queries of the "SELECT x WHERE y=z" type to take forever instead of returning immediately.
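To make the scan-versus-index concern concrete, here is a minimal sketch in plain Python (not Spark). It assumes a hypothetical table of `(x, y)` rows and compares a full scan with a lookup in a hash index built once on column `y`. The names `full_scan`, `build_index`, and `index_lookup` are made up for illustration and do not correspond to any Spark API.

```python
from collections import defaultdict


def full_scan(rows, y_value):
    """Answer SELECT x WHERE y = y_value by checking every row: O(n) per query."""
    return [x for x, y in rows if y == y_value]


def build_index(rows):
    """Build a hash index on column y once: O(n) time plus extra memory."""
    index = defaultdict(list)
    for x, y in rows:
        index[y].append(x)
    return index


def index_lookup(index, y_value):
    """Answer the same query with a single hash probe on the prebuilt index."""
    return index.get(y_value, [])


# Hypothetical table: 100,000 rows, column y takes 1,000 distinct values.
rows = [(i, i % 1000) for i in range(100_000)]
index = build_index(rows)

# Both strategies return the same rows; only the per-query cost differs.
assert full_scan(rows, 42) == index_lookup(index, 42)
```

The scan touches every row on each query, while the index pays its build cost once and then answers each lookup without revisiting the table, which is the trade-off behind the load-once, explore-many-times workload in this question.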

I understand that Indexes have disadvantages like slower INSERT/UPDATE, space requirements, etc. But in my use case, I first process and load a large batch of data into Spark SQL, and then explore this data as a whole, without further modifications. Spark SQL is useful for the initial distributed processing and loading of the data, but the lack of indexing makes interactive exploration slower and more cumbersome than I expected it to be.

I'm wondering, then, why the Spark SQL team considers indexes so unimportant that they are off the road map. Is there a different usage pattern that can provide the benefits of indexing without resorting to implementing something equivalent independently?
