Why does Spark SQL consider the support of indexes unimportant?


Question

Quoting the Spark DataFrames, Datasets and SQL manual:

A handful of Hive optimizations are not yet included in Spark. Some of these (such as indexes) are less important due to Spark SQL's in-memory computational model. Others are slotted for future releases of Spark SQL.

Being new to Spark, I'm a bit baffled by this for two reasons:

  1. Spark SQL is designed to process Big Data, and at least in my use case the data size far exceeds the size of available memory. Assuming this is not uncommon, what is meant by "Spark SQL's in-memory computational model"? Is Spark SQL recommended only for cases where the data fits in memory?

  2. Even assuming the data fits in memory, a full scan over a very large dataset can take a long time. I read this argument against indexing in in-memory databases, but I was not convinced. The example there discusses a scan of a 10,000,000 record table, but that's not really Big Data. Scanning a table with billions of records can cause simple queries of the "SELECT x WHERE y=z" type to take forever instead of returning immediately.
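To make the scan-versus-lookup gap concrete, here is a toy, stdlib-only Python sketch (synthetic in-process data, not Spark; the row count and names are made up for illustration):

```python
import timeit

# One million synthetic (key, value) rows.
rows = [(i, f"name-{i}") for i in range(1_000_000)]

# An "index": a one-time hash map from key to row.
index = {r[0]: r for r in rows}

def full_scan(target):
    # Analogous to SELECT x WHERE y = z without an index: touch every row.
    return [r for r in rows if r[0] == target]

def indexed_lookup(target):
    # With an index, the same query is a single hash probe.
    return index.get(target)

scan_t = timeit.timeit(lambda: full_scan(999_999), number=5)
probe_t = timeit.timeit(lambda: indexed_lookup(999_999), number=5)
print(f"scan: {scan_t:.4f}s  probe: {probe_t:.6f}s")
```

On this toy data the hash probe is orders of magnitude faster than the linear scan, which is exactly the gap an index closes; the question is why Spark SQL declines to close it.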

I understand that indexes have disadvantages like slower INSERT/UPDATE, space requirements, and so on. But in my use case, I first process and load a large batch of data into Spark SQL, and then explore this data as a whole, without further modifications. Spark SQL is useful for the initial distributed processing and loading of the data, but the lack of indexing makes interactive exploration slower and more cumbersome than I expected it to be.

I'm wondering, then, why the Spark SQL team considers indexes so unimportant that they are off its road map. Is there a different usage pattern that can provide the benefits of indexing without resorting to implementing something equivalent independently?

Answer

Indexing input data:

  • The fundamental reason why indexing external data sources is out of scope for Spark is that Spark is not a data management system but a batch data processing engine. Since it does not own the data it uses, it cannot reliably monitor changes, and as a consequence it cannot maintain indices.
  • If the data source supports indexing, Spark can use it indirectly through mechanisms like predicate pushdown.

Indexing distributed data structures:

  • Standard indexing techniques require a persistent and well-defined data distribution, but data in Spark is typically ephemeral and its exact distribution is nondeterministic.
  • A high-level data layout, achieved by proper partitioning combined with columnar storage and compression, can provide very efficient distributed access without the overhead of creating, storing and maintaining indices. This is a common pattern used by different in-memory columnar systems.

That being said, some forms of indexed structures do exist in the Spark ecosystem. Most notably, Databricks provides a Data Skipping Index on its platform.

Other projects, like Succinct (mostly inactive today), take a different approach and use advanced compression techniques with random access support.

Of course, this raises the question: if you require efficient random access, why not use a system designed as a database from the beginning? There are many choices out there, including at least a few maintained by the Apache Foundation. At the same time, Spark evolves as a project, and the quote you used might not fully reflect future Spark directions.

