Apache Drill 对 SQL Server 的性能不佳 [英] Apache Drill has bad performance against SQL Server

查看:78
本文介绍了Apache Drill 对 SQL Server 的性能不佳的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我尝试使用 apache-drill 运行一个简单的 join-aggregate 查询,但速度并不是很好.我的测试查询是:

I tried using apache-drill to run a simple join-aggregate query and the speed wasn't really good. my test query was:

SELECT p.Product_Category, SUM(f.sales)
FROM facts f
JOIN Product p on f.pkey = p.pkey
GROUP BY p.Product_Category

其中事实大约有 422,000 行,产品大约有 600 行.分组返回 4 行.

Where facts has about 422,000 rows and product has 600 rows. the grouping comes back with 4 rows.

首先我在 SqlServer 上测试了这个查询,并在大约 150 毫秒内得到了结果.

First I tested this query on SqlServer and got a result back in about 150ms.

使用钻头,我首先尝试直接连接到 SqlServer 并运行查询,但速度很慢(大约 5 秒).

With drill I first tried to connect directly to SqlServer and run the query, but that was slow (about 5 sec).

然后我尝试将表保存到 json 文件中并从中读取,但速度更慢,所以我尝试了镶木地板文件.

Then I tried saving the tables into json files and reading from them, but that was even slower, so I tried parquet files.

我在大约 3 秒内在第一次运行中得到了结果.下一次运行大约是900ms,然后稳定在500ms左右.

I got the result back in the first run in about 3 sec. next run was about 900ms and then it stabled at about 500ms.

仔细阅读,这毫无意义,而且训练速度应该更快!我尝试了REFRESH TABLE METADATA",但速度没有改变.

From reading around, this makes no sense and drill should be faster! I tried "REFRESH TABLE METADATA", but the speed didn't change.

我是通过钻取命令行在 Windows 上运行它的.

I was running this on windows, through the drill command line.

知道我是否需要一些额外的配置或其他东西吗?

Any idea if I need some extra configuration or something?

谢谢!

推荐答案

Drill 非常快,但它专为大型分布式查询而设计,同时连接多个不同的数据源......和你不会那样使用它.

Drill is very fast, but it's designed for large distributed queries while joining across several different data sources... and you're not using it that way.

SQL Server 是最快的关系数据库之一.数据被高效地存储、缓存在内存中,并且查询在单个进程中运行,因此扫描和连接非常快.相比之下,Apache Drill 有更多的工作要做.它必须将您的查询解释为分布式计划,将其发送到所有钻头进程,然后这些进程查找数据源,使用连接器访问数据,运行查询,将结果返回到第一个节点进行聚合,然后您接收最终输出.

SQL Server is one of the fastest relational databases. Data is stored efficiently, cached in memory, and the query runs in a single process so the scan and join is very quick. Apache Drill has much more work to do in comparison. It has to interpret your query into a distributed plan, send it to all the drillbit processes, which then lookup the data sources, access the data using the connectors, run the query, return the results to the first node for aggregation, and then you receive the final output.

根据数据源,Drill 可能需要读取所有数据并单独过滤,这会增加更多时间.JSON 文件很慢,因为它们是逐行解析的冗长文本文件.Parquet 的速度要快得多,因为它是一种面向列的二进制压缩存储格式,专为高效扫描而设计,尤其是当您仅访问某些列时.

Depending on the data source, Drill might have to read all the data and filter it separately which adds even more time. JSON files are slow because they are verbose text files that are parsed line by line. Parquet is much faster because it's a binary compressed column-oriented storage format designed for efficient scanning, especially when you're only accessing certain columns.

如果你在一台机器上存储了一个小数据集,那么任何关系数据库都会比 Drill 更快.

Drill 使用 Parquet 在 500 毫秒内为您提供结果的事实实际上令人印象深刻,考虑到它需要做多少工作才能为您提供它提供的灵活性.如果您只有几百万行,请坚持使用 SQL 服务器.如果您有数十亿行,那么请使用 SQL Server 列存储功能以列式格式存储数据,从而实现出色的压缩和性能.

The fact that Drill gets you results in 500ms with Parquet is actually impressive considering how much more work it has to do to give you the flexibility it provides. If you only have a few million rows, stick with SQL server. If you have billions of rows, then use the SQL Server columnstore feature to store data in columnar format with great compression and performance.

在以下情况下使用 Apache Drill:

Use Apache Drill when you:

  • 拥有数十亿行或更多行
  • 数据分布在多台机器上
  • 将非结构化数据(如 JSON)存储在没有标准架构的文件中
  • 想要将查询拆分到多台机器上以更快地并行运行
  • 想要访问来自不同数据库和文件系统的数据
  • 想要合并这些不同数据源的数据

这篇关于Apache Drill 对 SQL Server 的性能不佳的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆