Why is Spark still slower than MySQL?


Problem Description

I am trying to use Apache Spark with MySQL as the data source. I have a cluster with 1 master and 1 slave node, each with 8 GB of RAM and 2 cores. I am submitting my SQL query to Spark using spark-shell, and the table has 6,402,821 rows. I am performing a GROUP BY on that table. MySQL takes 5.2 seconds, while Spark takes 21 seconds for the same query. Why is this happening?

I am also setting some configurations like partitionColumn, upperBound, lowerBound, and numPartitions, but there is still no change.

I have also tried executing the query with 1, 2, and 4 cores, but the time taken by Spark stays the same at 21 seconds.

Does this problem occur because my MySQL database is on a single machine, so all Spark nodes have to query data from that single machine?

Can anyone help me solve this issue?

The database has a table called demo_call_stats, which I am trying to query as follows:

val jdbcDF = spark.read.format("jdbc").options(Map(
  "url" -> "jdbc:mysql://192.168.0.31:3306/cmanalytics?user=root&password=",
  "zeroDateTimeBehavior" -> "convertToNull",
  "dbtable" -> "cmanalytics.demo_call_stats",
  "fetchSize" -> "10000",
  "partitionColumn" -> "newpartition",
  "lowerBound" -> "0",
  "upperBound" -> "4",
  "numPartitions" -> "4"
)).load()

jdbcDF.createOrReplaceTempView("call_stats")

val sqlDF = sql("select Count(*), classification_id from call_stats where campaign_id = 77 group by classification_id")

sqlDF.show()

Any help would be appreciated.

Thanks.

Recommended Answer

There are a couple of things you should understand here:

Despite what you might have heard, Spark isn't 'faster than MySQL', simply because that kind of generalization doesn't mean anything. Spark is faster than MySQL for some queries, and MySQL is faster than Spark for others. Generally speaking, MySQL is a relational database, meaning it was designed to serve as the back-end for an application. It is optimized to access records efficiently, as long as they are indexed.

When thinking about databases, I like to picture them as a library with one librarian to help you get the books you want (I am talking about a very old-school library here, without any computer to help the librarian).

If you ask your librarian, "I want to know how many books you have that are about geopolitics", the librarian can go to the geopolitics shelf and count the number of books on that shelf.

If you ask your librarian, "I want to know how many books you have that have at least 500 pages", the librarian will have to look at every single book in the library to answer your query. In SQL this is called a full table scan. Of course you can have several librarians (processors) working on the query to go faster, but you cannot have more than a few of them (say, up to 16) inside your library (computer).

Now, Spark has been designed to handle large volumes of data, namely libraries that are so big that they won't fit into a single building, and even if they did, there would be so many books that even 16 librarians would take days to look through them all to answer your second query.

What makes Spark faster than MySQL is just this: if you put your books in several buildings, you can have 16 librarians per building working on your answer. You can also handle a larger number of books.

Also, since Spark is mostly designed to answer the second type of query rather than queries like "Please bring me 'The Portrait of Dorian Gray', by Oscar Wilde", it doesn't care, at least by default, about sorting your books in any particular way. This means that if you want to find one particular book with Spark, your librarians will have to go through the entire library to find it.

Of course, Spark uses many other types of optimizations to perform some queries more efficiently, but indexing is not one of them (if you are familiar with the notion of a primary key in MySQL, there is no such thing in Spark). Other optimizations include storage formats like Parquet and ORC, which let you read only the columns that are useful to answer your queries, and compression (e.g. Snappy), which aims at increasing the number of books you can fit in your library without having to push the walls.
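For example, a minimal sketch of writing the same DataFrame as Snappy-compressed Parquet (the output path here is just an illustration):

// Sketch: write the table as Snappy-compressed Parquet (a hypothetical
// output path). A later query that needs only two columns will read
// just those column chunks instead of whole rows.
jdbcDF.write.option("compression", "snappy").parquet("/tmp/demo_call_stats_parquet")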

I hope this metaphor helps, but please bear in mind that it is just a metaphor and doesn't fit reality perfectly.

Now, back to the specifics of your question:

Assuming campaign_id is your primary key, or you created an index on this column, MySQL only has to read the rows for which campaign_id = 77. Spark, on the other hand, has to ask MySQL to send all the rows in that table to Spark. If Spark is clever, it will only ask for the ones with campaign_id = 77, and maybe it will send multiple queries to MySQL to fetch ranges in parallel. But that means all the data that MySQL could simply read and aggregate locally has to be serialized, sent to Spark, and aggregated by Spark. I hope you see why this takes longer.
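If you stay on the JDBC source, you can make that filtering explicit by pushing it into MySQL yourself. A minimal sketch, reusing the connection settings from the question (Spark can also push a simple filter like campaign_id = 77 down on its own, but a subquery in dbtable leaves no doubt):

// Sketch: wrapping the filter in a subquery passed as `dbtable` makes
// MySQL do the filtering, so only matching rows are serialized and
// sent to Spark.
val pushedDF = spark.read.format("jdbc").options(Map(
  "url" -> "jdbc:mysql://192.168.0.31:3306/cmanalytics?user=root&password=",
  "dbtable" -> "(select classification_id from cmanalytics.demo_call_stats where campaign_id = 77) as filtered"
)).load()
pushedDF.groupBy("classification_id").count().show()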

If you want Spark to answer your queries faster than MySQL, you should try copying your table into another format, like this:

// replace this line :
// jdbcDF.createOrReplaceTempView("call_stats")
// with :
jdbcDF.write.format("orc").saveAsTable("call_stats")

Another thing you could try is caching your data, like this:

jdbcDF.cache().createOrReplaceTempView("call_stats")

Caching won't bring any improvement for the first query, since the data is cached while that query runs, but if you keep querying the same view, subsequent queries may be faster. But as I explained above, this doesn't mean Spark will be faster than MySQL for everything.
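A sketch of what that looks like in spark-shell (the first action pays for the JDBC read; later queries hit the in-memory copy):

// First action: pulls the rows from MySQL and populates the cache.
sql("select count(*) from call_stats where campaign_id = 77").show()
// Subsequent queries on the same view are served from memory.
sql("select count(*), classification_id from call_stats group by classification_id").show()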

For small data volumes and local deployments, you can also get a performance improvement by changing this configuration parameter: spark.sql.shuffle.partitions=4 (the default is 200).
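For example, from spark-shell:

// Sketch: lower the number of shuffle partitions for this session; the
// same value can be passed at launch with --conf spark.sql.shuffle.partitions=4
spark.conf.set("spark.sql.shuffle.partitions", "4")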

Hope this helps.
