Why is Spark still slower than MySQL?


Problem description

I am trying to use Apache Spark with a MySQL data source. I have a cluster with 1 master and 1 slave node, each with 8 GB of RAM and 2 cores, and I submit my SQL query to Spark using spark-shell. The table has 6,402,821 rows, and I am performing a GROUP BY on it. MySQL takes 5.2 seconds, while Spark takes 21 seconds to execute the same query. Why is this happening?

I am also setting configurations like partitionColumn, upperBound, lowerBound, and numPartitions, but there is still no change.

I have also tried executing the query with 1, 2, and 4 cores, but the time taken by Spark is the same: 21 seconds.

Does this problem occur because my MySQL database is on a single machine, and all Spark nodes try to query the data on that single machine?

Can anyone help me solve this issue?

Here is the code I use to query the table demo_call_stats in that database:

val jdbcDF = spark.read.format("jdbc").options(Map(
  "url" -> "jdbc:mysql://192.168.0.31:3306/cmanalytics?user=root&password=",
  "zeroDateTimeBehavior" -> "convertToNull",  // note: the Connector/J option is spelled "Behavior"
  "dbtable" -> "cmanalytics.demo_call_stats",
  "fetchSize" -> "10000",
  "partitionColumn" -> "newpartition",
  "lowerBound" -> "0",
  "upperBound" -> "4",
  "numPartitions" -> "4"
)).load()

jdbcDF.createOrReplaceTempView("call_stats")

val sqlDF = sql("select Count(*), classification_id from call_stats where campaign_id = 77 group by classification_id")

sqlDF.show()

Any help would be greatly appreciated.

Thanks

Answer

There are a couple of things you should understand here:

Despite what you might have heard, Spark isn't "faster than MySQL", simply because a blanket claim like that doesn't mean anything. Spark is faster than MySQL for some queries, and MySQL is faster than Spark for others. Generally speaking, MySQL is a relational database, meaning it was designed to serve as the back-end for an application. It is optimized to access records efficiently, as long as they are indexed.

When thinking about databases, I like to picture them as a library with one librarian who helps you find the books you want (I am talking about a very old-school library here, without any computer to help the librarian).

If you ask your librarian, "I want to know how many books you have about geopolitics", the librarian can go to the Geopolitics shelf and count the books on it.

If you ask your librarian, "I want to know how many books you have with at least 500 pages", the librarian will have to look at every single book in the library to answer your query. In SQL, this is called a full table scan. Of course, you can have several librarians (processors) working on the query to go faster, but you cannot have more than a few of them (say, up to 16) inside your library (computer).

Now, Spark was designed to handle large volumes of data, namely libraries so big that they won't fit into a single building, and even if they did, there would be so many books that even 16 librarians would take days to look through them all to answer your second query.

What makes Spark faster than MySQL is just this: if you spread your books across several buildings, you can have 16 librarians per building working on your answer. You can also handle a larger number of books.

Also, since Spark is mostly built to answer the second type of query rather than queries like "Please bring me The Picture of Dorian Gray, by Oscar Wilde", it doesn't care, at least by default, about sorting your books in any particular way. This means that if you want to find one particular book with Spark, your librarians will have to go through the entire library to find it.

Of course, Spark uses many other kinds of optimizations to perform some queries more efficiently, but indexing is not one of them (if you are familiar with the notion of a primary key in MySQL, there is no such thing in Spark). Other optimizations include storage formats like Parquet and ORC, which let you read only the columns that are useful for answering your query, and compression (e.g. Snappy), which aims to increase the number of books you can fit in your library without having to push out the walls.
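To make the column-pruning point concrete, here is a minimal sketch in the same spirit, assuming a hypothetical output path /tmp/demo_call_stats (Snappy is already the default codec for Parquet in Spark):

jdbcDF.write
  .mode("overwrite")
  .parquet("/tmp/demo_call_stats")  // hypothetical path; written as Snappy-compressed Parquet by default

spark.read.parquet("/tmp/demo_call_stats")
  .select("campaign_id", "classification_id")  // column pruning: only these two columns are read from disk
  .show()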

I hope this metaphor helps, but please bear in mind that it is just a metaphor and doesn't fit reality perfectly.

Now, getting back to the specifics of your question:

Assuming campaign_id is your primary key, or you created an index on this column, MySQL only has to read the rows for which campaign_id = 77. Spark, on the other hand, has to ask MySQL to send the rows of that table over to Spark. If Spark is clever, it will ask only for the rows with campaign_id = 77, and it may send multiple queries to MySQL to fetch ranges in parallel. But this means that all the data MySQL could simply read and aggregate locally has to be serialized, sent to Spark, and aggregated by Spark. I hope you see why this takes longer.
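If you want to check what Spark actually pushes down to MySQL, you can inspect the physical plan. This is just a sketch, and the exact output varies by Spark version:

val sqlDF = spark.sql(
  "select count(*), classification_id from call_stats " +
  "where campaign_id = 77 group by classification_id")
// In the JDBC scan node, look for something like:
// PushedFilters: [IsNotNull(campaign_id), EqualTo(campaign_id,77)]
sqlDF.explain()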

If you want Spark to answer your queries faster than MySQL, you should try copying the table into another format, like this:

// replace this line :
// jdbcDF.createOrReplaceTempView("call_stats")
// with :
jdbcDF.write.format("orc").saveAsTable("call_stats")
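Once saved, the table persists in the metastore, so you can query it directly without touching MySQL. For example, here is a sketch of the same aggregation through the DataFrame API, equivalent to your SQL query:

spark.table("call_stats")        // reads the ORC-backed table, not MySQL
  .where("campaign_id = 77")
  .groupBy("classification_id")
  .count()
  .show()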

Another thing you could try is caching your data, like this:

jdbcDF.cache().createOrReplaceTempView("call_stats")

Caching won't bring any improvement for the first query, since the data is cached while that query runs, but if you keep querying the same view, subsequent queries may be faster. As explained above, though, this doesn't mean Spark will be faster than MySQL for everything.

For small data and local deployments, you can also get a performance improvement by changing the configuration parameter spark.sql.shuffle.partitions=4, which defaults to 200.
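For example, in spark-shell (this setting only affects the number of tasks created after a shuffle, such as the one your GROUP BY triggers):

spark.conf.set("spark.sql.shuffle.partitions", "4")

// or set it once at launch:
// spark-shell --conf spark.sql.shuffle.partitions=4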

Hope this helps.

