How does Apache Spark work in memory?


Question

On querying Cassandra with a non-indexed column in the where clause, the Spark-Cassandra-Connector's official documentation says:


"To filter rows, you can use the filter transformation provided by Spark. However, this approach causes all rows to be fetched from Cassandra and then filtered by Spark."

I am a bit confused about this. Suppose I have a billion rows with the structure ID, City, State, and Country, where only ID is indexed. If I use City = 'Chicago' in the where clause, would Spark first download all billion rows and then keep only the rows where City = 'Chicago'? Or would it read a chunk of data from Cassandra, run the filter, store the rows that match, then fetch the next chunk, set aside its matching rows, and so on? And if RAM or disk space runs low at any point, would it discard the data that did not match and fetch a new chunk to continue the process?

Also, can someone tell me a general formula for calculating how much disk space it would take to store one bigdecimal column and three text columns for a billion rows?

Recommended answer

Filtering rows can happen either in the database or in Spark. The documentation recommends filtering records in the database as much as possible, instead of doing it in Spark. In practice, that means:

sc.cassandraTable("test", "cars")
  .select("id", "model")
  .where("color = ?", "black")

The statement above runs the color = 'black' filter in Cassandra, the database, so Spark never fetches any records with a color other than black. Instead of pulling the billion records into memory, Spark may load only the few million that happen to have black as the value of the color column.

In contrast, filtering can be done in Spark:

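sc.cassandraTable("test", "cars")
  .select("id", "model", "color") // color must be selected so Spark can filter on it
  .filter(row => row.getString("color") == "black") // evaluated in Spark, not in Cassandra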

This last version loads all billion records into Spark and only then filters them by color there. That is clearly worse than the previous version, which minimizes the amount of memory the Spark cluster needs. So for any simple filtering that the database can handle, use the database/driver/query filters.
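The difference between the two versions can be sketched without a cluster. The following is a plain-Scala simulation, not connector code: the `Car` class, the toy `partitions` table, and both function names are hypothetical. It counts how many rows leave the "database" in each case. It also mirrors how Spark evaluates a filter lazily, one partition at a time through an iterator, so rows that fail the filter are dropped as they stream by instead of piling up (unless the data is cached or collected). The cost of the Spark-side filter is that every row still has to be transferred.

```scala
// Plain-Scala simulation; no Spark or Cassandra required.
// All names here are hypothetical and exist only for illustration.
object FilterSketch {
  final case class Car(id: Int, model: String, color: String)

  // A toy "table" split into partitions, standing in for the slices
  // of a Cassandra table that Spark tasks would read.
  val partitions: Vector[Vector[Car]] = Vector(
    Vector(Car(1, "A", "black"), Car(2, "B", "red")),
    Vector(Car(3, "C", "blue"), Car(4, "D", "black")),
    Vector(Car(5, "E", "green"), Car(6, "F", "white"))
  )

  // Database-side filter: only matching rows leave the "database".
  // Returns (rows transferred, matching rows).
  def pushdown(color: String): (Int, Vector[Car]) = {
    val matched = partitions.flatMap(_.filter(_.color == color))
    (matched.size, matched)
  }

  // Spark-side filter: every row is transferred, but each partition is
  // consumed through an iterator, so only the matching rows are kept.
  def sparkSide(color: String): (Int, Vector[Car]) = {
    var transferred = 0
    val matched = partitions.flatMap { part =>
      part.iterator
        .map { car => transferred += 1; car } // count each row as it arrives
        .filter(_.color == color)
        .toVector
    }
    (transferred, matched)
  }
}
```

Calling `FilterSketch.pushdown("black")` and `FilterSketch.sparkSide("black")` yields the same two cars, but the first transfers 2 rows and the second transfers all 6.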

As for estimating memory requirements, other questions have proposed various approaches; please check this, and this. There's also a good suggestion in Spark's documentation:


"How much memory you will need will depend on your application. To determine how much your application uses for a certain dataset size, load part of your dataset in a Spark RDD and use the Storage tab of Spark's monitoring UI (http://<driver-node>:4040) to see its size in memory. Note that memory usage is greatly affected by storage level and serialization format – see the tuning guide for tips on how to reduce it."
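Once a sample has been measured this way, a rough figure for the full dataset can be extrapolated. The helper below is a minimal sketch that assumes size grows linearly with row count, which only holds if the sample is representative of the whole table; `SizeEstimate` and `estimateBytes` are hypothetical names, and the numbers in the usage note are made up for illustration.

```scala
// Hypothetical helper: linear extrapolation from a measured sample.
object SizeEstimate {
  // sampleRows:  number of rows loaded into the sample RDD
  // sampleBytes: size reported for that sample (e.g. in the Storage tab)
  // totalRows:   number of rows in the full dataset
  def estimateBytes(sampleRows: Long, sampleBytes: Long, totalRows: Long): Long = {
    require(sampleRows > 0, "sample must contain at least one row")
    // BigInt keeps the intermediate product from overflowing a Long.
    (BigInt(sampleBytes) * BigInt(totalRows) / BigInt(sampleRows)).toLong
  }
}
```

For instance, if a 1,000,000-row sample measured 120,000,000 bytes, `SizeEstimate.estimateBytes(1000000L, 120000000L, 1000000000L)` gives 120,000,000,000 bytes (about 120 GB) for a billion rows. The same arithmetic applies to an on-disk measurement, though the in-memory and on-disk sizes of the same data usually differ.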
