Cassandra Reading Benchmark with Spark
Question
I'm doing a benchmark on Cassandra's read performance. In the test-setup step I created a cluster with 1 / 2 / 4 EC2 instances and data nodes. I wrote 1 table with 100 million entries (~3 GB CSV file). Then I launch a Spark application which reads the data into an RDD using the spark-cassandra-connector.
However, I expected the behavior to be: the more instances Cassandra uses (with the same number of Spark instances), the faster the reads! With the writes everything seems correct (~2 times faster when the cluster is 2 times larger).
But: In my benchmark the read is always faster with a 1-instance cluster than with a 2- or 4-instance cluster!
My benchmark results:
Cluster-size 4: Write: 1750 seconds / Read: 360 seconds
Cluster-size 2: Write: 3446 seconds / Read: 420 seconds
Cluster-size 1: Write: 7595 seconds / Read: 284 seconds
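The asymmetry becomes clearer when the timings are normalized against the single-node run. A quick Python check of the speedups (using the numbers listed above):

```python
# Benchmark timings from above: nodes -> (write_seconds, read_seconds).
timings = {
    1: (7595, 284),
    2: (3446, 420),
    4: (1750, 360),
}

base_write, base_read = timings[1]
for nodes, (w, r) in sorted(timings.items()):
    # Speedup relative to the 1-node cluster; >1 means faster than 1 node.
    print(f"{nodes} node(s): write speedup {base_write / w:.2f}x, "
          f"read speedup {base_read / r:.2f}x")
```

Writes scale roughly linearly (about 2.2x with 2 nodes, 4.3x with 4 nodes), while reads actually get slower (speedup below 1x) on the larger clusters.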
Additional test using the cassandra-stress tool
I launched the "cassandra-stress" tool on the Cassandra cluster (size 1 / 2 / 3 / 4 nodes), with following results:
Cluster size  Threads  Ops/sec  Time (s)
1             4        10146    30.1
1             8        15612    30.1
1             16       20037    30.2
1             24       24483    30.2
1             121      43403    30.5
1             913      50933    31.7
2             4        8588     30.1
2             8        15849    30.1
2             16       24221    30.2
2             24       29031    30.2
2             121      59151    30.5
2             913      73342    31.8
3             4        7984     30.1
3             8        15263    30.1
3             16       25649    30.2
3             24       31110    30.2
3             121      58739    30.6
3             913      75867    31.8
4             4        7463     30.1
4             8        14515    30.1
4             16       25783    30.3
4             24       31128    31.1
4             121      62663    30.9
4             913      80656    32.4
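Comparing the extreme rows of the table makes the pattern explicit: at low client concurrency the larger cluster is actually slower, and only at very high thread counts does it pull ahead. A small Python check:

```python
# Ops/sec from the cassandra-stress table, keyed by (cluster_size, threads).
ops = {
    (1, 4): 10146, (1, 913): 50933,
    (4, 4): 7463,  (4, 913): 80656,
}

for threads in (4, 913):
    # Throughput of the 4-node cluster relative to the 1-node cluster.
    ratio = ops[(4, threads)] / ops[(1, threads)]
    print(f"{threads} threads: 4-node vs 1-node throughput ratio {ratio:.2f}x")
```

With only 4 client threads the 4-node cluster delivers about 0.74x the single node's throughput; with 913 threads it reaches about 1.58x. The extra capacity only shows up once the client drives enough concurrent load.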
Results: With 4 or 8 threads the single-node cluster is as fast as or faster than the larger clusters!
Results as a diagram:
The data sets are the cluster sizes (1/2/3/4), the x-axis shows the thread count, and the y-axis the ops/sec.
→ Question here: Are these cluster-wide results, or is this a test against a single local node (and therefore only one instance
Can someone give an explanation? Thank you!
Answer
I ran a similar test with a Spark worker running on each Cassandra node.
Using a Cassandra table with 15 million rows (about 1.75 GB of data), I ran a Spark job to create an RDD from the table with each row as a string, and then printed the row count.
Here are the times I got:
1 C* node, 1 spark worker - 1 min. 42 seconds
2 C* nodes, 2 spark workers - 55 seconds
4 C* nodes, 4 spark workers - 35 seconds
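Normalizing these timings against the single-node run shows the scaling behavior (1 min. 42 s = 102 s); a quick Python check:

```python
# Co-located Spark/C* timings from above, in seconds: nodes -> runtime.
times = {1: 102, 2: 55, 4: 35}

for nodes, t in sorted(times.items()):
    speedup = times[1] / t
    # Parallel efficiency = speedup divided by the node count.
    print(f"{nodes} node(s): {speedup:.2f}x speedup, "
          f"efficiency {speedup / nodes:.0%}")
```

That works out to roughly 1.85x on 2 nodes and 2.91x on 4 nodes: not perfectly linear, but a clear improvement with each node added, unlike the non-co-located setup in the question.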
So it seems to scale pretty well with the number of nodes when the Spark workers are co-located with the C* nodes.
By not co-locating your workers with Cassandra, you are forcing all the table data to go across the network. That will be slow and may well be the bottleneck in your environment. If you co-locate them, you benefit from data locality, since Spark will build the RDD partitions from the token ranges local to each machine.
You may also have some other bottleneck. I'm not familiar with EC2 and what it offers. Hopefully it has local disk storage rather than network storage since C* doesn't like network storage.