How do I load a lot of data at once in a Cassandra "cluster" of one node?

Question

I am working on a multi-website system that uses Cassandra to handle all of its data needs.

When I first install a website, it adds 3918 pages (and growing) with many fields, attachments such as JS files, links between pages, etc.

At some point, my test "cluster" (one node) decides that the data is coming too fast and it times out or, worse, Cassandra "crashes" with an out-of-memory (OOM) error. From what I can see, the 2 GB of RAM allocated by Cassandra fills up and then, more often than not, Cassandra fails to keep its memory use under control and hits an OOM. When I don't get the OOM, I get timeouts.

Is there a call in the C/C++ driver to know whether the "cluster" is slow so I can wait for a while instead of pushing more data like crazy?

At this point, the only thing I can see is that I do a write (INSERT INTO ...) and get a timeout error, more precisely CASS_ERROR_SERVER_WRITE_TIMEOUT. I find it rather ugly to wait until I get such an error before I start pacing my INSERTs to manage the load. Is that the only way?

Update: I was able to avoid the OOM, but only by reducing the number of plugins that get installed on first website creation (I do not need to have all the plugins installed at once). This is not a good solution, if you ask me, because a Cassandra node should NOT just crash like that. This could happen in production (and probably does to many), and it is intolerable to think that it could happen any time the load goes a tad too high for a minute...

Recommended answer

Single-node clusters are atypical (they're not antipatterns, but they're not the primary use case). You'll have to work around some traditional behaviors.

1) Use synchronous queries instead of asynchronous ones.

2) Make sure you use a real consistency level (QUORUM) even on a single node, as using ANY will let you overwhelm the node.

3) Measure your own query latency. If latencies increase past a certain point (short of a full timeout), back off the insertion rate (artificially sleep).

4) Tune the Cassandra side of the connection. 2 GB is pretty small; to run effectively with that much memory you'll need to do some tuning. You'll probably want to tune your memtable flush thresholds to encourage more frequent flushing, and maybe explicitly configure memtable sizes based on the size of your initial document set.
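
Points 1 and 3 can be combined into a small pacing loop. What follows is only a minimal sketch, not the driver's API: `InsertPacer` and `paced_load` are hypothetical names, the 10 ms starting pause and the doubling/halving policy are arbitrary assumptions, and `write_one` stands in for whatever blocking INSERT call your code makes.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

using Millis = std::chrono::milliseconds;

// Hypothetical helper (not part of any Cassandra driver): decides how long to
// pause between synchronous INSERTs based on the latency of the previous one.
class InsertPacer {
public:
    InsertPacer(Millis slow_threshold, Millis max_delay)
        : slow_threshold_(slow_threshold), max_delay_(max_delay), delay_(0) {
        assert(slow_threshold.count() > 0 && max_delay.count() > 0);
    }

    // Feed in the measured latency of the last write; returns the pause to
    // apply before the next write.
    Millis update(Millis latency) {
        if (latency > slow_threshold_) {
            // The node is slowing down: double the pause (starting at an
            // assumed 10 ms), capped at max_delay_.
            delay_ = std::min(max_delay_, std::max(Millis(10), delay_ * 2));
        } else {
            // The node is keeping up: halve the pause until it reaches zero.
            delay_ = delay_ / 2;
        }
        return delay_;
    }

    Millis delay() const { return delay_; }

private:
    Millis slow_threshold_;  // latency above this counts as "slow"
    Millis max_delay_;       // upper bound on the pause between writes
    Millis delay_;           // current pause
};

// Runs `count` blocking writes one after another, timing each one and
// sleeping for whatever pause the pacer advises. `write_one(i)` is a
// placeholder for a synchronous INSERT of item i.
inline void paced_load(int count, const std::function<void(int)>& write_one,
                       InsertPacer& pacer) {
    for (int i = 0; i < count; ++i) {
        const auto start = std::chrono::steady_clock::now();
        write_one(i);
        const auto latency = std::chrono::duration_cast<Millis>(
            std::chrono::steady_clock::now() - start);
        const Millis pause = pacer.update(latency);
        if (pause.count() > 0) std::this_thread::sleep_for(pause);
    }
}
```

With the DataStax C/C++ driver, `write_one` would presumably wrap `cass_session_execute` followed by `cass_future_wait` and a check of `cass_future_error_code`, with `cass_statement_set_consistency(statement, CASS_CONSISTENCY_QUORUM)` applied to the statement as in point 2. The slow threshold should sit well below the server's write timeout so the loader slows down before CASS_ERROR_SERVER_WRITE_TIMEOUT appears.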
