How to copy data from a Cassandra table to another structure for better performance

Question

In several places it's advised to design our Cassandra tables according to the queries we are going to perform on them. In this article by DataScale they state this:

The truth is that having many similar tables with similar data is a good thing in Cassandra. Limit the primary key to exactly what you’ll be searching with. If you plan on searching the data with a similar, but different criteria, then make it a separate table. There is no drawback for having the same data stored differently. Duplication of data is your friend in Cassandra.

[...]

If you need to store the same piece of data in 14 different tables, then write it out 14 times. There isn’t a handicap against multiple writes.

I have understood this, and now my question is: provided that I have an existing table, say

CREATE TABLE invoices (
    id_invoice int PRIMARY KEY,
    year int,
    id_client int,
    type_invoice text
)

But I want to query by year and type instead, so I'd like to have something like

CREATE TABLE invoices_yr (
    id_invoice int,
    year int,
    id_client int,
    type_invoice text,
    PRIMARY KEY (type_invoice, year)
)

With type_invoice as the partition key and year as the clustering key, what's the preferred way to copy the data from one table to another to perform optimized queries later on?

My Cassandra version:

user@cqlsh> show version;
[cqlsh 5.0.1 | Cassandra 3.5.0 | CQL spec 3.4.0 | Native protocol v4]

Solution

To echo what was said about the COPY command, it is a great solution for something like this.

However, I disagree with what was said about the Bulk Loader, as it is far harder to use. Specifically, you need to run it on every node, whereas COPY only needs to be run on a single node.

To help COPY scale for larger data sets, you can use the PAGETIMEOUT and PAGESIZE parameters.

COPY invoices(id_invoice, year, id_client, type_invoice) 
  TO 'invoices.csv' WITH PAGETIMEOUT=40 AND PAGESIZE=20;

Using these parameters appropriately, I have used COPY to successfully export/import 370 million rows before.
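
The command above covers only the export half. A minimal sketch of the import half, assuming the invoices_yr table from the question already exists and that invoices.csv was written without a header row (as in the export above), might look like this:

```
-- Load the exported rows into the new table.
-- The column list must match the column order used in the export.
COPY invoices_yr(id_invoice, year, id_client, type_invoice)
  FROM 'invoices.csv';
```

One caveat: Cassandra writes are upserts, so with PRIMARY KEY (type_invoice, year) every invoice sharing the same type and year would overwrite the previous one during the import. If every invoice has to survive the copy, one option is to add id_invoice as a further clustering column, for example PRIMARY KEY (type_invoice, year, id_invoice).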

For more info, check out this article titled: New options and better performance in cqlsh copy.
