How to copy data from a Cassandra table to another structure for better performance


Question

In several places it's advised to design our Cassandra tables according to the queries we are going to perform on them. In this article by DataScale they state this:

The truth is that having many similar tables with similar data is a good thing in Cassandra. Limit the primary key to exactly what you'll be searching with. If you plan on searching the data with a similar, but different criteria, then make it a separate table. There is no drawback for having the same data stored differently. Duplication of data is your friend in Cassandra.

[...]

If you need to store the same piece of data in 14 different tables, then write it out 14 times. There isn't a handicap against multiple writes.

I have understood this, and now my question is: provided that I have an existing table, say

CREATE TABLE invoices (
    id_invoice int PRIMARY KEY,
    year int,
    id_client int,
    type_invoice text
)

But I want to query by year and type instead, so I'd like to have something like

CREATE TABLE invoices_yr (
    id_invoice int,
    year int,
    id_client int,
    type_invoice text,
    PRIMARY KEY (type_invoice, year)
)
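Under this schema, type_invoice is the partition key and year the clustering key, so a query restricted to both can be answered from a single partition. A hypothetical query against the new table (the value 'credit' is an assumed type_invoice, not from the original post):

```sql
-- Hypothetical example: fetch all invoices of one type for a given year.
-- 'credit' is an assumed value for type_invoice.
SELECT id_invoice, id_client
  FROM invoices_yr
 WHERE type_invoice = 'credit'
   AND year = 2016;
```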

With type_invoice as the partition key and year as the clustering key, what's the preferred way to copy the data from one table to the other, so I can perform optimized queries later on?

My Cassandra version:

user@cqlsh> show version;
[cqlsh 5.0.1 | Cassandra 3.5.0 | CQL spec 3.4.0 | Native protocol v4]

Answer

To echo what was said about the COPY command, it is a great solution for something like this.

However, I will disagree with what was said about the Bulk Loader, as it is infinitely harder to use. Specifically, because you need to run it on every node (whereas COPY only needs to be run on a single node).

To help COPY scale for larger data sets, you can use the PAGETIMEOUT and PAGESIZE parameters.

COPY invoices(id_invoice, year, id_client, type_invoice) 
  TO 'invoices.csv' WITH PAGETIMEOUT=40 AND PAGESIZE=20;
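The export above only writes the CSV; to finish the copy, the file can be loaded into the new table with COPY FROM. A sketch of that step, assuming invoices_yr has already been created as shown in the question (the column list must match the order used in the export):

```sql
-- Sketch of the import step; assumes the invoices_yr table already exists.
-- The column list matches the order used in the COPY TO export above.
COPY invoices_yr (id_invoice, year, id_client, type_invoice)
  FROM 'invoices.csv';
```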

Using these parameters appropriately, I have used COPY to successfully export/import 370 million rows before.

For more info, check out this article titled: New options and better performance in cqlsh copy.

