Export a large amount of data from Cassandra to CSV

Question

I'm using Cassandra 2.0.9 to store a fairly large amount of data, say 100 GB, in one column family. I would like to export this data to CSV quickly. I tried:

  • sstable2json - produces very large JSON files that are hard to parse, because the tool puts the data on one line and uses a complicated schema (e.g. a 300 MB data file becomes ~2 GB of JSON); the dump takes a long time, and Cassandra tends to rename the source files as part of its internal mechanisms
  • COPY - causes timeouts on quite fast EC2 instances for large numbers of records
  • CAPTURE - like above, causes timeouts
  • reads with pagination - I used a timeuuid for paging, but this returns only about 1.5k records per second

I use an Amazon EC2 instance with fast storage, 15 GB of RAM and 4 cores.

Is there a better option for exporting gigabytes of data from Cassandra to CSV?

Recommended answer

Update for 2020: DataStax provides a dedicated tool called DSBulk for loading and unloading data from Cassandra (starting with Cassandra 2.1) and DSE (starting with DSE 4.7/4.8). In the simplest case, the command line looks like this:

dsbulk unload -k keyspace -t table -url path_to_unload

DSBulk is heavily optimized for loading and unloading operations and has many options, including importing from and exporting to compressed files, providing custom queries, etc.
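As a rough illustration of the custom-query and compressed-output options mentioned above, the following is a minimal sketch rather than a definitive recipe. The host, keyspace, table and output directory are placeholders introduced here, not values from the question, and the `--connector.csv.compression` option is assumed to be available in the installed DSBulk version (it was added in later 1.x releases). The script only prints the command unless `DRY_RUN` is set to `0`.

```shell
#!/bin/sh
# Sketch: unload one table to gzip-compressed CSV files with DSBulk.
# All values below are placeholders; override them via the environment.
set -eu

CASS_HOST="${CASS_HOST:-127.0.0.1}"          # contact point (assumption)
CASS_KEYSPACE="${CASS_KEYSPACE:-my_keyspace}" # hypothetical keyspace name
CASS_TABLE="${CASS_TABLE:-my_table}"          # hypothetical table name
UNLOAD_DIR="${UNLOAD_DIR:-/tmp/unload}"       # directory that receives the CSV files
DRY_RUN="${DRY_RUN:-1}"                       # 1 = print the command, 0 = run it

unload_table() {
  # Build the argument list in the positional parameters so that the
  # query string stays a single argument when the command is executed.
  set -- dsbulk unload \
    -h "$CASS_HOST" \
    -query "SELECT * FROM ${CASS_KEYSPACE}.${CASS_TABLE}" \
    -url "$UNLOAD_DIR" \
    --connector.csv.compression gzip

  if [ "$DRY_RUN" = "1" ]; then
    # Display only: arguments are joined with spaces, without shell quoting.
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

unload_table
```

Running it with `DRY_RUN=0 CASS_KEYSPACE=ks CASS_TABLE=events sh unload.sh` (the file name is arbitrary) would invoke DSBulk for real. Replacing `SELECT *` with a narrower column list is one way to reduce the amount of data written.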

There is a series of blog posts about DSBulk that provides more information and examples: 1, 2, 3, 4, 5, 6
