Export large amount of data from Cassandra to CSV


Problem description


I'm using Cassandra 2.0.9 to store quite a big amount of data, let's say 100 GB, in one column family. I would like to export this data to CSV in a fast way. I tried:

  • sstable2json - it produces quite big JSON files which are hard to parse, because the tool puts the data in one row and uses a complicated schema (e.g. a 300 MB data file = ~2 GB of JSON); the dump takes a lot of time, and Cassandra likes to change the source file names according to its internal mechanisms
  • COPY - causes timeouts on quite fast EC2 instances for a big number of records
  • CAPTURE - like above, causes timeouts
  • reads with pagination - I used timeuuid for it, but it returns only about 1.5k records per second (see the sketch just after this list)
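
For reference, a minimal sketch of what the timeuuid-paging approach above might look like with the DataStax Java driver (2.x); the schema and all names here are hypothetical, not from the original post:

    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;
    import com.datastax.driver.core.Statement;
    import java.util.UUID;

    public class ManualTimeuuidPaging {
        // Hypothetical schema: CREATE TABLE events (pk int, ts timeuuid,
        // payload text, PRIMARY KEY (pk, ts));
        static void dumpPartition(Session session, int pk) {
            UUID last = null; // last timeuuid seen in the previous chunk
            while (true) {
                Statement q = (last == null)
                    ? new SimpleStatement(
                        "SELECT ts, payload FROM events WHERE pk = ? LIMIT 1000", pk)
                    : new SimpleStatement(
                        "SELECT ts, payload FROM events WHERE pk = ? AND ts > ? LIMIT 1000",
                        pk, last);
                ResultSet rs = session.execute(q);
                if (rs.isExhausted()) break; // no more rows in this partition
                for (Row row : rs) {
                    last = row.getUUID("ts"); // remember where to resume
                    // ... write the row to CSV here ...
                }
            }
        }
    }

Each chunk costs a full synchronous round trip, which is one plausible reason this pattern tops out at a few thousand rows per second.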

I use an Amazon EC2 instance with fast storage, 15 GB of RAM and 4 cores.

Is there any better option for exporting gigabytes of data from Cassandra to CSV?

Solution

Because using COPY is quite challenging when you are trying to export a table with millions of rows from Cassandra, what I did was create a simple tool that fetches the data chunk by chunk (paginated) from the Cassandra table and exports it to CSV.

Look at my example solution (https://moshimon.wordpress.com/2015/01/19/export-data-from-cassandra-to-csv/) using the Java library from DataStax.
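
As a rough illustration of that chunked approach, here is a minimal sketch using the DataStax Java driver's automatic paging (driver 2.x against Cassandra 2.0+); the keyspace, table, and column names are placeholders:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;
    import com.datastax.driver.core.Statement;
    import java.io.BufferedWriter;
    import java.io.FileWriter;

    public class CassandraCsvExport {
        public static void main(String[] args) throws Exception {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("my_keyspace");
            BufferedWriter out = new BufferedWriter(new FileWriter("export.csv"));
            try {
                Statement stmt = new SimpleStatement("SELECT id, payload FROM my_table");
                stmt.setFetchSize(5000); // rows fetched from the server per page
                ResultSet rs = session.execute(stmt);
                // The driver pulls the next page transparently while iterating,
                // so memory stays bounded even for a very large table.
                for (Row row : rs) {
                    out.write(row.getUUID("id") + "," + row.getString("payload"));
                    out.newLine();
                }
            } finally {
                out.close();
                cluster.close(); // also closes the session
            }
        }
    }

Raising the fetch size reduces the number of round trips; a real exporter would also need to escape commas and quotes in the values before writing them out.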
