Can't export Cassandra table using Python


Problem Description

I am trying to export a Cassandra table to CSV format using Python, but I can't get it to work. However, I am able to execute a SELECT statement from Python. I have used the following code:

from cassandra.cluster import Cluster
cluster = Cluster()
session = cluster.connect('chandan')  # 'chandan' is the name of the keyspace
# the name of the table is 'emp'
session.execute(""" copy emp (id,name) to 'E:\HANA\emp.csv' with HEADER = true """)
print("Exported to the CSV file")

Please help me with this.

Answer

This does not work because COPY is not part of CQL.

COPY is a cqlsh-only tool.

You can invoke it from the command line or a script by using the -e flag:

cqlsh 127.0.0.1 -u username -p password -e "copy chandan.emp (id,name) to 'E:\HANA\emp.csv' with HEADER = true"
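If the export needs to be triggered from a Python script anyway, one workaround is to shell out to cqlsh from Python. Below is a minimal sketch of that idea; it assumes cqlsh is installed and on the PATH, and reuses the host, credentials, keyspace, and columns from the command above (the helper function name is made up for illustration):

```python
# A minimal sketch: invoking cqlsh COPY from Python via subprocess.
# Assumes cqlsh is installed and on the PATH; host, credentials, and
# output path follow the command-line example above.
import subprocess

def build_cqlsh_copy(host, user, password, keyspace, table, columns, out_path):
    """Assemble the cqlsh command that runs a COPY ... TO export."""
    copy_stmt = "copy {}.{} ({}) to '{}' with HEADER = true".format(
        keyspace, table, ",".join(columns), out_path
    )
    return ["cqlsh", host, "-u", user, "-p", password, "-e", copy_stmt]

cmd = build_cqlsh_copy(
    "127.0.0.1", "username", "password",
    "chandan", "emp", ["id", "name"], "E:/HANA/emp.csv"
)
# subprocess.run(cmd, check=True)  # uncomment to run against a live cluster
```

This keeps the actual export logic inside cqlsh, where COPY is supported, while the Python script only orchestrates the call.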

Edit 20170106:


export Cassandra table to CSV format using Python

Essentially... how do I export an entire Cassandra table?

I get asked this a lot. The short answer... is DON'T.

Cassandra is best used to store millions or even billions of rows. It can do this because it distributes its load (both operational and size) over multiple nodes. What it is not good at are things like deletes, in-place updates, and unbound queries. I tell people not to do things like full exports (unbound queries) for a couple of reasons.

First of all, running an unbound query on a large table in a distributed environment is usually a very bad idea (it introduces a LOT of network time and traffic into your query). Secondly, you are taking a large result set that is stored on multiple nodes and condensing all of that data into a single file... probably also not a good idea.

Bottom line: Cassandra is not a relational database, so why treat it like one?

That being said, there are tools out there designed to handle things like this; Apache Spark is one of them.

Please help me to execute the query with the session.execute() statement.

If you insist on using Python, then you'll need to do a few things. For a large table, you'll want to query by token range. You'll also want to do that in small batches/pages, so that you don't tip over your coordinator node. But to keep you from reinventing the wheel, I'll tell you that there already is a tool (written in Python) that does exactly this: cqlsh COPY.
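If you do go the driver route anyway, a hedged sketch of the paging approach with the DataStax Python driver might look like the following. The keyspace 'chandan' and table 'emp' (columns id, name) are taken from the question; the function names and page size are illustrative, and full token-range splitting (for parallel workers) is not shown:

```python
# A sketch of a paged export with the DataStax Python driver.
# fetch_size keeps each page small, so a single unbound query does not
# overwhelm the coordinator; the driver fetches subsequent pages
# transparently as the result set is iterated.
import csv

def write_rows_to_csv(rows, path, header=("id", "name")):
    """Stream an iterable of (id, name) tuples into a CSV file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for row in rows:
            writer.writerow(row)

def export_emp(session, path, page_size=1000):
    """Export chandan.emp to CSV, paging through the results."""
    from cassandra.query import SimpleStatement
    stmt = SimpleStatement("SELECT id, name FROM chandan.emp",
                           fetch_size=page_size)
    # Generator: rows are streamed to disk page by page, never all in memory.
    rows = ((r.id, r.name) for r in session.execute(stmt))
    write_rows_to_csv(rows, path)
```

Even so, this single-threaded sketch reads the whole table through one coordinator; cqlsh COPY already does the paging (and more) for you.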

In fact, newer versions of cqlsh COPY have features (PAGESIZE and PAGETIMEOUT) that allow it to avoid timeouts on large data sets. I have used the new cqlsh to successfully export 370 million rows before, so I know it can be done.
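For illustration, such a COPY statement with those options might be assembled and invoked from a script like this (the numeric values for PAGESIZE and PAGETIMEOUT below are made-up examples, not tuning recommendations):

```python
# Illustrative only: a cqlsh COPY statement with the paging options
# mentioned above. The numeric values are arbitrary examples.
import subprocess

copy_stmt = (
    "COPY chandan.emp (id,name) TO 'emp.csv' "
    "WITH HEADER = true AND PAGESIZE = 1000 AND PAGETIMEOUT = 60"
)
command = ["cqlsh", "127.0.0.1", "-e", copy_stmt]
# subprocess.run(command, check=True)  # uncomment to run against a live cluster
```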

Summary: don't reinvent the wheel. Write a script that uses cqlsh COPY and takes advantage of everything I just talked about.

