Python: read Cassandra data into pandas
Question
What is the proper and fastest way to read Cassandra data into pandas? I currently use the following code, but it is very slow:
import pandas as pd

from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
from cassandra.query import dict_factory

# CASSANDRA_* are connection settings defined elsewhere.
auth_provider = PlainTextAuthProvider(username=CASSANDRA_USER, password=CASSANDRA_PASS)
cluster = Cluster(contact_points=[CASSANDRA_HOST], port=CASSANDRA_PORT,
                  auth_provider=auth_provider)

session = cluster.connect(CASSANDRA_DB)
session.row_factory = dict_factory  # each row is returned as a dict

sql_query = "SELECT * FROM {}.{};".format(CASSANDRA_DB, CASSANDRA_TABLE)

# A one-row DataFrame is created and appended for every row returned.
# Note: DataFrame.append and pd.np were removed in pandas 2.0.
df = pd.DataFrame()
for row in session.execute(sql_query):
    df = df.append(pd.DataFrame(row, index=[0]))

df = df.reset_index(drop=True).fillna(pd.np.nan)
Reading 1000 rows takes 1 minute, and I have a "bit more"... If I run the same query in, e.g., DBeaver, I get the whole result set (~40k rows) within a minute.
Thanks!
Recommended answer
I got the answer on the official mailing list (it works perfectly):
Try to define your own pandas row factory:
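def pandas_factory(colnames, rows):
    # Build a single DataFrame from all rows of a fetched page.
    return pd.DataFrame(rows, columns=colnames)

session.row_factory = pandas_factory
session.default_fetch_size = None  # disable paging so all rows arrive at once

query = "SELECT ..."
rslt = session.execute(query, timeout=None)
df = rslt._current_rows  # private attribute holding the row factory's output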
That's the way I do it, and it should be faster...
If you find a faster method, I'm interested :)
Michael
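One likely contributor to the slowness in the question is that the loop creates a one-row DataFrame and appends it for every row, so the growing frame is copied on each iteration. The sketch below shows the general alternative of collecting the rows first and constructing the frame once. It is an illustration only: `rows_to_frame` is a hypothetical helper, and `sample_rows` with its column names is made-up data standing in for what `session.execute(sql_query)` yields when `dict_factory` is set.

```python
import pandas as pd


def rows_to_frame(rows):
    """Build a DataFrame from an iterable of dict rows in a single step.

    Hypothetical helper: `rows` is assumed to yield one dict per row,
    as the driver does when `session.row_factory = dict_factory`.
    """
    # Materialise the rows first, then construct the frame once, instead of
    # appending a one-row frame (and copying the data) on every iteration.
    return pd.DataFrame(list(rows))


# Made-up stand-in for `session.execute(sql_query)`; not real table data.
sample_rows = [
    {"id": 1, "name": "alice", "score": 3.5},
    {"id": 2, "name": "bob", "score": None},
]

df = rows_to_frame(sample_rows)
print(df)
```

With a real session this would be called as `rows_to_frame(session.execute(sql_query))`. Missing values arrive as `None` and pandas treats them as missing, so a separate `fillna` step should not be needed for numeric columns.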