how can I add csv to cassandra db?
Problem description
I know it can be done in the traditional way, but if I were to use Cassandra DB, is there an easy, quick, and agile way to add a CSV to the DB as a set of key-value pairs?
The ability to add time-series data coming in via a CSV file is my prime requirement. I am OK with switching to any other database, such as MongoDB or Riak, if it is conveniently doable there.
Edit 2 Dec 02, 2017
Please use port 9042. Cassandra access has moved to CQL, whose default port is 9042; 9160 was the default port for Thrift.
Edit 1
There is a better way to do this without any coding. Look at this answer https://stackoverflow.com/a/18110080/298455
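For reference, the no-coding route in that answer is cqlsh's COPY command; a minimal sketch against the example table defined below (the column list and file name are assumptions from the example):

```
cqlsh:mykeyspace> COPY stackoverflow_question (id, name, class)
                  FROM 'data.csv' WITH HEADER = TRUE;
```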
However, if you want to pre-process or do something custom, you may want to do it yourself. Here is a lengthy method:
Create a column family.
cqlsh> create keyspace mykeyspace
       with strategy_class = 'SimpleStrategy'
       and strategy_options:replication_factor = 1;
cqlsh> use mykeyspace;
cqlsh:mykeyspace> create table stackoverflow_question
                  (id text primary key, name text, class text);
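Note that the `strategy_class = ...` form above is the old CQL 2 syntax; in current cqlsh (CQL 3) the equivalent keyspace definition would be:

```
cqlsh> create keyspace mykeyspace
       with replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
```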
Assuming your CSV is like this:
$ cat data.csv
id,name,class
1,hello,10
2,world,20
Write some simple Python code to read the file and dump it into your CF. Something like this:
import csv
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

pool = ConnectionPool('mykeyspace', ['localhost:9160'])
cf = ColumnFamily(pool, "stackoverflow_question")

with open('data.csv', 'rb') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print str(row)
        # the id column becomes the row key; the rest become columns
        key = row['id']
        del row['id']
        cf.insert(key, row)

pool.dispose()
Execute this:
$ python loadcsv.py
{'class': '10', 'id': '1', 'name': 'hello'}
{'class': '20', 'id': '2', 'name': 'world'}
Look at the data:
cqlsh:mykeyspace> select * from stackoverflow_question;

 id | class | name
----+-------+-------
  2 |    20 | world
  1 |    10 | hello
See also:
a. Beware of DictReader
b. Look at Pycassa
c. Google for existing CSV loaders for Cassandra. I guess there are some.
d. There may be a simpler way using the CQL driver; I do not know.
e. Use appropriate data types. I just made them all text. Not good.
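On point (d), and per Edit 2 above, a minimal sketch of the same CSV load using the DataStax cassandra-driver over CQL on port 9042; the keyspace, table, and file names come from the example above, everything else (helper names, prepared statement) is my own assumption:

```python
import csv

def read_rows(path):
    """Parse the CSV into a list of dicts, one per data row."""
    with open(path, newline='') as f:
        return list(csv.DictReader(f))

def load_rows(rows, keyspace='mykeyspace'):
    """Insert parsed rows over CQL; assumes `pip install cassandra-driver`
    and a node listening on 9042 (per Edit 2)."""
    # Import here so read_rows() stays usable without the driver installed.
    from cassandra.cluster import Cluster
    cluster = Cluster(['localhost'], port=9042)
    session = cluster.connect(keyspace)
    insert = session.prepare(
        'INSERT INTO stackoverflow_question (id, name, class) '
        'VALUES (?, ?, ?)')
    for row in rows:
        session.execute(insert, (row['id'], row['name'], row['class']))
    cluster.shutdown()

# usage: load_rows(read_rows('data.csv'))
```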
HTH
I did not see the time-series requirement. Here is how you do it for time series.
This is your data
$ cat data.csv
id,1383799600,1383799601,1383799605,1383799621,1383799714
1,sensor-on,sensor-ready,flow-out,flow-interrupt,sensor-killAll
Create a traditional wide row. (CQL suggests not using COMPACT STORAGE, but this is just to get you going quickly.)
cqlsh:mykeyspace> create table timeseries
                  (id text, timestamp text, data text,
                   primary key (id, timestamp))
                  with compact storage;
This is the altered code:
import csv
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

pool = ConnectionPool('mykeyspace', ['localhost:9160'])
cf = ColumnFamily(pool, "timeseries")

with open('data.csv', 'rb') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print str(row)
        key = row['id']
        del row['id']
        # each remaining CSV column header is an epoch-seconds timestamp
        # mapped to the reading at that instant
        for (timestamp, data) in row.iteritems():
            cf.insert(key, {timestamp: data})

pool.dispose()
This is your timeseries
cqlsh:mykeyspace> select * from timeseries;

 id | timestamp  | data
----+------------+----------------
  1 | 1383799600 | sensor-on
  1 | 1383799601 | sensor-ready
  1 | 1383799605 | flow-out
  1 | 1383799621 | flow-interrupt
  1 | 1383799714 | sensor-killAll
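On point (e) above, the epoch-seconds strings could be stored in a real CQL `timestamp` column (e.g. `timestamp timestamp` instead of `timestamp text` in the table definition) rather than text. A minimal conversion sketch; the helper name is my own:

```python
from datetime import datetime, timezone

def to_timestamp(epoch_str):
    """Turn an epoch-seconds CSV header such as '1383799600' into an
    aware datetime, which CQL drivers can bind to a `timestamp` column."""
    return datetime.fromtimestamp(int(epoch_str), tz=timezone.utc)
```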