How can I add csv to cassandra db?


Problem Description

I know it can be done in the traditional way, but if I were to use Cassandra DB, is there an easy/quick and agile way to add a CSV to the DB as a set of key-value pairs?

The ability to add time-series data coming via a CSV file is my prime requirement. I am OK to switch to any other database, such as mongodb or rike, if it is conveniently doable there.

Solution

Edit 2 Dec 02, 2017
Please use port 9042. Cassandra access has changed to CQL, with 9042 as the default port; 9160 was the default port for Thrift.

Edit 1
There is a better way to do this without any coding. Look at this answer https://stackoverflow.com/a/18110080/298455
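
For reference, cqlsh's built-in COPY command is one such no-code route; a minimal sketch, assuming the stackoverflow_question table and data.csv used in the steps below, would be:

    cqlsh:mykeyspace> copy stackoverflow_question (id, name, class) from 'data.csv' with header = true;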

However, if you want to pre-process the data or do something custom, you may want to do it yourself. Here is a lengthy method:


  1. Create a column family.

    cqlsh> create keyspace mykeyspace 
    with strategy_class = 'SimpleStrategy' 
    and strategy_options:replication_factor = 1;
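    -- note: the strategy_class / strategy_options syntax above is the older CQL form;
    -- on current Cassandra versions the equivalent should be:
    -- create keyspace mykeyspace
    --   with replication = {'class': 'SimpleStrategy', 'replication_factor': 1};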
    
    cqlsh> use mykeyspace;
    
    cqlsh:mykeyspace> create table stackoverflow_question 
    (id text primary key, name text, class text);
    

    Assuming your CSV is like this:

    $ cat data.csv 
    id,name,class
    1,hello,10
    2,world,20
    

  2. Write a simple Python script to read the file and dump it into your CF, something like this:

    import csv
    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily
    
    # connect to the keyspace over Thrift (port 9160)
    pool = ConnectionPool('mykeyspace', ['localhost:9160'])
    cf = ColumnFamily(pool, "stackoverflow_question")
    
    with open('data.csv', 'rb') as csvfile:
      reader = csv.DictReader(csvfile)
      for row in reader:
        print str(row)
        key = row['id']        # the id column becomes the row key
        del row['id']
        cf.insert(key, row)    # remaining columns are stored as name/value pairs
    
    pool.dispose()
    

  3. Execute this:

    $ python loadcsv.py 
    {'class': '10', 'id': '1', 'name': 'hello'}
    {'class': '20', 'id': '2', 'name': 'world'}
    

  4. Look at the data:

    cqlsh:mykeyspace> select * from stackoverflow_question;
     id | class | name
    ----+-------+-------
      2 |    20 | world
      1 |    10 | hello
    

  5. See also:

    a. Beware of DictReader
    b. Look at Pycassa
    c. Google for existing CSV loaders for Cassandra; I guess there are some.
    d. There may be a simpler way using the CQL driver (see the sketch after this list), I do not know.
    e. Use appropriate data types. I just wrapped them all into text, which is not good.
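
For (d), a rough, untested sketch of the same load using the DataStax cassandra-driver over CQL (port 9042, per Edit 2) instead of pycassa, assuming the same table and data.csv:

    # hedged sketch: load data.csv via the DataStax cassandra-driver (CQL, port 9042)
    import csv
    from cassandra.cluster import Cluster

    cluster = Cluster(['localhost'], port=9042)
    session = cluster.connect('mykeyspace')

    # prepared statement; columns match the stackoverflow_question table above
    insert = session.prepare(
        "insert into stackoverflow_question (id, name, class) values (?, ?, ?)")

    with open('data.csv') as csvfile:
        for row in csv.DictReader(csvfile):
            session.execute(insert, (row['id'], row['name'], row['class']))

    cluster.shutdown()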

HTH


I did not see the time-series requirement. Here is how you do it for time series.

  1. This is your data

    $ cat data.csv
    id,1383799600,1383799601,1383799605,1383799621,1383799714
    1,sensor-on,sensor-ready,flow-out,flow-interrupt,sensor-killAll
    

  2. Create a traditional wide row. (CQL suggests not using COMPACT STORAGE, but this is just to get you going quickly.)

    cqlsh:mykeyspace> create table timeseries 
    (id text, timestamp text, data text, primary key (id, timestamp)) 
    with compact storage;
    

  3. This is the altered code:

    import csv
    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily
    
    pool = ConnectionPool('mykeyspace', ['localhost:9160'])
    cf = ColumnFamily(pool, "timeseries")
    
    with open('data.csv', 'rb') as csvfile:
      reader = csv.DictReader(csvfile)
      for row in reader:
        print str(row)
        key = row['id']    # the id column is the row key
        del row['id']
        # every remaining CSV column becomes one (timestamp, event) cell in the wide row
        for (timestamp, data) in row.iteritems():
          cf.insert(key, {timestamp: data})
    
    pool.dispose()
    

  4. This is your timeseries

    cqlsh:mykeyspace> select * from timeseries;
     id | timestamp  | data
    ----+------------+----------------
      1 | 1383799600 |      sensor-on
      1 | 1383799601 |   sensor-ready
      1 | 1383799605 |       flow-out
      1 | 1383799621 | flow-interrupt
      1 | 1383799714 | sensor-killAll
    
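
The same time-series load can be sketched with the DataStax cassandra-driver over CQL (port 9042) instead of pycassa; again an untested sketch, assuming the timeseries table and the wide data.csv above:

    # hedged sketch: wide CSV -> one CQL row per (id, timestamp) via the CQL driver
    import csv
    from cassandra.cluster import Cluster

    cluster = Cluster(['localhost'], port=9042)
    session = cluster.connect('mykeyspace')

    insert = session.prepare(
        "insert into timeseries (id, timestamp, data) values (?, ?, ?)")

    with open('data.csv') as csvfile:
        for row in csv.DictReader(csvfile):
            key = row.pop('id')                    # first column is the row key
            for timestamp, data in row.items():    # remaining columns: timestamp -> event
                session.execute(insert, (key, timestamp, data))

    cluster.shutdown()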
