I want to convert very large CSV data to HDF5 in Python

Question

I have a very large CSV file. It looks like this:

[Date, Firm name, value 1, value 2, ..., value 60]

I want to convert this to an HDF5 file. For example, say I have two dates (2019-07-01, 2019-07-02), each date has 3 firms (firm 1, firm 2, firm 3), and each firm has [value 1, value 2, ..., value 60].

I want to use the date and the firm name as groups. Specifically, I want this hierarchy: 'Date/Firm name'.

For example, 2019-07-01 has firm 1, firm 2, and firm 3. When you look at each firm, there are many [value 1, value 2, ..., value 60] rows.

Any ideas?

Thanks.

Recommended answer

There are a lot of ways to approach this problem. Before I show some code, a suggestion: consider your data schema carefully. It matters, because it determines how easily you can access and use the data. For example, your proposed schema makes it easy to access the data for one firm on one date. But what if you want all the data for one firm across a range of dates? Or all the data for all firms on one date? Both will require you to combine multiple arrays after you read the data.

Although it is counterintuitive, you may want to store the CSV data as a single Group/Dataset. The 2 methods below show an example of each layout. Both use np.genfromtxt to read the CSV data. The optional parameter names=True reads the field names from the first row of your CSV file, if it has a header row. If it doesn't, omit names= and you will get default field names (f0, f1, f2, etc.). My sample data is included at the end.
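
To make the effect of names=True concrete, here is a minimal sketch that reads the same two rows with and without a header line. The inline CSV text and the variable names are made up for illustration, and io.StringIO stands in for a real file.

```python
import io
import numpy as np

# Two rows in the same layout as the sample data, cut down to two value columns
csv_text = (
    "Date,Firm,value1,value2\n"
    "2019-07-01,Firm1,0.76,0.58\n"
    "2019-07-01,Firm2,0.65,0.62\n"
)

# With a header row: field names are taken from the first line
with_header = np.genfromtxt(io.StringIO(csv_text), delimiter=',',
                            dtype=None, names=True, encoding=None)
print(with_header.dtype.names)

# Without a header row: drop the first line and omit names=
no_header_text = csv_text.split('\n', 1)[1]
without_header = np.genfromtxt(io.StringIO(no_header_text), delimiter=',',
                               dtype=None, encoding=None)
print(without_header.dtype.names)
```

Either way the result is a structured array, so a column can be selected by field name, for example with_header['Firm'].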

Method 1: using h5py
Group Names: Date
Dataset Names: Firms

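import numpy as np
import h5py

# Read the CSV into a structured array; names=True takes field names from the header row
csv_recarr = np.genfromtxt('SO_57120995.csv', delimiter=',', dtype=None, names=True, encoding=None)
print(csv_recarr.dtype)

with h5py.File('SO_57120995.h5', 'w') as h5f:
    for row in csv_recarr:
        # one group per date; require_group only creates it if it does not exist yet
        date = str(row[0])
        grp = h5f.require_group(date)

        firm = str(row[1])
        # convert the row to a tuple and keep only the value columns
        row_data = row.item()[2:]
        # assumes each date/firm pair appears only once in the CSV
        grp.create_dataset(firm, data=row_data)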
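
To illustrate the schema trade-off mentioned earlier, here is a rough sketch of what reading one firm across all dates looks like with the Date/Firm layout. The helper read_firm_across_dates, the file name demo_by_date.h5, and the values are hypothetical; the demo file is written to a temporary directory so the snippet does not depend on the real CSV.

```python
import os
import tempfile

import h5py
import numpy as np


def read_firm_across_dates(h5_path, firm):
    """Stack the value arrays of one firm over every date group that contains it."""
    dates, rows = [], []
    with h5py.File(h5_path, 'r') as h5f:
        for date in sorted(h5f.keys()):
            grp = h5f[date]
            if firm in grp:
                dates.append(date)
                rows.append(grp[firm][()])  # read the whole dataset into memory
    return dates, np.vstack(rows)


# Build a tiny file in the Date/Firm layout (made-up values, 3 per firm)
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'demo_by_date.h5')
with h5py.File(demo_path, 'w') as h5f:
    for i, date in enumerate(['2019-07-01', '2019-07-02']):
        for j, firm in enumerate(['Firm1', 'Firm2']):
            h5f.require_group(date).create_dataset(firm, data=np.full(3, 10.0 * i + j))

dates, values = read_firm_across_dates(demo_path, 'Firm2')
print(dates)
print(values.shape)
```

Every date group has to be visited and the pieces stacked by hand, which is the extra array manipulation this layout implies for cross-date queries.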

Method 2: using PyTables
All data stored in Dataset: /CSV_Data

import numpy as np
import tables as tb

# Read the CSV into a structured array; names=True takes field names from the header row
csv_recarr = np.genfromtxt('SO_57120995.csv', delimiter=',', dtype=None, names=True, encoding=None)
print(csv_recarr.dtype)

with tb.open_file('SO_57120995_2.h5', 'w') as h5f:
    # this should work, but only the first character of each string is loaded:
    # dset = h5f.create_table('/', 'CSV_Data', obj=csv_recarr)
    # so create an empty table from the dtype instead
    dset = h5f.create_table('/', 'CSV_Data', description=csv_recarr.dtype)

    # workaround: append the CSV data one row at a time
    for row in csv_recarr:
        dset.append([row.item()])

    # example: extract the rows that match a condition on a field
    firm_arr = dset.read_where('Firm==b"Firm1"')
    print(firm_arr)
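
As a rough illustration of why the single-table layout is convenient, the sketch below runs the same kinds of selection on the in-memory structured array with plain NumPy boolean masks. It does not use PyTables at all; the inline CSV text and variable names are made up for the example.

```python
import io
import numpy as np

csv_text = (
    "Date,Firm,value1,value2\n"
    "2019-07-01,Firm1,0.76,0.58\n"
    "2019-07-01,Firm2,0.65,0.62\n"
    "2019-07-02,Firm1,0.41,0.13\n"
    "2019-07-02,Firm2,0.10,0.37\n"
)
table = np.genfromtxt(io.StringIO(csv_text), delimiter=',',
                      dtype=None, names=True, encoding=None)

# All rows for one firm, across every date
firm1_rows = table[table['Firm'] == 'Firm1']

# All rows for one date, across every firm
july_2_rows = table[table['Date'] == '2019-07-02']

# One firm within a date range; ISO-formatted dates sort correctly as plain strings
in_range = (table['Date'] >= '2019-07-01') & (table['Date'] <= '2019-07-02')
firm2_in_range = table[in_range & (table['Firm'] == 'Firm2')]

print(firm1_rows['value1'])
print(july_2_rows['Firm'])
```

Each query is a single mask on one array, whereas the Date/Firm hierarchy needs a loop over groups for anything other than one firm on one date.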

Sample data:

Date,Firm,value1,value2,value3,value4,value5,value6,value7,value8,value9,value10
2019-07-01,Firm1,7.634758e-01,5.781637e-01,8.531480e-01,8.823769e-01,5.780567e-01,3.587480e-01,4.065076e-01,8.520372e-02,3.392133e-01,1.104916e-01
2019-07-01,Firm2,6.457887e-01,6.150677e-01,3.501075e-01,8.886556e-01,5.379832e-01,4.561159e-01,4.773242e-01,7.302280e-01,6.018719e-01,3.835672e-01
2019-07-01,Firm3,3.641129e-01,8.356681e-01,7.783146e-01,1.735361e-01,8.610319e-01,1.360989e-01,5.025533e-01,5.292365e-01,4.964461e-01,7.309130e-01
2019-07-02,Firm1,4.128258e-01,1.339008e-01,3.530394e-02,5.293509e-01,3.608783e-01,6.647519e-01,2.898612e-01,5.632466e-01,5.981161e-01,9.149318e-01
2019-07-02,Firm2,1.037654e-01,3.717925e-01,4.876283e-01,5.852448e-01,4.689806e-01,2.508458e-01,7.243468e-02,3.510882e-01,8.290331e-01,7.808357e-01
2019-07-02,Firm3,8.443163e-01,5.408783e-01,8.278920e-01,8.454836e-01,7.331165e-02,4.167235e-01,6.187155e-01,6.114338e-01,2.299935e-01,5.206390e-01
2019-07-03,Firm1,2.281612e-01,2.660087e-02,3.809895e-01,8.032823e-01,2.492683e-03,9.600432e-02,5.059484e-01,1.795972e-01,2.174838e-01,3.578077e-01
2019-07-03,Firm2,2.403236e-01,1.497736e-01,7.357259e-01,2.501746e-01,2.826287e-01,3.335158e-01,7.742914e-01,1.773830e-01,8.407694e-01,7.466135e-01
2019-07-03,Firm3,8.806318e-01,1.164414e-01,6.791358e-01,4.752967e-01,3.695451e-01,9.728813e-01,3.553896e-01,2.559315e-01,6.942147e-01,2.701471e-01
2019-07-04,Firm1,2.153168e-01,5.169252e-01,5.136280e-01,7.517068e-01,1.977217e-01,7.221689e-01,5.877799e-01,9.099813e-02,9.073012e-03,5.946624e-01
2019-07-04,Firm2,8.275230e-01,9.725115e-01,5.218725e-03,7.728741e-01,4.371698e-01,3.593862e-02,3.448388e-01,7.443235e-01,2.606604e-01,9.888835e-02
2019-07-04,Firm3,8.599242e-01,8.336458e-01,1.451350e-01,9.777518e-02,3.335788e-01,1.117006e-01,9.105203e-01,3.478112e-01,8.948065e-01,3.105299e-01
