I want to convert very large csv data to hdf5 in python
Problem Description
I have a very large CSV dataset. It looks like this.
[Date, Firm name, value 1, value 2, ..., value 60]
I want to convert this to an HDF5 file. For example, let's say I have two dates (2019-07-01, 2019-07-02), each date has 3 firms (firm 1, firm 2, firm 3), and each firm has [value 1, value 2, ... value 60].
I want to use date and firm name as a group. Specifically, I want this hierarchy: 'Date/Firm name'.
For example, 2019-07-01 has firm 1, firm 2, and firm 3. When you look at each firm, there are many [value 1, value 2, ... value 60] rows.
Any ideas?
Thanks.
Answer
There are A LOT of ways to approach this problem. Before I show some code, a suggestion: consider your data schema carefully. It is important, and it will affect how easily you can access and use the data. For example, your proposed schema makes it easy to access the data for one firm on one date. What if you want all the data for one firm across a range of dates? Or all the data for all firms on one date? Both will require you to manipulate multiple arrays after you access the data.
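To make the trade-off concrete, here is a small sketch (with made-up values) of the flat alternative: if every row lives in one structured array, both of those queries reduce to a single boolean mask, with no group traversal.

import numpy as np

# Hypothetical flat layout: one structured array holding every CSV row.
flat = np.array(
    [('2019-07-01', 'Firm1', 0.76), ('2019-07-01', 'Firm2', 0.64),
     ('2019-07-02', 'Firm1', 0.41), ('2019-07-02', 'Firm2', 0.10)],
    dtype=[('Date', 'U10'), ('Firm', 'U5'), ('value1', 'f8')])

# One firm across all dates: a single mask on the Firm field.
firm1 = flat[flat['Firm'] == 'Firm1']
# All firms for one date: the same pattern on the Date field.
day1 = flat[flat['Date'] == '2019-07-01']
print(firm1['Date'], day1['Firm'])

With the 'Date/Firm' hierarchy, either query would require looping over groups and stitching the pieces back together.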
Although counterintuitive, you may want to store the CSV data as a single Group/Dataset. I will show an example of each approach in the 2 methods below. Both methods use np.genfromtxt to read the CSV data. The optional parameter names=True reads the headers from row one of your CSV file if you have them. Omit names= if you don't have a header row and you will get default field names (f0, f1, f2, etc.). My sample data is included at the end.
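A quick check of what names=True produces (using a two-row inline sample rather than a file):

import io
import numpy as np

# Small made-up CSV; names=True takes field names from the header row,
# and dtype=None infers a type per column (strings for Date/Firm, floats for values).
csv_text = ("Date,Firm,value1,value2\n"
            "2019-07-01,Firm1,0.5,0.7\n"
            "2019-07-01,Firm2,0.1,0.2\n")
arr = np.genfromtxt(io.StringIO(csv_text), delimiter=',', dtype=None,
                    names=True, encoding=None)
print(arr.dtype.names)   # field names taken from the header

The result is a structured array, so columns are addressable by name (arr['Firm'], arr['value1']), which both methods below rely on.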
Method 1: using h5py
Group Names: Date
Dataset Names: Firms
import numpy as np
import h5py

# read the CSV into a structured array; names=True takes field names from the header row
csv_recarr = np.genfromtxt('SO_57120995.csv', delimiter=',', dtype=None, names=True, encoding=None)
print(csv_recarr.dtype)

with h5py.File('SO_57120995.h5', 'w') as h5f:
    for row in csv_recarr:
        date = row[0]
        # require_group() creates the date group the first time it is seen, then reuses it
        grp = h5f.require_group(date)
        firm = row[1]
        # convert the structured row to a tuple and keep only the value1..value60 entries
        row_data = row.item()[2:]
        grp.create_dataset(firm, data=row_data)
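Once written, any row can be read back directly by its 'Date/Firm' path. A self-contained sketch (it builds a tiny demo file with made-up values rather than reusing the one above):

import numpy as np
import h5py

# Recreate a miniature version of the layout, then read one row back by path.
# h5py creates the intermediate date group automatically from the path string.
with h5py.File('SO_57120995_demo.h5', 'w') as h5f:
    h5f.create_dataset('2019-07-01/Firm1', data=np.arange(10, dtype='f8'))

with h5py.File('SO_57120995_demo.h5', 'r') as h5f:
    vals = h5f['2019-07-01/Firm1'][:]
print(vals[:3])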
Method 2: using PyTables
All data stored in Dataset: /CSV_Data
import numpy as np
import tables as tb

csv_recarr = np.genfromtxt('SO_57120995.csv', delimiter=',', dtype=None, names=True, encoding=None)
print(csv_recarr.dtype)

with tb.open_file('SO_57120995_2.h5', 'w') as h5f:
    # this should work, but only the first character of each string field is loaded:
    # dset = h5f.create_table('/', 'CSV_Data', obj=csv_recarr)
    # so create an empty table from the dtype instead
    dset = h5f.create_table('/', 'CSV_Data', description=csv_recarr.dtype)
    # workaround: append the CSV data one row at a time
    for row in csv_recarr:
        dset.append([row.item()])
    # example: extract an array of rows matching a field value
    # (strings are stored as bytes, hence the b"..." literal in the condition)
    firm_arr = dset.read_where('Firm==b"Firm1"')
    print(firm_arr)
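The same read_where() call answers the other query from the schema discussion (all firms for one date). A self-contained sketch with a tiny made-up table; note it builds the input with byte-string ('S') fields, which in my experience sidesteps the one-character truncation worked around above:

import numpy as np
import tables as tb

# Miniature table with byte-string key columns and made-up values.
rows = np.array(
    [('2019-07-01', 'Firm1', 0.76), ('2019-07-01', 'Firm2', 0.64),
     ('2019-07-02', 'Firm1', 0.41)],
    dtype=[('Date', 'S10'), ('Firm', 'S5'), ('value1', 'f8')])

with tb.open_file('SO_57120995_query.h5', 'w') as h5f:
    dset = h5f.create_table('/', 'CSV_Data', obj=rows)
    # all firms for one date, selected server-side by the condition string
    day1 = dset.read_where('Date == b"2019-07-01"')
print(day1['Firm'])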
Sample data:
Date,Firm,value1,value2,value3,value4,value5,value6,value7,value8,value9,value10
2019-07-01,Firm1,7.634758e-01,5.781637e-01,8.531480e-01,8.823769e-01,5.780567e-01,3.587480e-01,4.065076e-01,8.520372e-02,3.392133e-01,1.104916e-01
2019-07-01,Firm2,6.457887e-01,6.150677e-01,3.501075e-01,8.886556e-01,5.379832e-01,4.561159e-01,4.773242e-01,7.302280e-01,6.018719e-01,3.835672e-01
2019-07-01,Firm3,3.641129e-01,8.356681e-01,7.783146e-01,1.735361e-01,8.610319e-01,1.360989e-01,5.025533e-01,5.292365e-01,4.964461e-01,7.309130e-01
2019-07-02,Firm1,4.128258e-01,1.339008e-01,3.530394e-02,5.293509e-01,3.608783e-01,6.647519e-01,2.898612e-01,5.632466e-01,5.981161e-01,9.149318e-01
2019-07-02,Firm2,1.037654e-01,3.717925e-01,4.876283e-01,5.852448e-01,4.689806e-01,2.508458e-01,7.243468e-02,3.510882e-01,8.290331e-01,7.808357e-01
2019-07-02,Firm3,8.443163e-01,5.408783e-01,8.278920e-01,8.454836e-01,7.331165e-02,4.167235e-01,6.187155e-01,6.114338e-01,2.299935e-01,5.206390e-01
2019-07-03,Firm1,2.281612e-01,2.660087e-02,3.809895e-01,8.032823e-01,2.492683e-03,9.600432e-02,5.059484e-01,1.795972e-01,2.174838e-01,3.578077e-01
2019-07-03,Firm2,2.403236e-01,1.497736e-01,7.357259e-01,2.501746e-01,2.826287e-01,3.335158e-01,7.742914e-01,1.773830e-01,8.407694e-01,7.466135e-01
2019-07-03,Firm3,8.806318e-01,1.164414e-01,6.791358e-01,4.752967e-01,3.695451e-01,9.728813e-01,3.553896e-01,2.559315e-01,6.942147e-01,2.701471e-01
2019-07-04,Firm1,2.153168e-01,5.169252e-01,5.136280e-01,7.517068e-01,1.977217e-01,7.221689e-01,5.877799e-01,9.099813e-02,9.073012e-03,5.946624e-01
2019-07-04,Firm2,8.275230e-01,9.725115e-01,5.218725e-03,7.728741e-01,4.371698e-01,3.593862e-02,3.448388e-01,7.443235e-01,2.606604e-01,9.888835e-02
2019-07-04,Firm3,8.599242e-01,8.336458e-01,1.451350e-01,9.777518e-02,3.335788e-01,1.117006e-01,9.105203e-01,3.478112e-01,8.948065e-01,3.105299e-01