Trying to size down HDF5 File by changing index field types using h5py


Problem Description

I have a very large CSV file (~12 GB) that looks something like this:

posX,posY,posZ,eventID,parentID,clockTime
-117.9853515625,60.2998046875,0.29499998688697815,0,0,0
-117.9853515625,60.32909393310547,0.29499998688697815,0,0,0
-117.9560546875,60.2998046875,0.29499998688697815,0,0,0
-117.9560546875,60.32909393310547,0.29499998688697815,0,0,0
-117.92676544189453,60.2998046875,0.29499998688697815,0,0,0
-117.92676544189453,60.32909393310547,0.29499998688697815,0,0,0
-118.04051208496094,60.34012985229492,4.474999904632568,0,0,0
-118.04051208496094,60.36941909790039,4.474999904632568,0,0,0
-118.04051208496094,60.39870834350586,4.474999904632568,0,0,0

I want to convert this CSV file into the HDF5 format using the library h5py, while also lowering the total file size by setting the field / index types, e.g.:

Save posX, posY and posZ as float32. Save eventID, parentID and clockTime as int32 or something along those lines.

Note: I need to chunk the data in some form when I read it in to avoid Memory Errors.

However, I am unable to get the desired result. What I have tried so far: using Pandas' own methods following this guide: How to write a large csv file to hdf5 in python? This creates the file, but I'm somehow unable to change the types and the file remains too big (~10.7 GB). The field types are float64 and int64.
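
For reference, here is a minimal sketch (not the poster's code; the file names and chunk size are placeholders) of how that pandas route can be combined with chunked reading and explicit dtypes, so the smaller types are set before anything is written. Note that to_hdf with format='table' requires the PyTables package to be installed:

import pandas as pd

CSV_PATH = "myfile.csv"   # placeholder input path
H5_PATH = "myfile.h5"     # placeholder output path

# Requested column types: float32 for the coordinates, int32 for the rest
col_types = {
    "posX": "float32", "posY": "float32", "posZ": "float32",
    "eventID": "int32", "parentID": "int32", "clockTime": "int32",
}

# chunksize bounds memory use; each chunk is a DataFrame of at most 1,000,000 rows
for chunk in pd.read_csv(CSV_PATH, dtype=col_types, chunksize=1_000_000):
    chunk.to_hdf(H5_PATH, key="pct_data", mode="a", format="table", append=True)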

I also tried to split the CSV up into parts (using split -n x myfile.csv) before working with the increments. I ran into some data errors at the beginning and end of each file, which I was able to fix by removing said lines using sed. Then I tried out the following code:

import pandas as pd
import h5py

PATH_csv = "/home/MYNAME/Documents/Workfolder/xaa" #xaa is my csv increment
DATA_csv = pd.read_csv(PATH_csv)

with h5py.File("pct_data-hdf5.h5", "a") as DATA_hdf:
    dset = DATA_hdf.create_dataset("posX", data=DATA_csv["posX"], dtype="float32")

Sadly this created the file and the table but didn't write any data into it.

Expectation: Creating an HDF5 file containing the data of a large CSV file while also changing the variable type of each index.

If anything is unclear, please ask me to clarify. I am still a beginner!

Recommended Answer

Have you considered the numpy module? It has a handy function (genfromtxt) to read CSV data with headers into a Numpy array. You define the dtype. The array is suitable for loading into HDF5 with the h5py.create_dataset() function.

See code below. I included 2 print statements. The first shows the dtype names created from the CSV headers. The second shows how you can access the data in the numpy array by field (column) name.

import h5py
import numpy as np

PATH_csv = 'SO_55576601.csv'
# One format code per CSV column: 'f8' = float64, 'i4' = int32.
# Use 'f4' instead of 'f8' to store posX/posY/posZ as float32 and shrink the file further.
csv_dtype = ('f8', 'f8', 'f8', 'i4', 'i4', 'i4')

# names=True takes the field names from the CSV header row
csv_data = np.genfromtxt(PATH_csv, dtype=csv_dtype, delimiter=',', names=True)

print(csv_data.dtype.names)   # field names created from the CSV headers
print(csv_data['posX'])       # access a column by field name

with h5py.File('SO_55576601.h5', 'w') as h5f:
    dset = h5f.create_dataset('CSV_data', data=csv_data)
# the with block closes the file automatically, so no explicit h5f.close() is needed
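
If the whole 12 GB CSV does not fit in memory at once, one possible extension of this idea (a rough sketch; the block size and output file name below are made up) is to read the file in fixed-size blocks and append them to a resizable h5py dataset with a compound dtype:

import h5py
import numpy as np
from itertools import islice

CSV_PATH = 'SO_55576601.csv'
H5_PATH = 'SO_55576601_blocks.h5'   # placeholder output name
BLOCK = 1_000_000                   # rows read per iteration

# Compound dtype with the sizes asked for in the question:
# 'f4' = float32 for the coordinates, 'i4' = int32 for the ID/time fields
csv_dtype = np.dtype([('posX', 'f4'), ('posY', 'f4'), ('posZ', 'f4'),
                      ('eventID', 'i4'), ('parentID', 'i4'), ('clockTime', 'i4')])

with open(CSV_PATH) as fin, h5py.File(H5_PATH, 'w') as h5f:
    fin.readline()  # skip the header row once
    # start with 0 rows; maxshape=(None,) lets the dataset grow as blocks arrive
    dset = h5f.create_dataset('CSV_data', shape=(0,), maxshape=(None,),
                              dtype=csv_dtype)
    while True:
        lines = list(islice(fin, BLOCK))
        if not lines:
            break
        # parse this block of text lines into a structured array
        block = np.atleast_1d(np.genfromtxt(lines, dtype=csv_dtype, delimiter=','))
        n = dset.shape[0]
        dset.resize((n + block.shape[0],))  # grow the dataset, then append the new rows
        dset[n:] = block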
