How do I combine multiple pandas dataframes into an HDF5 object under one key/group?

Views: 366 | Posted: 2020/5/24 4:04:03 | Tags: pandas, hdf5, dask, pytables, hdfstore

Question

I am parsing data from a large 800 GB CSV file. For each line of data, I build a pandas dataframe.

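import csv
import pandas as pd

readcsvfile = csv.reader(csvfile)  # csvfile: the already opened CSV file
for i, line in enumerate(readcsvfile):  # enumerate supplies the row number i
    # parse the line into a dictionary of csv field:value pairs, "dictionary_line"
    # then wrap it in a one-row pandas dataframe
    df = pd.DataFrame(dictionary_line, index=[i])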

Now, I would like to save this in HDF5 format and query the h5 file as if it were the entire csv file.

import pandas as pd
store = pd.HDFStore("pathname/file.h5")

hdf5_key = "single_key"

csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"]

My approach so far has been:

import csv
import pandas as pd

store = pd.HDFStore("pathname/file.h5")

hdf5_key = "single_key"

csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"]
readcsvfile = csv.reader(csvfile)  # csvfile: the already opened CSV file
for i, line in enumerate(readcsvfile):  # enumerate supplies the row number i
    # parse the line into a dictionary of csv field:value pairs, "dictionary_line"
    # then wrap it in a one-row pandas dataframe
    df = pd.DataFrame(dictionary_line, index=[i])
    # append the one-row dataframe to the table stored under hdf5_key
    store.append(hdf5_key, df, data_columns=csv_columns, index=False)

That is, I try to append each dataframe df to the HDF5 file under a single key. However, this fails with:

  Attribute 'superblocksize' does not exist in node: '/hdf5_key/_i_table/index'

So, I could try to save everything into one pandas dataframe first, i.e.

import csv
import pandas as pd

store = pd.HDFStore("pathname/file.h5")

hdf5_key = "single_key"

csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"]
readcsvfile = csv.reader(csvfile)  # csvfile: the already opened CSV file
total_df = pd.DataFrame()
for i, line in enumerate(readcsvfile):  # enumerate supplies the row number i
    # parse the line into a dictionary of csv field:value pairs, "dictionary_line"
    # then wrap it in a one-row pandas dataframe
    df = pd.DataFrame(dictionary_line, index=[i])
    total_df = pd.concat([total_df, df])   # grows one big dataframe holding the whole CSV

and then, after the loop, store it in HDF5 format:

store.append(hdf5_key, total_df, data_columns=csv_columns, index=False)

However, I don't think I have enough RAM/storage to hold all of the csv lines in total_df before converting it to HDF5 format.
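One way to keep memory bounded, sketched here rather than taken from the question, is to collect parsed rows in a small buffer and emit a dataframe every `batch_size` rows, so that only one batch is held in RAM at a time. The names `iter_batches`, `csv_path` and `batch_size` are hypothetical, and the sketch assumes the CSV has a header row that `csv.DictReader` can use for field names.

```python
import csv

import pandas as pd


def iter_batches(csv_path, batch_size=100_000):
    """Yield dataframes of at most `batch_size` rows, one batch at a time."""
    with open(csv_path, newline="") as csvfile:
        # DictReader maps each line to {field: value}; all values are strings.
        reader = csv.DictReader(csvfile)
        rows = []
        start = 0  # row number of the first buffered row
        for i, row in enumerate(reader):
            rows.append(row)
            if len(rows) == batch_size:
                # The index continues across batches, like the row numbers
                # used for the one-row dataframes in the question.
                yield pd.DataFrame(rows, index=range(start, i + 1))
                start = i + 1
                rows = []
        if rows:  # flush the final, possibly shorter, batch
            yield pd.DataFrame(rows, index=range(start, start + len(rows)))
```

Each yielded batch could then be passed to `store.append` in place of a one-row dataframe, which also reduces the number of append calls from one per line to one per batch.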

So, how do I append each "single-line" df into an HDF5 so that it ends up as one big dataframe (like the original csv)?

Here's a concrete example of a csv file with different data types:

order    start    end     value
1        1342     1357    category1
1        1459     1489    category7
1        1572     1601    category23
1        1587     1599    category2
1        1591     1639    category1
....
15       792      813     category13
15       892      913     category5
....
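
For a file shaped like this example, a chunked append under one key might look like the following. This is a minimal sketch under stated assumptions, not a confirmed fix for the error above: it assumes the file is comma-separated with a header row, that PyTables is installed, and that 32 characters is a safe upper bound for the strings in `value`. The function name `csv_to_hdf5` and its parameters are hypothetical.

```python
import pandas as pd


def csv_to_hdf5(csv_path, h5_path, key="single_key", chunksize=500_000):
    """Append a CSV to one HDF5 table in chunks; return the rows written."""
    total_rows = 0
    # mode="w" starts a fresh file and overwrites any existing one.
    with pd.HDFStore(h5_path, mode="w") as store:
        # read_csv with chunksize yields dataframes whose index continues
        # across chunks, so the stored table is numbered like the CSV rows.
        for chunk in pd.read_csv(csv_path, chunksize=chunksize):
            store.append(
                key,
                chunk,
                data_columns=["order", "start", "end", "value"],
                min_itemsize={"value": 32},  # assumed maximum string width
                index=False,  # defer indexing until all chunks are written
            )
            total_rows += len(chunk)
        # Build the PyTables index once, after the last append.
        store.create_table_index(
            key, columns=["order", "start", "end"], optlevel=9, kind="full"
        )
    return total_rows
```

Once written, the table can be queried on its data columns without loading it whole, for example with `store.select("single_key", where="start > 1500")`. Fixing `min_itemsize` up front matters because a later chunk containing a longer string than the first chunk would otherwise not fit the column width chosen at table creation.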
