How to solve error due to chunksize in pandas?

Question

I am trying to read a large CSV file and run some code over it. I am using chunksize to do so.

file = "./data.csv"
df = pd.read_csv(file, sep="/", header=0,iterator=True, chunksize=1000000, dtype=str)
print len(df.index)

I get the following error in the code:

AttributeError: 'TextFileReader' object has no attribute 'index'

How do I resolve this?

Answer

Those errors stem from the fact that your pd.read_csv call, in this case, does not return a DataFrame object. Instead, it returns a TextFileReader object, which is an iterator. This is, essentially, because when you set the iterator parameter to True, what is returned is not a DataFrame; it is an iterator of DataFrame objects, each the size of the integer passed to the chunksize parameter (in this case 1000000). Specific to your case, you can't just call df.index because an iterator object simply does not have an index attribute. This does not mean that you cannot access the DataFrames inside the iterator; it means you either have to loop through the iterator, working with one DataFrame at a time, or you have to concatenate all of those DataFrames into one giant one.
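
For example, if the intent behind len(df.index) was simply to get the total number of rows, one way to do that (a minimal sketch, assuming the same file and separator as in the question) is to iterate over the reader and sum the lengths of the chunks:

import pandas as pd

file = "./data.csv"
dfs = pd.read_csv(file, sep="/", header=0, iterator=True, chunksize=1000000, dtype=str)

# Each chunk is an ordinary DataFrame, so len(chunk.index) works on it;
# summing over the iterator gives the total row count of the file.
total_rows = sum(len(chunk.index) for chunk in dfs)
print(total_rows)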

If you are happy to work with one DataFrame at a time, then the following is what you would need to do to print the index of each chunk:

file = "./data.csv"
dfs = pd.read_csv(file, sep="/", header=0,iterator=True, chunksize=1000000, dtype=str)

for df in dfs:
    print(df.index)
    # do something
    df.to_csv('output_file.csv', mode='a', index=False)

This will save the DataFrames into an output file named output_file.csv. With the mode parameter set to 'a', each call appends to the file, so nothing should be overwritten.
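
One detail worth noting (a side note, not part of the original answer): to_csv writes the column header on every call by default, so appending chunk after chunk will repeat the header line for each chunk. A minimal sketch of one way to avoid that is to write the header only for the first chunk:

import pandas as pd

file = "./data.csv"
dfs = pd.read_csv(file, sep="/", header=0, iterator=True, chunksize=1000000, dtype=str)

for i, df in enumerate(dfs):
    # Only the first chunk writes the column names; later chunks append rows only.
    df.to_csv('output_file.csv', mode='a', index=False, header=(i == 0))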

However, if your goal is to concatenate all of the DataFrames into one giant DataFrame, then the following would perhaps be a better path:

file = "./data.csv"
dfs = pd.read_csv(file, sep="/", header=0,iterator=True, chunksize=1000000, dtype=str)

giant_df = pd.concat(dfs)

print(giant_df.index)

Since you are already using the iterator parameter here, I would assume that you are concerned about memory. If so, the first strategy is the better one, because pd.concat has to materialize every chunk in memory at once, whereas looping lets you take advantage of what iterators offer for memory management with large datasets.

I hope this proves useful.
