How to read data in a Python dataframe without concatenating?


Problem description

I want to read the file f (file size: 85 GB) in chunks into a dataframe. The following code was suggested:

import pandas as pd

chunksize = 5
TextFileReader = pd.read_csv(f, chunksize=chunksize)

However, this code gives me a TextFileReader, not a dataframe. Also, because of the memory limit, I don't want to concatenate these chunks to convert the TextFileReader into a dataframe. Please advise.

Answer

As you are trying to process an 85 GB CSV file, reading all the data by splitting it into chunks and converting them into one dataframe is sure to hit the memory limit. You can try to solve this problem with a different approach: apply filtering operations to your data. For example, if there are 600 columns in your dataset and you are interested in only 50 of them, read just those 50 columns from the file; this way you will save a lot of memory. Process your rows as you read them. If you need to filter the data first, use a generator function: yield turns a function into a generator function, which means it won't do any work until you start looping over it.
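Here is a minimal sketch of both techniques, assuming a file named data.csv; the column names and the filter condition are hypothetical placeholders, not from the original question:

import csv
import pandas as pd

# Read only the columns you care about (column names are hypothetical)
wanted = ["col_a", "col_b", "col_c"]
reader = pd.read_csv("data.csv", usecols=wanted, chunksize=100000)

# A generator function: yield makes it lazy, so rows are filtered
# one at a time instead of being loaded into memory all at once
def filtered_rows(path):
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh):
            if row["col_a"] == "keep":  # hypothetical filter condition
                yield row

for row in filtered_rows("data.csv"):
    pass  # process each row here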

For more information about generator functions, see: Reading a huge .csv file

For efficient filtering, refer to: https://codereview.stackexchange.com/questions/88885/efficiently-filter-a-large-100gb-csv-file-v3

For processing a smaller dataset:

Method 1: convert the reader object directly into a dataframe:

full_data = pd.concat(TextFileReader, ignore_index=True)

It is necessary to pass the parameter ignore_index to concat, because otherwise each chunk keeps its own index and the indexes would be duplicated.
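A toy illustration of the difference (a sketch, not from the original answer):

import pandas as pd

df = pd.DataFrame({"x": range(4)})
chunks = [df.iloc[:2], df.iloc[2:].reset_index(drop=True)]

# Without ignore_index, each chunk keeps its own 0-based index
print(pd.concat(chunks).index.tolist())                     # [0, 1, 0, 1]
# With ignore_index=True, a fresh 0..n-1 index is created
print(pd.concat(chunks, ignore_index=True).index.tolist())  # [0, 1, 2, 3]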

Method 2: use an iterator or get_chunk to convert it into a dataframe.

By specifying a chunksize to read_csv, the return value will be an iterable object of type TextFileReader.

# get_chunk(3) returns the next 3 rows as a dataframe
df = TextFileReader.get_chunk(3)

# iterating over the reader yields the remaining chunks
for chunk in TextFileReader:
    print(chunk)

Source: http://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking

df = pd.DataFrame(TextFileReader.get_chunk(1))

This will convert one chunk into a dataframe.

To check the total number of chunks in TextFileReader:

number_of_chunks = 0
for chunk in TextFileReader:
    number_of_chunks += 1
# note: iterating exhausts the reader; call read_csv again to reuse it
print(number_of_chunks)

If the file is bigger, I won't recommend the second approach. For example, if the CSV file contains 100,000 records, then chunksize=5 will create 20,000 chunks.
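To answer the original question directly, here is a minimal sketch of processing the chunks one at a time without ever concatenating them; the file name, chunk size, and running-sum aggregation are hypothetical:

import pandas as pd

total = 0
# a larger chunksize keeps the number of chunks manageable
for chunk in pd.read_csv("data.csv", chunksize=1000000):
    # each chunk is an ordinary dataframe; aggregate it, then discard it
    total += chunk["col_a"].sum()  # hypothetical column name
print(total)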

