Read multiple CSV files in Pandas in chunks
Problem description
How do I import and read multiple CSV files in chunks when the total size of all the files is around 20 GB?
I don't want to use Spark, as I want to use a model in SkLearn, so I want the solution in Pandas itself.
My code is:
allFiles = glob.glob(os.path.join(path, "*.csv"))
df = pd.concat((pd.read_csv(f,sep=",") for f in allFiles))
df.reset_index(drop=True, inplace=True)
But this fails because the total size of all the CSVs in my path is 17 GB.
I want to read it in chunks, but I get an error if I try it like this:
allFiles = glob.glob(os.path.join(path, "*.csv"))
df = pd.concat((pd.read_csv(f,sep=",",chunksize=10000) for f in allFiles))
df.reset_index(drop=True, inplace=True)
The error I get is:
"cannot concatenate object of type '<class 'pandas.io.parsers.TextFileReader'>'; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid"
Can anyone help?
Solution
One way to do this is to chunk the data with pd.read_csv(file, chunksize=chunksize); when the last chunk read from a file is shorter than the chunksize, save the leftover rows and prepend them to the first chunk of the next file, reading a correspondingly smaller first chunk from that file so the combined chunk equals the full chunksize.
import os

import pandas as pd

def chunk_from_files(dir, master_chunksize):
    '''
    Provided a directory, loops through files and chunks out dataframes.
    :param dir: Directory of csv files.
    :param master_chunksize: Number of rows per yielded chunk.
    :return: Yields dataframes of master_chunksize rows.
    '''
    extra_chunk = None  # Leftover rows carried over from the previous file.
    for file in os.listdir(dir):
        csv_file = os.path.join(dir, file)
        # iterator=True gives a TextFileReader whose get_chunk() takes a
        # per-call row count, so the first read of each file can be shrunk
        # to top up the leftover from the previous file.
        reader = pd.read_csv(csv_file, iterator=True)
        while True:
            size = master_chunksize if extra_chunk is None else master_chunksize - len(extra_chunk)
            try:
                chunk = reader.get_chunk(size)
            except StopIteration:
                break  # File exhausted; move on to the next one.
            if extra_chunk is not None:
                # Prepend the short tail of the previous file.
                chunk = pd.concat([extra_chunk, chunk], ignore_index=True)
                extra_chunk = None
            if len(chunk) < master_chunksize:
                # Short chunk: hold it back and top it up from the next file.
                extra_chunk = chunk
                break
            yield chunk
    if extra_chunk is not None:
        yield extra_chunk  # Whatever remains after the final file.
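Since the end goal is SkLearn, the chunks may not need to be concatenated at all: estimators that support partial_fit can learn incrementally, one chunk at a time. A hedged sketch with SGDClassifier on a hypothetical two-feature dataset (the column names and generated data are invented for illustration):

```python
import glob
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

# Hypothetical data: two CSVs with features x1, x2 and a binary label y.
path = tempfile.mkdtemp()
rng = np.random.default_rng(0)
for i in range(2):
    X = rng.normal(size=(250, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    pd.DataFrame({"x1": X[:, 0], "x2": X[:, 1], "y": y}).to_csv(
        os.path.join(path, f"part{i}.csv"), index=False
    )

clf = SGDClassifier(random_state=0)
for f in sorted(glob.glob(os.path.join(path, "*.csv"))):
    for chunk in pd.read_csv(f, chunksize=100):
        # classes is required on the first call; passing the same
        # array on every call is harmless.
        clf.partial_fit(chunk[["x1", "x2"]], chunk["y"], classes=np.array([0, 1]))
```

This keeps peak memory at one chunk regardless of total CSV size, which is the usual reason to chunk in the first place.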