How do I combine large csv files in python?


Question

I have 18 csv files, each approximately 1.6 GB and each containing approximately 12 million rows. Each file represents one year's worth of data. I need to combine all of these files, extract data for certain geographies, and then analyse the time series. What is the best way to do this?

I have tried using pd.read_csv, but I hit a memory limit. I have tried including a chunksize argument, but this gives me a TextFileReader object and I don't know how to combine these into a dataframe. I have also tried pd.concat, but this does not work either.
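For reference, the TextFileReader returned when chunksize is set is simply an iterator of DataFrames, so pd.concat can stitch the chunks back together. The catch is that doing so materialises the whole dataset in memory, which is exactly what fails at this scale. A minimal sketch (the demo file built here is a hypothetical stand-in):

```python
import pandas as pd

# Build a small demo file (hypothetical) so the snippet is self-contained.
pd.DataFrame({"a": range(10), "b": range(10)}).to_csv("demo.csv", index=False)

# read_csv with chunksize returns a TextFileReader: an iterator of
# DataFrames. pd.concat accepts any iterable of DataFrames, so this
# works -- but it loads everything into memory at once, which is
# exactly what breaks for 18 files of ~1.6 GB each.
reader = pd.read_csv("demo.csv", chunksize=4)
df = pd.concat(reader, ignore_index=True)
print(len(df))  # 10
```

This is why the accepted answer below streams chunks to disk instead of concatenating them in memory.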

Recommended answer

Here is a memory-efficient way of using pandas to combine very large CSV files. The technique is to load a fixed number of rows (defined as CHUNK_SIZE) into memory per iteration until every file has been processed, appending each chunk to the output file in append mode. (This first version assumes the files have no header row; the headered case is handled below.)

import pandas as pd

CHUNK_SIZE = 50000
csv_file_list = ["file1.csv", "file2.csv", "file3.csv"]
output_file = "./result_merge/output.csv"

for csv_file_name in csv_file_list:
    # header=None: treat every row as data (the files have no header row)
    chunk_container = pd.read_csv(csv_file_name, header=None, chunksize=CHUNK_SIZE)
    for chunk in chunk_container:
        # header=False: don't write pandas' default integer column names
        chunk.to_csv(output_file, mode="a", index=False, header=False)

But if your files do contain header rows, the column names should appear only once in the output; a repeated header scattered through the combined file is unwanted. Note that skipping row 0 of the later input files is not enough: read_csv already consumes each file's header row as column names, and to_csv re-emits those names for every chunk by default. The fix is to write the header with the very first chunk only:

import pandas as pd

CHUNK_SIZE = 50000
csv_file_list = ["file1.csv", "file2.csv", "file3.csv"]
output_file = "./result_merge/output.csv"

first_chunk = True
for csv_file_name in csv_file_list:
    # read_csv consumes each file's header row as column names,
    # so no input rows need to be skipped manually
    chunk_container = pd.read_csv(csv_file_name, chunksize=CHUNK_SIZE)
    for chunk in chunk_container:
        # write the column names once, with the first chunk only
        chunk.to_csv(output_file, mode="a", index=False, header=first_chunk)
        first_chunk = False
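The question also asks about extracting data for certain geographies, and the same chunked pattern handles that: filter each chunk as it is read and append only the matching rows, so the full dataset never sits in memory and the filtered result is usually small enough to load whole for time-series analysis. The "region" column and its values below are hypothetical stand-ins, and a small demo file is built inline so the sketch runs on its own:

```python
import pandas as pd

CHUNK_SIZE = 4
combined_file = "combined_demo.csv"
filtered_file = "filtered_demo.csv"

# Small stand-in for the merged output; "region" is a hypothetical column.
pd.DataFrame({
    "region": ["London", "Leeds", "London", "Manchester", "Leeds", "London"],
    "value": [1, 2, 3, 4, 5, 6],
}).to_csv(combined_file, index=False)

regions_of_interest = {"London", "Manchester"}  # hypothetical geographies

first_chunk = True
for chunk in pd.read_csv(combined_file, chunksize=CHUNK_SIZE):
    # keep only the rows for the geographies of interest
    subset = chunk[chunk["region"].isin(regions_of_interest)]
    subset.to_csv(filtered_file, mode="a", index=False, header=first_chunk)
    first_chunk = False

# the filtered file is now small enough to load in one go
result = pd.read_csv(filtered_file)
print(len(result))  # 4
```

The same filter could equally be applied inside the merge loop above, avoiding the intermediate combined file entirely.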
