Join two large files by column in Python

Problem Description

I have 2 files, each with 38374732 lines and a size of 3.3 GB. I am trying to join them on the first column. To do so, I decided to use pandas with the following code, pulled from Stack Overflow:

import pandas as pd
import sys

# Load the lookup file fully; it must be available for every chunk
b = pd.read_csv(sys.argv[2], sep='\t', encoding="utf-8-sig")

# Stream the first file in chunks instead of loading it whole
chunksize = 10 ** 6
for i, chunk in enumerate(pd.read_csv(sys.argv[1], sep='\t',
                                      encoding="utf-8-sig",
                                      chunksize=chunksize)):
    merged = chunk.merge(b, on='Bin_ID')
    # Write the header once, then append so earlier chunks are kept
    merged.to_csv("output.csv", index=False, sep='\t',
                  mode='w' if i == 0 else 'a', header=(i == 0))

However, I am getting a memory error (not surprising). I looked at chunked reading in pandas (something like How to read a 6 GB csv file with pandas), but how do I implement it for two files in a loop? I don't think I can chunk the second file, since I need to look up the join column in the whole second file. Is there a way out of this?
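
One possible mitigation before reaching for another library (not from the original post): if only a few columns of the second file are actually needed, loading just those with usecols shrinks the in-memory lookup table, and the chunked loop over the first file stays unchanged. A minimal sketch, where 'Value' is a hypothetical column name:

import pandas as pd
import sys

# Load only the join key and the columns needed in the output;
# 'Bin_ID' comes from the question, 'Value' is a hypothetical placeholder
b = pd.read_csv(sys.argv[2], sep='\t', encoding="utf-8-sig",
                usecols=['Bin_ID', 'Value'])

Memory use for b drops roughly in proportion to the columns excluded, which may already be enough to avoid the MemoryError.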

Recommended Answer

This has already been discussed in other posts like the one you mentioned (this, this, or this).

As explained there, I would try using a dask dataframe to load the data and execute the merge, but depending on your PC you may still not be able to do it.

Minimal working example:

import dask.dataframe as dd

# Read the CSVs
df1 = dd.read_csv('data1.csv')
df2 = dd.read_csv('data2.csv')

# Merge them
df = dd.merge(df1, df2, on='Bin_ID').compute()

# Save the merged dataframe
df.to_csv('merged.csv', index=False)
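
Note that .compute() pulls the entire merged result into a single pandas DataFrame, which can itself exhaust memory on files this size. A variant that keeps the whole pipeline lazy and lets dask stream the output to disk, a sketch assuming the tab-separated files from the question (dask writes one CSV per partition when the filename contains '*'):

import dask.dataframe as dd

# Read the tab-separated files lazily
df1 = dd.read_csv('data1.csv', sep='\t')
df2 = dd.read_csv('data2.csv', sep='\t')

# Merge lazily and stream the result to disk, one file per partition,
# without ever materializing the full join in memory
merged = dd.merge(df1, df2, on='Bin_ID')
merged.to_csv('merged-*.csv', index=False, sep='\t')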
