Merging Dataframe chunks in Pandas


Problem Description

I currently have a script that combines multiple CSV files into one. The script works fine, except that we run out of RAM very quickly once larger files start being used. This is a problem because the script runs on an AWS server, and running out of RAM means a server crash. Currently the file size limit is around 250 MB each, which limits us to 2 files; however, since the company I work for is in biotech and we use genetic sequencing files, the files can range in size from 17 MB up to around 700 MB depending on the experiment. My idea has been to load one dataframe into memory whole, then chunk the others and combine them iteratively, but this didn't work so well.

My dataframes are similar to this (they can vary in size, but some columns remain the same: "Mod", "AA" and "Nuc"):

+-----+-----+-----+-----+-----+-----+-----+-----+
| Mod | Nuc | AA  | 1_1 | 1_2 | 1_3 | 1_4 | 1_5 |
+-----+-----+-----+-----+-----+-----+-----+-----+
| 000 | ABC | ABC | 10  | 5   | 9   | 16  | 8   |
+-----+-----+-----+-----+-----+-----+-----+-----+
| 010 | CBA | CBA | 0   | 1   | 4   | 9   | 0   |
+-----+-----+-----+-----+-----+-----+-----+-----+

When combining the two frames I need them to merge on "Mod", "Nuc" and "AA", so that I have something similar to this:

+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| Mod | Nuc | AA  | 1_1 | 1_2 | 1_3 | 1_4 | 1_5 | 2_1 | 2_2 | 2_3 |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 000 | ABC | ABC | 10  | 5   | 9   | 16  | 8   | 5   | 29  | 0   |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 010 | CBA | CBA | 0   | 1   | 4   | 9   | 0   | 0   | 0   | 1   |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
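For reference, the three-key merge described above can be reproduced on small in-memory frames like this (the values are taken from the tables above; column names match the question):

```python
import pandas as pd

# Two frames sharing the key columns "Mod", "Nuc" and "AA";
# the measurement columns (1_*, 2_*) differ per file.
df1 = pd.DataFrame({
    "Mod": ["000", "010"],
    "Nuc": ["ABC", "CBA"],
    "AA":  ["ABC", "CBA"],
    "1_1": [10, 0],
    "1_2": [5, 1],
})
df2 = pd.DataFrame({
    "Mod": ["000", "010"],
    "Nuc": ["ABC", "CBA"],
    "AA":  ["ABC", "CBA"],
    "2_1": [5, 0],
    "2_2": [29, 0],
})

# Merging on the three shared key columns lines the measurement
# columns up side by side, one row per key combination.
merged = df1.merge(df2, on=["Mod", "Nuc", "AA"])
print(list(merged.columns))
# ['Mod', 'Nuc', 'AA', '1_1', '1_2', '2_1', '2_2']
```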

I already have code to change the names of the headers, so I'm not worried about that; however, when I use chunks I end up with something closer to this:

+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| Mod | Nuc | AA  | 1_1 | 1_2 | 1_3 | 1_4 | 1_5 | 2_1 | 2_2 | 2_3 | 3_1 | 3_2 | 3_3 |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 000 | ABC | ABC | 10  | 5   | 9   | 16  | 8   | 5   | 29  | 0   | NA  | NA  | NA  |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 010 | CBA | CBA | 0   | 1   | 4   | 9   | 0   | NA  | NA  | NA  | 0   | 0   | 1   |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+

Basically it treats each chunk as if it were a new file, not as part of the same one.

I know why it's doing that, but I'm not sure how to fix it. Right now my code for chunking is really simple:

    file = "tableFile/123456.txt"
    initDF = pd.read_csv(file, sep="\t", header=0)
    file2 = "tableFile/7891011.txt"
    for chunks in pd.read_csv(file2, sep="\t", chunksize=50000, header=0):
        initDF = initDF.merge(chunks, how='right', on=['Mod', "Nuc", "AA"])

As you can see it's pretty bare bones. As I said, I know why it's doing what it's doing, but I'm not experienced enough with Pandas or with dataframe joins to fix it, so any help would be much appreciated. I also couldn't find anything like this while searching Stack Overflow and Google.

Answer

The solution is to do it in chunks like you are, but to concat the output into a new DataFrame, like so:

import pandas as pd

file = "tableFile/123456.txt"
initDF = pd.read_csv(file, sep="\t", header=0)
file2 = "tableFile/7891011.txt"

amgPd = pd.DataFrame()

for chunks in pd.read_csv(file2, sep="\t", chunksize=50000, header=0):
    # Merge each chunk against the full left frame, then stack
    # the partial results instead of re-merging into initDF.
    amgPd = pd.concat([amgPd, initDF.merge(chunks, how='right', on=['Mod', 'Nuc', 'AA'])])
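A runnable sketch of the same idea on tiny synthetic data (the file contents below are made up to stand in for the real tab-separated files, and `chunksize=1` just forces multiple chunks). One variation worth noting: collecting the partial merges in a list and calling `pd.concat` once avoids re-copying the accumulated frame on every iteration.

```python
import io
import pandas as pd

# Hypothetical stand-ins for the two tab-separated input files.
left_txt = "Mod\tNuc\tAA\t1_1\n000\tABC\tABC\t10\n010\tCBA\tCBA\t0\n"
right_txt = "Mod\tNuc\tAA\t2_1\n000\tABC\tABC\t5\n010\tCBA\tCBA\t0\n"

# The left frame is small enough to hold in memory whole.
initDF = pd.read_csv(io.StringIO(left_txt), sep="\t")

# Merge each chunk of the right file against the full left frame,
# collect the partial results, and concatenate once at the end.
parts = []
for chunk in pd.read_csv(io.StringIO(right_txt), sep="\t", chunksize=1):
    parts.append(initDF.merge(chunk, how="right", on=["Mod", "Nuc", "AA"]))
amgPd = pd.concat(parts, ignore_index=True)

print(amgPd)
```

Because each chunk's rows are merged against the complete left frame before concatenation, the measurement columns land side by side under one set of headers rather than being treated as columns from a new file.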
