How to merge a single data column from multiple csv files into one with Pandas?

Problem description

I'm trying to merge a single data column from 40 nearly identical csv files with Pandas. The files contain information about Windows processes in csv form, generated by the Windows tasklist command.

What I want to do is merge the memory information from these files into a single file, using the PID as the key. However, some insignificant processes appear at random every now and then and cause inconsistencies among the csv files: some files might have 65 rows and others 75. Those random processes are not significant, their changing PIDs should not matter, and they should be dropped when merging the files.

This is how I first tried to do it:

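# CSV files have the following columns:
# Image Name, PID, Session Name, Session #, Mem Usage

import pandas as pd

# First file: drop Session Name and Session #, keep Image Name, PID, Mem Usage
file1 = pd.read_csv("tasklist1.txt")
file1 = file1.drop(file1.columns[[2, 3]], axis=1)

for i in range(2, 41):
    filename = "tasklist" + str(i) + ".txt"

    # Remaining files: keep only PID and Mem Usage
    filei = pd.read_csv(filename)
    filei = filei.drop(filei.columns[[0, 2, 3]], axis=1)

    # Inner join on PID; every file contributes a column named "Mem Usage"
    file1 = file1.merge(filei, on='PID')

file1.to_csv("Final.txt", index=False)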

From the first csv file I drop only the Session Name and Session # columns, keeping the Image Name column as a label for each row. From each of the following csv files I keep only the PID and Mem Usage columns and try to merge the ever-growing merged table with the data from the next file.

The problem is that when the loop reaches the 5th iteration, it cannot merge the files anymore because I get the "Reindexing only valid with uniquely valued Index objects" error.

So I can merge the 1st file with the 2nd to 4th inside the first loop. If I then create a second loop where I merge the 5th file with the 6th to 8th, and then merge these two merged results together, all the data from files 1 to 8 is merged just fine.

Any suggestions on how to perform this kind of chained merge without creating x additional loops? At this point I'm experimenting with 40 files and could actually brute-force the whole process with nested loops, but that isn't an effective way of merging in the first place, and it is unacceptable if I need to scale this to even more files.

Solution

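Duplicate column names cause this error: every file contributes a column called Mem Usage, so after a few merges the automatically suffixed column names are no longer unique.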
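The name collision can be seen on a toy example. The sketch below is only an illustration: the two small frames `base` and `other` are made up and stand in for the tasklist files, and it relies on `merge` applying its default `_x` / `_y` suffixes to overlapping column names.

```python
import pandas as pd

# Made-up stand-ins for the first file and for any later file
base = pd.DataFrame({
    "Image Name": ["System", "svchost.exe"],
    "PID": [4, 800],
    "Mem Usage": ["100 K", "200 K"],
})
other = pd.DataFrame({"PID": [4, 800], "Mem Usage": ["110 K", "210 K"]})

# First merge: "Mem Usage" exists on both sides, so default suffixes are added
step1 = base.merge(other, on="PID")
cols_step1 = step1.columns.tolist()
print(cols_step1)
# ['Image Name', 'PID', 'Mem Usage_x', 'Mem Usage_y']

# Second merge: no overlap this time, so the new column keeps its plain name
step2 = step1.merge(other, on="PID")
cols_step2 = step2.columns.tolist()
print(cols_step2)
# ['Image Name', 'PID', 'Mem Usage_x', 'Mem Usage_y', 'Mem Usage']

# A third merge would have to rename "Mem Usage" to "Mem Usage_x" again,
# which clashes with the column of that name already in the frame.
```

How pandas reacts once the names clash depends on the version, so the third merge is described in a comment rather than executed.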

So you can pass the suffixes parameter to merge. From the pandas documentation:

suffixes : 2-length sequence (tuple, list, ...)
    Suffix to apply to overlapping column names in the left and right side, respectively.

See also the "Overlapping value columns" section of the pandas merging documentation.
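A minimal sketch of how the suffixes idea might be applied to the loop from the question. Several things here are assumptions for illustration and not part of the original answer: the helper names `make_sample` and `merge_mem_usage`, the in-memory sample data used instead of the tasklistN.txt files, and the choice of an empty left suffix with a per-file right suffix such as `_2`, `_3`.

```python
import io
import pandas as pd

HEADER = "Image Name,PID,Session Name,Session #,Mem Usage\n"


def make_sample(i, extra_pid=None):
    """Build hypothetical csv text standing in for tasklist<i>.txt."""
    rows = [
        f"System,4,Services,0,{100 + i} K",
        f"svchost.exe,800,Services,0,{200 + i} K",
    ]
    if extra_pid is not None:
        # A "random" process that only shows up in some files
        rows.append(f"random.exe,{extra_pid},Console,1,50 K")
    return HEADER + "\n".join(rows) + "\n"


def merge_mem_usage(sources):
    """Merge the Mem Usage column of every source on PID (inner join)."""
    merged = pd.read_csv(sources[0])
    merged = merged.drop(columns=["Session Name", "Session #"])
    for i, src in enumerate(sources[1:], start=2):
        part = pd.read_csv(src)[["PID", "Mem Usage"]]
        # Left side keeps the plain name, right side gets a unique suffix,
        # so no two columns ever share a name.
        merged = merged.merge(part, on="PID", suffixes=("", f"_{i}"))
    return merged


# Six fake files; odd-numbered ones contain an extra process with its own PID
sources = [
    io.StringIO(make_sample(i, extra_pid=(9000 + i) if i % 2 else None))
    for i in range(1, 7)
]
result = merge_mem_usage(sources)
print(result.columns.tolist())
print(result)
```

With real files, `sources` would instead be a list of paths such as "tasklist1.txt" through "tasklist40.txt". Because `merge` defaults to an inner join, PIDs that are missing from any file drop out of the result, which matches the requirement in the question that the random processes be discarded.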
