Concatenating data in a DataFrame


Question

The following code only gives me the data from the last file, so there is a problem with either the concat or the for loop. I am reading data from 2 files. Each should contribute nearly 350 rows and 3 columns that satisfy the condition in the for loop, so the final DataFrame should be roughly 700 by 3. However, it only shows data from the last file.

import glob
from pathlib import Path
import pandas as pd

path = Path(r'C:\Users\PC\Desktop\datafiles')
filenames = path.glob('*.txt')
toconcat = []
for i in filenames:
    data1 = pd.read_csv(i, sep="\t", header=None)
    data1.columns = ['number','ab','cd','as','sd','dfg']
    dataset1 = pd.DataFrame(data1.loc[data1.number==1,['number','ab','cd']])
    toconcat.append(dataset1)

result = pd.concat(toconcat)
result

But when I call result.shape, it shows 700 by 3. What is the issue here?

Recommended answer

I created an even "wider" example that also passes the keys parameter (the "origin marker").

Source file Input_1.txt

1   ab1 cd1 as1 sd1 dfg1
1   ab2 cd2 as2 sd2 dfg2
1   ab3 cd3 as3 sd3 dfg3
2   ab4 cd4 as4 sd4 dfg4

Source file Input_2.txt

1   ab5 cd5 as5 sd5 dfg5
1   ab6 cd6 as6 sd6 dfg6
1   ab7 cd7 as7 sd7 dfg7
2   ab8 cd8 as8 sd8 dfg8

(Both of the above files are tab-separated.)

And the code:

from pathlib import Path
import pandas as pd

toconcat = []
keys = []
path = Path(r'C:\Users\...')  # Replace dots with your path
filenames = path.glob('*.txt')
for i in filenames:
    data1 = pd.read_csv(i, sep='\t', names=['number', 'ab', 'cd', 'as', 'sd', 'dfg'])
    dataset1 = data1.loc[data1.number==1, ['number', 'ab', 'cd']]
    toconcat.append(dataset1)
    keys.append(i.stem)
result = pd.concat(toconcat, keys=keys)
print(result)

Note that the column names can be passed as early as in read_csv (as I did).
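As a minimal sketch of that point, the snippet below reads the same tab-separated text twice: once the way the question does it (header=None, then assigning columns) and once with names passed to read_csv. The in-memory text and the variable names df_a, df_b and same are assumptions made for illustration only.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for a tab-separated input file.
raw = "1\tab1\tcd1\tas1\tsd1\tdfg1\n2\tab4\tcd4\tas4\tsd4\tdfg4\n"
cols = ['number', 'ab', 'cd', 'as', 'sd', 'dfg']

# Variant from the question: read without a header, then assign the names.
df_a = pd.read_csv(io.StringIO(raw), sep='\t', header=None)
df_a.columns = cols

# Variant from the answer: pass the names directly to read_csv.
df_b = pd.read_csv(io.StringIO(raw), sep='\t', names=cols)

# Both variants should produce the same DataFrame.
same = df_a.equals(df_b)
print(same)
```

Both forms should be interchangeable here; passing names simply saves one statement.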

The result, for my input files, is:

           number   ab   cd
Input_1 0       1  ab1  cd1
        1       1  ab2  cd2
        2       1  ab3  cd3
Input_2 0       1  ab5  cd5
        1       1  ab6  cd6
        2       1  ab7  cd7

So your code looks OK. My code differs in only one detail: the result has a MultiIndex whose top level shows the origin of each row, which makes it easier to trace what has been going on.
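One way to use that top index level for tracing is to count the rows that each source contributed. The sketch below rebuilds the two sample files in memory (an assumption, so that it runs without any files on disk) and then groups by the first index level; the names sources and rows_per_source are hypothetical.

```python
import io
import pandas as pd

cols = ['number', 'ab', 'cd', 'as', 'sd', 'dfg']

# Hypothetical in-memory copies of the two sample files, keyed by file stem.
sources = {
    'Input_1': "1\tab1\tcd1\tas1\tsd1\tdfg1\n"
               "1\tab2\tcd2\tas2\tsd2\tdfg2\n"
               "1\tab3\tcd3\tas3\tsd3\tdfg3\n"
               "2\tab4\tcd4\tas4\tsd4\tdfg4\n",
    'Input_2': "1\tab5\tcd5\tas5\tsd5\tdfg5\n"
               "1\tab6\tcd6\tas6\tsd6\tdfg6\n"
               "1\tab7\tcd7\tas7\tsd7\tdfg7\n"
               "2\tab8\tcd8\tas8\tsd8\tdfg8\n",
}

toconcat = []
keys = []
for name, text in sources.items():
    data1 = pd.read_csv(io.StringIO(text), sep='\t', names=cols)
    # Keep only rows with number == 1 and the first three columns.
    toconcat.append(data1.loc[data1.number == 1, ['number', 'ab', 'cd']])
    keys.append(name)
result = pd.concat(toconcat, keys=keys)

# Number of rows contributed by each source (top level of the MultiIndex).
rows_per_source = result.groupby(level=0).size()
print(rows_per_source)

# All rows that came from one particular source.
print(result.loc['Input_2'])
```

If one of the real files contributed no rows, its count would be missing here, which points at the filter or the file contents rather than at concat.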

First try just my code with my input files; the result should be just like mine.

Then replace one of my files with one of yours and run the code. Next, replace my second file with your second file and run the code again.

Finally, delete the keys parameter to get an ordinary index in the result.
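Without keys, concat keeps the row labels of each input frame, so the same labels can appear more than once in the result. The sketch below shows this with two small hypothetical frames (part1 and part2 stand in for the per-file frames). It also shows ignore_index=True, which the answer does not mention and which is included here only as an optional way to get a fresh 0..n-1 index.

```python
import pandas as pd

# Hypothetical stand-ins for the per-file frames collected in the loop.
part1 = pd.DataFrame({'number': [1, 1], 'ab': ['ab1', 'ab2']})
part2 = pd.DataFrame({'number': [1, 1], 'ab': ['ab5', 'ab6']})

# Plain concat: the original row labels of each frame are kept, so they repeat.
plain = pd.concat([part1, part2])

# Optional alternative (not part of the answer): renumber the rows from 0.
renumbered = pd.concat([part1, part2], ignore_index=True)

print(list(plain.index))
print(list(renumbered.index))
```

With repeated labels, a lookup such as plain.loc[0] returns one row per input frame, which can be surprising when inspecting the result.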

The source of your error is probably somewhere else.

By the way, you don't need import glob, because you only use the glob method of path.
