连接 pandas 数据框中的所有列 [英] Concatenate all columns in a pandas dataframe

查看:84
本文介绍了连接 pandas 数据框中的所有列的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有多个pandas数据框,它们的列数可能不同,并且这些列的数量通常在50到100之间变化.我需要创建一个最终列,该列只是将所有列串联在一起的.基本上,该列第一行中的字符串应该是所有列第一行中的字符串的总和(并置).我在下面编写了循环,但我觉得可能有更好的更有效的方法来执行此操作.有关如何执行此操作的任何想法

I have multiple pandas dataframe which may have different number of columns and the number of these columns typically vary from 50 to 100. I need to create a final column that is simply all the columns concatenated. Basically the string in the first row of the column should be the sum(concatenation) of the strings on the first row of all the columns. I wrote the loop below but I feel there might be a better more efficient way to do this. Any ideas on how to do this

num_columns = df.columns.shape[0]
col_names = df.columns.values.tolist()
df.loc[:, 'merged'] = ""
for each_col_ind in range(num_columns):
    print('Concatenating', col_names[each_col_ind])
    df.loc[:, 'merged'] = df.loc[:, 'merged'] + df[col_names[each_col_ind]]

推荐答案

使用 sum ,但输出为float,因此必须转换为intstr:

Solution with sum, but output is float, so convert to int and str is necessary:

df['new'] = df.sum(axis=1).astype(int).astype(str)

具有apply函数join的另一种解决方案,但它是最慢的:

Another solution with apply function join, but it the slowiest:

df['new'] = df.apply(''.join, axis=1)

最后非常快的numpy solution-转换为numpy array,然后总和" :

Last very fast numpy solution - convert to numpy array and then 'sum':

df['new'] = df.values.sum(axis=1)

时间:

df = pd.DataFrame({'A': ['1', '2', '3'], 'B': ['4', '5', '6'], 'C': ['7', '8', '9']})
#[30000 rows x 3 columns]
df = pd.concat([df]*10000).reset_index(drop=True)
#print (df)

cols = list('ABC')

#not_a_robot solution
In [259]: %timeit df['concat'] = pd.Series(df[cols].fillna('').values.tolist()).str.join('')
100 loops, best of 3: 17.4 ms per loop

In [260]: %timeit df['new'] = df[cols].astype(str).apply(''.join, axis=1)
1 loop, best of 3: 386 ms per loop

In [261]: %timeit df['new1'] = df[cols].values.sum(axis=1)
100 loops, best of 3: 6.5 ms per loop

In [262]: %timeit df['new2'] = df[cols].astype(str).sum(axis=1).astype(int).astype(str)
10 loops, best of 3: 68.6 ms per loop

EDIT如果某些列的dtype不是object(显然是string),则由

EDIT If dtypes of some columns are not object (obviously strings) cast by DataFrame.astype:

df['new'] = df.astype(str).values.sum(axis=1)

这篇关于连接 pandas 数据框中的所有列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆