Concatenate all columns in a pandas dataframe
Question
I have multiple pandas dataframes that may have different numbers of columns, typically between 50 and 100. I need to create a final column that is simply all the columns concatenated: the string in the first row of the new column should be the concatenation of the strings in the first row of all the other columns. I wrote the loop below, but I feel there might be a better, more efficient way to do this. Any ideas on how to do this?
num_columns = df.columns.shape[0]
col_names = df.columns.values.tolist()
df.loc[:, 'merged'] = ""
for each_col_ind in range(num_columns):
    print('Concatenating', col_names[each_col_ind])
    df.loc[:, 'merged'] = df.loc[:, 'merged'] + df[col_names[each_col_ind]]
Recommended answer
Solution with sum. Here the output is float, so converting to int and then to str is necessary (the int cast only works if the values are numeric):
df['new'] = df.sum(axis=1).astype(int).astype(str)
Another solution uses apply with the join function, but it is the slowest:
df['new'] = df.apply(''.join, axis=1)
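As a minimal sketch of the apply/join approach, assuming a small dataframe whose columns all hold strings (the sample data is the same as in the timings, and the variable name merged is only illustrative):

```python
import pandas as pd

# Sample frame with string-only columns (illustrative data).
df = pd.DataFrame({"A": ["1", "2", "3"], "B": ["4", "5", "6"], "C": ["7", "8", "9"]})

# With axis=1, apply passes each row as a Series; "".join iterates over its values.
merged = df.apply("".join, axis=1)

print(merged.tolist())  # ['147', '258', '369']
```

Because "".join accepts only strings, this form raises a TypeError as soon as a row contains a non-string value.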
Finally, a very fast numpy solution: convert to a numpy array and then sum:
df['new'] = df.values.sum(axis=1)
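The numpy variant can be sketched as follows. This is an illustration under the assumption that every column holds strings; it uses to_numpy(dtype=object) rather than .values so that the array is explicitly an object array, which is a choice made for this sketch and not part of the original answer.

```python
import pandas as pd

# Sample frame with string-only columns (illustrative data).
df = pd.DataFrame({"A": ["1", "2", "3"], "B": ["4", "5", "6"], "C": ["7", "8", "9"]})

# On an object array, sum falls back to Python's + operator,
# which concatenates strings element by element along each row.
arr = df.to_numpy(dtype=object)
df["new"] = arr.sum(axis=1)

print(df["new"].tolist())  # ['147', '258', '369']
```

The speed comes from skipping the per-row Series construction that apply performs.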
Timings:
import pandas as pd
df = pd.DataFrame({'A': ['1', '2', '3'], 'B': ['4', '5', '6'], 'C': ['7', '8', '9']})
#[30000 rows x 3 columns]
df = pd.concat([df]*10000).reset_index(drop=True)
#print (df)
cols = list('ABC')
#not_a_robot solution
In [259]: %timeit df['concat'] = pd.Series(df[cols].fillna('').values.tolist()).str.join('')
100 loops, best of 3: 17.4 ms per loop
In [260]: %timeit df['new'] = df[cols].astype(str).apply(''.join, axis=1)
1 loop, best of 3: 386 ms per loop
In [261]: %timeit df['new1'] = df[cols].values.sum(axis=1)
100 loops, best of 3: 6.5 ms per loop
In [262]: %timeit df['new2'] = df[cols].astype(str).sum(axis=1).astype(int).astype(str)
10 loops, best of 3: 68.6 ms per loop
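To repeat a comparison like this outside IPython, one option is the standard-library timeit module. The sketch below is only an approximation of the session above: the dictionary name candidates and the labels are hypothetical, and absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame({"A": ["1", "2", "3"], "B": ["4", "5", "6"], "C": ["7", "8", "9"]})
df = pd.concat([df] * 10000).reset_index(drop=True)  # 30000 rows x 3 columns
cols = list("ABC")

# Each candidate builds the concatenated column without assigning it.
candidates = {
    "str.join": lambda: pd.Series(df[cols].fillna("").values.tolist()).str.join(""),
    "apply + join": lambda: df[cols].astype(str).apply("".join, axis=1),
    "numpy sum": lambda: df[cols].to_numpy(dtype=object).sum(axis=1),
}

# Best of 3 single runs per candidate, in seconds.
timings = {
    name: min(timeit.repeat(fn, number=1, repeat=3))
    for name, fn in candidates.items()
}

for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds * 1000:.1f} ms")
```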
EDIT: If the dtypes of some columns are not object (i.e. strings), cast them first with DataFrame.astype:
df['new'] = df.astype(str).values.sum(axis=1)
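One thing to watch with astype(str) is that missing values become the literal text 'nan' (or 'None'). A possible way around this, shown here as a sketch with made-up sample data, is to fill missing values with an empty string before casting; the fillna step is an addition for illustration and not part of the original answer.

```python
import pandas as pd

# Mixed dtypes: strings, integers, and a string column with a missing value.
df = pd.DataFrame({"A": ["a", "b"], "B": [1, 2], "C": ["x", None]})

# Replace missing values first, then cast everything to str and concatenate.
merged = df.fillna("").astype(str).to_numpy(dtype=object).sum(axis=1)

print(list(merged))  # ['a1x', 'b2']
```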