Split rows according to text in two columns (Python, Pandas)
Question
This is my dataframe (with many more letters and a length of ~35.5k rows), where each - stands for another relevant string. All the variables are strings, and ['C1','C2'] is the MultiIndex.
tmp
C1 C2 C3 C4 C5 Start End C8
A 1 - - - 12 14 -
A 2 - - - 1,4,7 3,6,10 -
A 3 - - - 16,19 17,21 -
A 4 - - - 22 24 -
I need it to become this (split every row that contains commas, keeping everything else unchanged):
C1 C2 C3 C4 C5 Start End C8 Appearance
A 1 - - - 12 14 - 1
A 2 - - - 1 3 - 1
A 2 - - - 4 6 - 2
A 2 - - - 7 10 - 3
A 3 - - - 16 17 - 1
A 3 - - - 19 21 - 2
A 4 - - - 22 24 - 1
I tried this script from "pandas: How do I split text in a column into multiple rows?":
s = tmp['Start'].str.split(',').apply(Series, 1).stack()
s.index = s.index.droplevel(-1)
s.name = 'Start'
del tmp['Start']
final = tmp.join(s)
But the result is much larger than it should be! I get thousands of repeats, and this is just from trying to split 'Start'. I can't even imagine doing this for both 'Start' and 'End' (every comma in 'Start' implies a comma in 'End').
Lengths:
tmp = 35568
s = 35676
final = 293408
Recommended answer
You can create a new df from s1 and s2 and then join it to the original. It is also better to use the parameter expand=True in str.split, and to delete multiple columns with drop.
To create the column Appearance, use groupby by index with cumcount.
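# Split each comma-separated string into columns, then stack them into one long Series
s1 = tmp['Start'].str.split(',', expand=True).stack()
# Drop the inner level added by stack() so the index lines up with tmp again
s1.index = s1.index.droplevel(-1)
s1.name = 'Start'
s2 = tmp['End'].str.split(',', expand=True).stack()
s2.index = s2.index.droplevel(-1)
s2.name = 'End'
# Remove the original, unsplit columns
tmp.drop(['Start', 'End'], inplace=True, axis=1)
# The dict keys 's1' and 's2' become the column names in the output
df = pd.DataFrame({'s1':s1, 's2':s2}, index=s1.index)
final = tmp.join(df)
# Number the rows that came from the same original row, starting at 1
final['Appearance'] = final.groupby(final.index).cumcount() + 1
print(final)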
C1 C2 C3 C4 C5 C8 s1 s2 Appearance
0 A 1 - - - - 12 14 1
1 A 2 - - - - 1 3 1
1 A 2 - - - - 4 6 2
1 A 2 - - - - 7 10 3
2 A 3 - - - - 16 17 1
2 A 3 - - - - 19 21 2
3 A 4 - - - - 22 24 1
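As an alternative to the split/stack/join steps above, newer pandas versions can split and expand both columns in one pass with DataFrame.explode. This is a different technique from the one in the answer, shown only as a minimal sketch. It assumes pandas 1.3 or later (needed to explode several columns at once), a small made-up sample frame, and that 'Start' and 'End' always contain the same number of comma-separated values in each row.

```python
import pandas as pd

# Sample data mirroring the question; '-' stands in for other strings.
tmp = pd.DataFrame({
    'C1': ['A', 'A', 'A', 'A'],
    'C2': ['1', '2', '3', '4'],
    'C3': ['-'] * 4,
    'Start': ['12', '1,4,7', '16,19', '22'],
    'End': ['14', '3,6,10', '17,21', '24'],
})

# Turn each comma-separated string into a list of strings.
lists = tmp.assign(
    Start=tmp['Start'].str.split(','),
    End=tmp['End'].str.split(','),
)

# Explode both list columns together (multi-column explode needs pandas >= 1.3);
# the lists in each row must have matching lengths.
final = lists.explode(['Start', 'End'])

# Rows that came from the same original row share an index label,
# so cumcount numbers them 1, 2, 3, ...
final['Appearance'] = final.groupby(level=0).cumcount() + 1
final = final.reset_index(drop=True)
print(final)
```

Unlike the answer's output, this sketch keeps the column names 'Start' and 'End' and leaves the values as strings.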
Edit (following the comments): you can try reset_index first:
print (tmp)
C3 C4 C5 Start End C8
C1 C2
A 1 - - - 12 14 -
2 - - - 1,4,7 3,6,10 -
3 - - - 16,19 17,21 -
4 - - - 22 24 -
tmp.reset_index(inplace=True)
print (tmp)
C1 C2 C3 C4 C5 Start End C8
0 A 1 - - - 12 14 -
1 A 2 - - - 1,4,7 3,6,10 -
2 A 3 - - - 16,19 17,21 -
3 A 4 - - - 22 24 -
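One plausible explanation for the 293408-row result in the question, and for why reset_index helps, is that join matches rows by index label: if a label occurs more than once on both sides, every matching pair is emitted. The source does not confirm that the real (C1, C2) pairs repeat, so treat that as an assumption. The sketch below uses hypothetical data and names (left, right, 'x') purely to illustrate the effect.

```python
import pandas as pd

# Hypothetical frame in which the index label 'x' is used by two different rows.
left = pd.DataFrame({'Other': ['-', '-']}, index=['x', 'x'])

# Split values for those two rows, all carrying the same label 'x'.
right = pd.Series(['1', '4', '7', '9'], index=['x', 'x', 'x', 'x'], name='Start')

# Each of the 2 left rows pairs with each of the 4 right rows: 8 rows in total.
blown_up = left.join(right)

# With a unique positional index (what reset_index gives), each split value
# can only match the row it came from.
left_unique = left.reset_index(drop=True)
right_unique = pd.Series(['1', '4', '7', '9'], index=[0, 0, 1, 1], name='Start')
aligned = left_unique.join(right_unique)

print(len(blown_up), len(aligned))
```

If the index in the real data is already unique, this effect does not apply and the cause of the extra rows lies elsewhere.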