Split rows according to text in two columns (Python, Pandas)

Question

This is my dataframe (it has many more letters and is about 35.5k rows long), where each '-' stands for some other relevant string. All the variables are strings, and ['C1','C2'] is the MultiIndex.

tmp

C1    C2     C3    C4    C5    Start    End     C8
A     1      -      -     -    12       14      -
A     2      -      -     -    1,4,7    3,6,10  -
A     3      -      -     -    16,19    17,21   -
A     4      -      -     -    22       24      -

I need it to become this (split every row that contains commas, keeping everything else unchanged):

C1    C2     C3    C4    C5    Start  End   C8   Appearance
A     1      -      -     -    12     14    -    1
A     2      -      -     -    1      3     -    1
A     2      -      -     -    4      6     -    2
A     2      -      -     -    7      10    -    3
A     3      -      -     -    16     17    -    1
A     3      -      -     -    19     21    -    2
A     4      -      -     -    22     24    -    1

I tried this script from "pandas: How do I split text in a column into multiple rows?":

s = tmp['Start'].str.split(',').apply(Series, 1).stack()
s.index = s.index.droplevel(-1)
s.name = 'Start'
del tmp['Start']
final = tmp.join(s)

But the result is much larger than it should be! I get thousands of repeated rows, and this is only from trying to split 'Start'. I can't even imagine doing this for both 'Start' and 'End' (every comma in 'Start' implies a comma in 'End').

Lengths:
tmp   = 35568
s     = 35676
final = 293408

Recommended answer

You can create a new DataFrame from s1 and s2 and then join it back to tmp. It is also better to use the parameter expand=True in str.split, and to delete multiple columns at once with drop:

To create the Appearance column, use groupby on the index together with cumcount.

s1 = tmp['Start'].str.split(',', expand=True).stack()
s1.index = s1.index.droplevel(-1)
s1.name = 'Start'

s2 = tmp['End'].str.split(',', expand=True).stack()
s2.index = s2.index.droplevel(-1)
s2.name = 'End'
tmp.drop(['Start', 'End'], inplace=True, axis=1)

df = pd.DataFrame({'s1':s1, 's2':s2}, index=s1.index)
final = tmp.join(df)

final['Appearance'] = final.groupby(final.index).cumcount() + 1
print (final)
  C1  C2 C3 C4 C5 C8  s1  s2  Appearance
0  A   1  -  -  -  -  12  14           1
1  A   2  -  -  -  -   1   3           1
1  A   2  -  -  -  -   4   6           2
1  A   2  -  -  -  -   7  10           3
2  A   3  -  -  -  -  16  17           1
2  A   3  -  -  -  -  19  21           2
3  A   4  -  -  -  -  22  24           1
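As an alternative to the split/stack/join steps above, the following is a minimal sketch using DataFrame.explode on both columns at once. It is not part of the original answer. It assumes pandas 1.3 or later (where explode accepts a list of columns) and that 'Start' and 'End' hold the same number of comma-separated pieces in every row, as the question states. The toy frame below keeps only a few of the question's columns.

```python
import pandas as pd

# Toy frame mirroring the question's layout (only a few columns kept for brevity).
tmp = pd.DataFrame({
    "C1": ["A", "A", "A", "A"],
    "C2": ["1", "2", "3", "4"],
    "Start": ["12", "1,4,7", "16,19", "22"],
    "End": ["14", "3,6,10", "17,21", "24"],
})

# Turn each comma-separated string into a list, then explode both columns in lockstep.
# explode raises a ValueError if the two lists in a row differ in length.
exploded = tmp.assign(
    Start=tmp["Start"].str.split(","),
    End=tmp["End"].str.split(","),
).explode(["Start", "End"])

# The original row label repeats after explode, so counting within it numbers the pieces.
exploded["Appearance"] = exploded.groupby(level=0).cumcount() + 1
final = exploded.reset_index(drop=True)
print(final)
```

With this approach the column names 'Start' and 'End' are kept as they are, and no join is needed, so a non-unique index cannot multiply rows.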

Edit (from the comments)

You can try reset_index first:

print (tmp)
      C3 C4 C5  Start     End C8
C1 C2                           
A  1   -  -  -     12      14  -
   2   -  -  -  1,4,7  3,6,10  -
   3   -  -  -  16,19   17,21  -
   4   -  -  -     22      24  -

tmp.reset_index(inplace=True)
print (tmp)
  C1  C2 C3 C4 C5  Start     End C8
0  A   1  -  -  -     12      14  -
1  A   2  -  -  -  1,4,7  3,6,10  -
2  A   3  -  -  -  16,19   17,21  -
3  A   4  -  -  -     22      24  -
