Slice/split string Series at various positions


Problem description

I'm looking to split a string Series at different points, depending on the length of certain substrings:

In [47]: df = pd.DataFrame(['group9class1', 'group10class2', 'group11class20'], columns=['group_class'])
In [48]: split_locations = df.group_class.str.rfind('class')
In [49]: split_locations
Out[49]: 
0    6
1    7
2    7
dtype: int64
In [50]: df
Out[50]: 
      group_class
0    group9class1
1   group10class2
2  group11class20

My output should look like this:

      group_class    group    class
0    group9class1   group9   class1
1   group10class2  group10   class2
2  group11class20  group11  class20

I half thought this might work:

In [56]: df.group_class.str[:split_locations]
Out[56]: 
0   NaN
1   NaN
2   NaN

How can I slice my strings by the variable locations in split_locations?

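Recommended answer

This works. Selecting with double brackets, df[['group_class']], returns a one-column DataFrame rather than a Series, so apply with axis=1 passes each row to the lambda as a Series whose name attribute is that row's index label. That label can then be used to index into the split_locations Series: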

In [119]:    
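# .iloc[0] takes the single value by position; plain [0] on a label-indexed Series is deprecated in recent pandas
df[['group_class']].apply(lambda x: pd.Series([x.str[split_locations[x.name]:].iloc[0], x.str[:split_locations[x.name]].iloc[0]]), axis=1)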
Out[119]:
         0        1
0   class1   group9
1   class2  group10
2  class20  group11
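
For comparison, the same positional split can be sketched with a plain list comprehension over the strings and their split positions, which avoids apply altogether. This is an illustrative alternative, not part of the original answer; the helper name split_at is hypothetical, and the sketch assumes every string contains 'class' (rfind returns -1 otherwise, which would split off only the last character).

```python
import pandas as pd

df = pd.DataFrame(
    ['group9class1', 'group10class2', 'group11class20'],
    columns=['group_class'],
)
split_locations = df.group_class.str.rfind('class')


def split_at(strings, positions):
    """Split each string at its own position (hypothetical helper)."""
    return [(s[:p], s[p:]) for s, p in zip(strings, positions)]


# Build the two new columns and align them with the original index.
parts = pd.DataFrame(
    split_at(df.group_class, split_locations),
    columns=['group', 'class'],
    index=df.index,
)
result = df.join(parts)
print(result)
```

Here the columns come out in the order asked for in the question (group first, then class), and they are joined back onto the original frame.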

Or, as @ajcr has suggested, you can use str.extract:

In [106]:

df['group_class'].str.extract(r'(?P<group>group[0-9]+)(?P<class>class[0-9]+)')
Out[106]:
     group    class
0   group9   class1
1  group10   class2
2  group11  class20

Edit

Regex explanation:

The regex came from @ajcr (thanks!). It uses str.extract to extract capture groups, and each group becomes a new column.

Here ?P<group> gives a name to the capture group, and that name becomes the column label. If the name is omitted, the column is labelled with an integer instead.
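
The effect of naming the groups can be seen by running the same pattern with and without the ?P<...> names. This is a small sketch on the question's data; the variable names named and unnamed are only for illustration.

```python
import pandas as pd

s = pd.Series(['group9class1', 'group10class2', 'group11class20'])

# Named groups: the group names become the column labels.
named = s.str.extract(r'(?P<group>group[0-9]+)(?P<class>class[0-9]+)')

# Unnamed groups: the columns are labelled by group position instead.
unnamed = s.str.extract(r'(group[0-9]+)(class[0-9]+)')

print(list(named.columns))
print(list(unnamed.columns))
```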

The rest should be self-explanatory: group[0-9]+ matches the literal string group followed by one or more digits. The square brackets denote a character class, here the range 0-9, and + means one or more repetitions. For data like this it is equivalent to group\d+, where \d means digit.

So it can be rewritten as:

df['group_class'].str.extract(r'(?P<group>group\d+)(?P<class>class\d+)')
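
To get the exact layout asked for in the question, the extracted columns can be joined back onto the original frame. This is a minimal sketch assuming every row matches the pattern; rows that do not match would get NaN in both new columns.

```python
import pandas as pd

df = pd.DataFrame(
    ['group9class1', 'group10class2', 'group11class20'],
    columns=['group_class'],
)

# Extract the two named groups, then attach them as new columns.
extracted = df['group_class'].str.extract(r'(?P<group>group\d+)(?P<class>class\d+)')
result = df.join(extracted)
print(result)
```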
