Pandas split on regex

Question

I have a pandas DataFrame with a column containing comma-delimited characteristics like so:

Shot - Wounded/Injured, Shot - Dead (murder, accidental, suicide), Suicide - Attempt, Murder/Suicide, Attempted Murder/Suicide (one variable unsuccessful), Institution/Group/Business, Mass Murder (4+ deceased victims excluding the subject/suspect/perpetrator , one location), Mass Shooting (4+ victims injured or killed excluding the subject/suspect

I would like to split this column into multiple dummy-variable columns, but cannot figure out how to start. I am trying to split the column like so:

df['incident_characteristics'].str.split(',', expand=True)

This doesn't work, however, because there are commas in the middle of descriptions. Instead, I need to split based on a regex match of a comma followed by a space and a capital letter. Can str.split take regex? If so, how is this done?

I think this Regex will do what I need:

,\s[A-Z]

Recommended answer

Yes, str.split supports regex. Given your requirement to

split based on a regex match of a comma followed by a space and a capital letter

you can use

df['incident_characteristics'].str.split(r'\s*,\s*(?=[A-Z])', expand=True)

See the regex demo.

Details

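  • \s*,\s* - a comma surrounded by 0+ whitespace characters
  • (?=[A-Z]) - a positive lookahead: match only if followed by an uppercase ASCII letter (the letter itself is not consumed, so it stays in the split result)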
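The difference between the pattern from the question and the lookahead version can be checked with Python's re module, which follows the same splitting rules. This is a minimal sketch; the sample string is a shortened version of the data in the question, and the variable names are illustrative only.

```python
import re

# Shortened sample taken from the question's data.
text = "Shot - Wounded/Injured, Suicide - Attempt, Murder/Suicide"

# The pattern from the question consumes the capital letter,
# so the first letter of each following item is lost.
consuming = re.split(r",\s[A-Z]", text)
print(consuming)
# ['Shot - Wounded/Injured', 'uicide - Attempt', 'urder/Suicide']

# The lookahead only checks for the capital letter without consuming it.
lookahead = re.split(r"\s*,\s*(?=[A-Z])", text)
print(lookahead)
# ['Shot - Wounded/Injured', 'Suicide - Attempt', 'Murder/Suicide']
```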

However, it seems you also don't want to match commas inside parentheses. In that case, add a (?![^()]*\)) negative lookahead, which fails the match if, immediately to the right of the current location, there are 0+ chars other than ( and ) followed by a ):

r'\s*,\s*(?=[A-Z])(?![^()]*\))'

This prevents matching commas before capitalized words inside parentheses, as long as those parentheses contain no nested parentheses.
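Putting it together, here is a sketch of how the parenthesis-aware pattern might be applied in pandas and turned into the dummy-variable columns the question asks for. The second row is a made-up value (not from the original data) added only to exercise the capitalized-word-inside-parentheses case, and the explode/get_dummies step is one possible approach, not something the answer itself prescribes. A multi-character pattern is treated as a regex by str.split by default.

```python
import pandas as pd

df = pd.DataFrame({
    "incident_characteristics": [
        # From the question's data (shortened).
        "Shot - Wounded/Injured, Shot - Dead (murder, accidental, suicide), Suicide - Attempt",
        # Hypothetical row: capitalized words inside parentheses.
        "Murder/Suicide, Drug involvement (Suspected, Confirmed)",
    ]
})

# Comma + optional whitespace, followed by a capital letter,
# but not when a closing parenthesis comes before any other parenthesis.
pattern = r"\s*,\s*(?=[A-Z])(?![^()]*\))"

# Each cell becomes a list of characteristics.
parts = df["incident_characteristics"].str.split(pattern)

# One row per (incident, characteristic), then one indicator column per
# characteristic, collapsed back to one row per incident.
dummies = pd.get_dummies(parts.explode()).groupby(level=0).max()
print(dummies.columns.tolist())
```

Passing expand=True to str.split instead returns the pieces as positional columns, as in the answer above; the list form is used here because it feeds directly into explode.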

See another regex demo.
