Pandas split on regex
Question
I have a pandas DataFrame with a column containing comma-delimited characteristics, like so:
Shot - Wounded/Injured, Shot - Dead (murder, accidental, suicide), Suicide - Attempt, Murder/Suicide, Attempted Murder/Suicide (one variable unsuccessful), Institution/Group/Business, Mass Murder (4+ deceased victims excluding the subject/suspect/perpetrator , one location), Mass Shooting (4+ victims injured or killed excluding the subject/suspect
I would like to split this column into multiple dummy-variable columns, but cannot figure out how to start. I am trying to split the column like so:
df['incident_characteristics'].str.split(',', expand=True)
This doesn't work, however, because there are commas in the middle of some descriptions. Instead, I need to split on a regex match of a comma followed by a space and a capital letter. Can str.split take a regex? If so, how is this done?
I think this regex will do what I need:
,\s[A-Z]
Recommended answer
Yes, split supports regex. According to your requirements,
split based on a regex match of a comma followed by a space and a capital letter
you can use
df['incident_characteristics'].str.split(r'\s*,\s*(?=[A-Z])', expand=True)
See the regex demo.
Details
\s*,\s* - a comma surrounded by 0+ whitespace characters
(?=[A-Z]) - a positive lookahead: match only if followed by an uppercase ASCII letter (the letter itself is not consumed, so it stays in the split result)
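The pattern's behaviour can be checked on a plain string before applying it to the column. The sketch below uses Python's standard re.split as a stand-in for str.split, on a shortened sample taken from the question's data; the variable names are illustrative only.

```python
import re

# Shortened sample from the question; note the lowercase words after the
# commas inside the parentheses.
text = "Shot - Dead (murder, accidental, suicide), Suicide - Attempt, Murder/Suicide"

# A comma with optional surrounding whitespace, only when an uppercase
# ASCII letter follows. The lookahead does not consume that letter.
pattern = r'\s*,\s*(?=[A-Z])'

parts = re.split(pattern, text)
print(parts)
# ['Shot - Dead (murder, accidental, suicide)', 'Suicide - Attempt', 'Murder/Suicide']
```

The commas inside the parentheses are left alone here only because the words after them are lowercase.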
However, it seems you also don't want to match commas inside parentheses. To handle that, add a (?![^()]*\)) negative lookahead, which fails the match if, immediately to the right of the current location, there are 0+ chars other than ( and ) followed by a ):
r'\s*,\s*(?=[A-Z])(?![^()]*\))'
This prevents matching commas before capitalized words inside parentheses (as long as those parentheses contain no nested parentheses).
See another regex demo.
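As a rough end-to-end sketch, the extended pattern can be applied to a small DataFrame and the pieces turned into dummy columns. The sample rows are hypothetical (the second one has a capitalized word after a comma inside parentheses), and the explode/get_dummies step is one possible way to build the dummies, not something taken from the answer above.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "incident_characteristics": [
        "Shot - Dead (murder, accidental, suicide), Suicide - Attempt",
        "Gang involvement (Drive-by, Drug related), Shot - Wounded/Injured",
    ]
})

# Comma before an uppercase letter, but not when a ")" follows with no
# "(" or ")" in between (i.e. not inside a pair of parentheses).
pattern = r'\s*,\s*(?=[A-Z])(?![^()]*\))'

# A multi-character pattern is treated as a regular expression by str.split.
split_cols = df["incident_characteristics"].str.split(pattern, expand=True)

# One possible route to dummy columns: one row per characteristic,
# one-hot encode, then collapse back to one row per original record.
pieces = df["incident_characteristics"].str.split(pattern).explode()
dummies = pd.get_dummies(pieces).groupby(level=0).max()

print(split_cols)
print(dummies)
```

With this data, each row splits into two pieces, and the comma in "(Drive-by, Drug related)" is not treated as a separator.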