Python pandas extracting hyphenated words from cells with phrases
Question
I have a DataFrame that contains phrases, and I want to extract only the compound words joined by a hyphen and place them in another DataFrame.
df=pd.DataFrame({'Phrases': ['Trail 1 Yellow-Green','Kim Jong-il was here', 'President Barack Obama', 'methyl-butane', 'Derp da-derp derp', 'Pok-e-mon'],})
Here is what I have so far:
import pandas as pd
df=pd.DataFrame({'Phrases': ['Trail 1 Yellow-Green','Kim Jong-il was here', 'President Barack Obama', 'methyl-butane', 'Derp da-derp derp', 'Pok-e-mon'],})
new = df['Phrases'].str.extract("(?P<part1>.*?)-(?P<part2>.*)")
Result:
>>> new
            part1        part2
0  Trail 1 Yellow        Green
1        Kim Jong  il was here
2             NaN          NaN
3          methyl       butane
4         Derp da    derp derp
5             Pok        e-mon
What I want is just the hyphenated word itself, so the result should be (note that Pok-e-mon comes out as NaN because it has two hyphens):
>>> new
    part1   part2
0  Yellow   Green
1    Jong      il
2     NaN     NaN
3  methyl  butane
4      da    derp
5     NaN     NaN
Recommended answer
You can use this regex:
(?:[^-\w]|^)(?P<part1>[a-zA-Z]+)-(?P<part2>[a-zA-Z]+)(?:[^-\w]|$)
(?:                # non-capturing group
  [^-\w]|^         # a character that is neither a hyphen nor a word character, or the start of the string
)
(?P<part1>
  [a-zA-Z]+        # one or more letters
)-(?P<part2>
  [a-zA-Z]+        # one or more letters
)
(?:[^-\w]|$)       # a character that is neither a hyphen nor a word character, or the end of the string
- Your first problem is that nothing prevents the . from consuming spaces. [a-zA-Z] matches only letters, so it avoids "jumping" from one word to another.
- For the Pok-e-mon case, you need to check that there isn't a hyphen right before or after your match.
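As a minimal sketch of how the regex above could be applied to the DataFrame from the question, the pattern can be passed to Series.str.extract. The variable name pattern is introduced here for illustration only; everything else comes from the question and the answer.

```python
import pandas as pd

df = pd.DataFrame({'Phrases': ['Trail 1 Yellow-Green', 'Kim Jong-il was here',
                               'President Barack Obama', 'methyl-butane',
                               'Derp da-derp derp', 'Pok-e-mon']})

# The two letter-only parts must be surrounded by a character that is
# neither a hyphen nor a word character, or by a string boundary.
# "pattern" is an illustrative name, not part of the original answer.
pattern = r"(?:[^-\w]|^)(?P<part1>[a-zA-Z]+)-(?P<part2>[a-zA-Z]+)(?:[^-\w]|$)"

# Named groups become the column names; rows without a match are filled with NaN.
new = df['Phrases'].str.extract(pattern)
print(new)
```

Note that str.extract keeps only the first match in each cell. If a phrase may contain several hyphenated words, Series.str.extractall may be a better fit, since it returns one row per match.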