Python pandas extracting hyphenated words from cells with phrases

Views: 75 | Published: 2020/5/18 1:06:08 | Tags: python regex pandas nlp


Question

I have a dataframe that contains phrases. I want to extract only the compound words separated by a hyphen and place them in another dataframe.

df=pd.DataFrame({'Phrases': ['Trail 1 Yellow-Green','Kim Jong-il was here', 'President Barack Obama', 'methyl-butane', 'Derp da-derp derp', 'Pok-e-mon'],})

Here is what I have so far:

import pandas as pd

df=pd.DataFrame({'Phrases': ['Trail 1 Yellow-Green','Kim Jong-il was here', 'President Barack Obama', 'methyl-butane', 'Derp da-derp derp', 'Pok-e-mon'],})


new = df['Phrases'].str.extract("(?P<part1>.*?)-(?P<part2>.*)")

Result:

>>> new
            part1        part2
0  Trail 1 Yellow        Green
1        Kim Jong  il was here
2             NaN          NaN
3          methyl       butane
4         Derp da    derp derp
5             Pok        e-mon

What I want is just the hyphenated word, split into its two parts, so the result would be (note that Pok-e-mon appears as NaN because it contains two hyphens):

>>> new
            part1        part2
0          Yellow        Green
1             Jong          il
2             NaN          NaN
3          methyl       butane
4              da         derp
5             NaN          NaN

Recommended answer

You can use this regex:

(?:[^-\w]|^)(?P<part1>[a-zA-Z]+)-(?P<part2>[a-zA-Z]+)(?:[^-\w]|$)

(?:                 # non-capturing group
    [^-\w]|^        # a character that is neither a hyphen nor a word character, or the start of the string
)
(?P<part1>
    [a-zA-Z]+       # one or more letters
)-(?P<part2>
    [a-zA-Z]+       # one or more letters
)
(?:[^-\w]|$)        # a character that is neither a hyphen nor a word character, or the end of the string
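The commented layout above can be kept as-is in Python by compiling the pattern with re.VERBOSE, which ignores whitespace and # comments outside character classes. The following is a minimal sketch; the names HYPHENATED and split_compound are hypothetical helpers introduced only for illustration.

```python
import re

# Same pattern as above, written in verbose mode so the comments can stay inline.
HYPHENATED = re.compile(
    r"""
    (?:[^-\w]|^)            # start of string, or a char that is neither '-' nor a word char
    (?P<part1>[a-zA-Z]+)    # first part: one or more ASCII letters
    -                       # the single hyphen joining the two parts
    (?P<part2>[a-zA-Z]+)    # second part: one or more ASCII letters
    (?:[^-\w]|$)            # end of string, or a char that is neither '-' nor a word char
    """,
    re.VERBOSE,
)


def split_compound(phrase):
    """Return (part1, part2) for the first two-part hyphenated word, else None."""
    match = HYPHENATED.search(phrase)
    if match is None:
        return None
    return match.group("part1"), match.group("part2")


print(split_compound("Kim Jong-il was here"))
print(split_compound("Pok-e-mon"))
```

With this sketch, "Pok-e-mon" yields None because each candidate pair is bordered by another hyphen on one side.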

Your first problem is that nothing prevents the . from matching spaces. [a-zA-Z] matches only letters, so the match cannot "jump" from one word to another.
For the Pok-e-mon case, you need to check that there is no hyphen immediately before or after the match.
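As a minimal sketch of how this regex plugs into the code from the question, the pattern can be passed to Series.str.extract unchanged. The named groups become the column names, and the non-capturing groups add no columns. The variable name pattern is introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Phrases': ['Trail 1 Yellow-Green', 'Kim Jong-il was here',
                               'President Barack Obama', 'methyl-butane',
                               'Derp da-derp derp', 'Pok-e-mon']})

# Raw string so that \w reaches the regex engine untouched.
pattern = r"(?:[^-\w]|^)(?P<part1>[a-zA-Z]+)-(?P<part2>[a-zA-Z]+)(?:[^-\w]|$)"

# str.extract keeps the first match per row; rows without a match become NaN.
new = df['Phrases'].str.extract(pattern)
print(new)
```

Rows 2 and 5 come out as NaN in both columns, matching the desired result in the question. Note that str.extract returns only the first match in each cell, so a phrase containing several hyphenated words would need a different approach.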

See the demo here.
