Extract substring from text in a pandas DataFrame as new column

Problem description

I have the following list of words that I want to look for:

word_list = ['one','three']

And I have a text column in a pandas DataFrame:

TEXT                                       |
-------------------------------------------|
"Perhaps she'll be the one for me."        |
"Is it two or one?"                        |
"Mayhaps it be three afterall..."          |
"Three times and it's a charm."            |
"One fish, two fish, red fish, blue fish." |
"There's only one cat in the hat."         |
"One does not simply code into pandas."    |
"Two nights later..."                      |
"Quoth the Raven... nevermore."            |

The desired output is shown below: the original text column is kept, and the word from word_list found in each row is extracted into a new column.

TEXT                                       | EXTRACT
-------------------------------------------|---------------
"Perhaps she'll be the one for me."        | one
"Is it two or one?"                        | one
"Mayhaps it be three afterall..."          | three
"Three times and it's a charm."            | three
"One fish, two fish, red fish, blue fish." | one
"There's only one cat in the hat."         | one
"One does not simply code into pandas."    | one
"Two nights later..."                      | 
"Quoth the Raven... nevermore."            |

Is there a way to do this in Python 2.7?

Recommended answer

Use str.extract:

import re

# Build the pattern "(one|three)" and extract the first match in each row
df['EXTRACT'] = df.TEXT.str.extract('({})'.format('|'.join(word_list)),
                        flags=re.IGNORECASE, expand=False).str.lower().fillna('')
df['EXTRACT']

0      one
1      one
2    three
3    three
4      one
5      one
6      one
7         
8         
Name: EXTRACT, dtype: object

Each word in word_list is joined with the regex alternation operator |, and the resulting pattern is passed to str.extract, which returns the first match in each row.
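
To make the pattern concrete, here is a minimal sketch of what the joined string looks like. The strict_pattern variant is not part of the original answer: it assumes you might want to escape regex metacharacters in word_list and to stop 'one' from matching inside a longer word such as 'someone'.

```python
import re

word_list = ['one', 'three']

# Pattern used in the answer: one capture group containing an alternation
pattern = '({})'.format('|'.join(word_list))
print(pattern)  # (one|three)

# Optional variant (an assumption, not from the original answer):
# escape each word and require word boundaries on both sides
strict_pattern = r'\b({})\b'.format('|'.join(re.escape(w) for w in word_list))

# The plain pattern matches inside a longer word; the strict one does not
print(re.search(pattern, 'someone').group(1))   # one
print(re.search(strict_pattern, 'someone'))     # None
```

Whether the stricter form is preferable depends on the data; for the sample sentences in the question both patterns give the same result.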

The re.IGNORECASE flag makes the match case-insensitive, and the resulting matches are lowercased to match your expected output. Rows without a match come back as NaN, which fillna('') replaces with an empty string.
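
For anyone who wants to reproduce the result, the following is a self-contained sketch that rebuilds the sample DataFrame from the question and applies the same steps. The variable name pattern is introduced here for readability, and the code assumes a reasonably recent pandas installation rather than the exact environment of the original answer.

```python
import re
import pandas as pd

word_list = ['one', 'three']

# Sample data copied from the question
df = pd.DataFrame({'TEXT': [
    "Perhaps she'll be the one for me.",
    "Is it two or one?",
    "Mayhaps it be three afterall...",
    "Three times and it's a charm.",
    "One fish, two fish, red fish, blue fish.",
    "There's only one cat in the hat.",
    "One does not simply code into pandas.",
    "Two nights later...",
    "Quoth the Raven... nevermore.",
]})

# Same approach as the answer: alternation inside a single capture group
pattern = '({})'.format('|'.join(word_list))

df['EXTRACT'] = (
    df['TEXT']
    .str.extract(pattern, flags=re.IGNORECASE, expand=False)  # first match, or NaN
    .str.lower()                                              # normalize case
    .fillna('')                                               # NaN -> empty string
)

print(df['EXTRACT'].tolist())
```

Passing expand=False with a single capture group makes str.extract return a Series instead of a one-column DataFrame, which is what allows the result to be assigned directly to a new column.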
