Extracting sentences using pandas with specific words


Problem description

I have an Excel file with a text column. All I need to do is to extract, for each row, the sentences from the text column that contain specific words.

I have tried defining a function:

import pandas as pd
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

#################Reading in excel file#####################

str_df = pd.read_excel("C:\\Users\\HP\\Desktop\\context.xlsx")

################# Defining a function #####################

def sentence_finder(text,word):
    sentences=sent_tokenize(text)
    return [sent for sent in sentences if word in word_tokenize(sent)]
################# Finding Context ##########################
str_df['context'] = str_df['text'].apply(sentence_finder,args=('snakes',))

################# Output file #################################
str_df.to_excel("C:\\Users\\HP\\Desktop\\context_result.xlsx")

But can someone please help me if I have to find sentences containing multiple specific words, like snakes, venomous, anaconda? The sentence should contain at least one of the words. I am not able to work this out with nltk.tokenize for multiple words.

Words to be searched: words = ['snakes','venomous','anaconda']

Input Excel file:

                    text
     1.  Snakes are venomous. Anaconda is venomous.
     2.  Anaconda lives in Amazon.Amazon is a big forest. It is venomous.
     3.  Snakes,snakes,snakes everywhere! Mummyyyyyyy!!!The least I expect is an    anaconda.Because it is venomous.
     4.  Python is dangerous too.

Desired output:

A column called Context appended to the text column above. The Context column should look like:

 1.  [Snakes are venomous.] [Anaconda is venomous.]
 2.  [Anaconda lives in Amazon.] [It is venomous.]
 3.  [Snakes,snakes,snakes everywhere!] [The least I expect is an    anaconda.Because it is venomous.]
 4.  NULL

Thanks in advance.

Recommended answer

Here's how:

In [1]: df['text'].apply(lambda text: [sent for sent in sent_tokenize(text)
                                       if any(True for w in word_tokenize(sent) 
                                               if w.lower() in searched_words)])

0    [Snakes are venomous., Anaconda is venomous.]
1    [Anaconda lives in Amazon.Amazon is a big forest., It is venomous.]
2    [Snakes,snakes,snakes everywhere!, !The least I expect is an anaconda.Because it is venomous.]
3    []
Name: text, dtype: object

You can see that there are a couple of issues: the sent_tokenizer didn't do its job properly because of the punctuation.
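One possible workaround (a sketch of mine, not part of the original answer) is to normalize the spacing before tokenizing, inserting the missing space after sentence-final punctuation that is glued to the next word:

```python
import re

def normalize_spacing(text):
    # Insert a space after '.', '!' or '?' when it is glued to the next
    # word, e.g. "Amazon.Amazon" -> "Amazon. Amazon", so that
    # sent_tokenize can split the sentences correctly.
    return re.sub(r'([.!?]+)(?=[A-Za-z])', r'\1 ', text)

print(normalize_spacing("Anaconda lives in Amazon.Amazon is a big forest."))
# -> Anaconda lives in Amazon. Amazon is a big forest.
```

Running each row's text through this before sent_tokenize should fix the glued sentences in rows 2 and 3 of the input above.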

Update: handling plurals.

Here's the updated df:

text
Snakes are venomous. Anaconda is venomous.
Anaconda lives in Amazon. Amazon is a big forest. It is venomous.
Snakes,snakes,snakes everywhere! Mummyyyyyyy!!! The least I expect is an anaconda. Because it is venomous.
Python is dangerous too.
I have snakes


df = pd.read_clipboard(sep='0')

We can use a stemmer (see Wikipedia), such as the PorterStemmer.

from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

First, let's stem and lowercase the searched words:

searched_words = ['snakes','Venomous','anacondas']
searched_words = [stemmer.stem(w.lower()) for w in searched_words]
searched_words

> ['snake', 'venom', 'anaconda']

Now we can revamp the above to include stemming as well:

print(df['text'].apply(lambda text: [sent for sent in sent_tokenize(text)
                           if any(True for w in word_tokenize(sent) 
                                     if stemmer.stem(w.lower()) in searched_words)]))

0    [Snakes are venomous., Anaconda is venomous.]
1    [Anaconda lives in Amazon., It is venomous.]
2    [Snakes,snakes,snakes everywhere!, The least I expect is an anaconda., Because it is venomous.]
3    []
4    [I have snakes]
Name: text, dtype: object


If you only want substring matching, make sure searched_words is singular, not plural.

print(df['text'].apply(lambda text: [sent for sent in sent_tokenize(text)
                                     if any(w2.lower() in w.lower()
                                            for w in word_tokenize(sent)
                                            for w2 in searched_words)]))

By the way, this is the point where I'd probably create a function with regular for loops; this lambda with list comprehensions is getting out of hand.
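Such a function might look like this (a sketch; the tokenizers are passed in as arguments so nltk's sent_tokenize and word_tokenize can be plugged in — the naive split-based tokenizers below exist only for a quick self-contained demo):

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def find_context(text, searched_stems, sent_tok, word_tok):
    # Collect the sentences of `text` that contain at least one word
    # whose stem is in `searched_stems`.
    matches = []
    for sent in sent_tok(text):
        if any(stemmer.stem(w.lower()) in searched_stems for w in word_tok(sent)):
            matches.append(sent)
    return matches

# Naive demo tokenizers; in practice use nltk's sent_tokenize/word_tokenize.
def naive_sents(text):
    return [s for s in text.split('. ') if s]

def naive_words(sent):
    return sent.replace('.', ' ').replace(',', ' ').split()

stems = {stemmer.stem(w.lower()) for w in ['snakes', 'venomous', 'anaconda']}
print(find_context("Snakes are venomous. Anaconda is venomous.",
                   stems, naive_sents, naive_words))
```

With nltk's tokenizers you'd then apply it as `df['text'].apply(find_context, args=(stems, sent_tokenize, word_tokenize))`, mirroring the `apply` call in the question.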
