Extracting sentences using pandas with specific words
Question
I have an Excel file with a text column. All I need to do is extract, from the text column of each row, the sentences that contain specific words.
I have tried defining a function:
import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize

################# Reading in excel file ####################
str_df = pd.read_excel("C:\\Users\\HP\\Desktop\\context.xlsx")

################# Defining a function ######################
def sentence_finder(text, word):
    sentences = sent_tokenize(text)
    return [sent for sent in sentences if word in word_tokenize(sent)]

################# Finding Context ##########################
str_df['context'] = str_df['text'].apply(sentence_finder, args=('snakes',))

################# Output file ##############################
str_df.to_excel("C:\\Users\\HP\\Desktop\\context_result.xlsx")
But can someone please help me if I have to find the sentences containing any of multiple specific words, like snakes, venomous, anaconda? The sentence should contain at least one of those words. I am not able to work this out with nltk.tokenize for multiple words.
Words to be searched: words = ['snakes', 'venomous', 'anaconda']
Input Excel file:
text
1. Snakes are venomous. Anaconda is venomous.
2. Anaconda lives in Amazon.Amazon is a big forest. It is venomous.
3. Snakes,snakes,snakes everywhere! Mummyyyyyyy!!!The least I expect is an anaconda.Because it is venomous.
4. Python is dangerous too.
Desired output:
A column called Context appended to the text column above. The Context column should look like:
1. [Snakes are venomous.] [Anaconda is venomous.]
2. [Anaconda lives in Amazon.] [It is venomous.]
3. [Snakes,snakes,snakes everywhere!] [The least I expect is an anaconda.Because it is venomous.]
4. NULL
Thanks in advance.
Answer
Here's how:
In [1]: df['text'].apply(lambda text: [sent for sent in sent_tokenize(text)
                                       if any(True for w in word_tokenize(sent)
                                              if w.lower() in searched_words)])
0 [Snakes are venomous., Anaconda is venomous.]
1 [Anaconda lives in Amazon.Amazon is a big forest., It is venomous.]
2 [Snakes,snakes,snakes everywhere!, !The least I expect is an anaconda.Because it is venomous.]
3 []
Name: text, dtype: object
You can see that there are a couple of issues: the sent_tokenizer didn't do its job properly because of the punctuation (there is no space after some of the periods, so adjoining sentences are not split apart).
Update: handling plurals.
Here's the updated df:
text
Snakes are venomous. Anaconda is venomous.
Anaconda lives in Amazon. Amazon is a big forest. It is venomous.
Snakes,snakes,snakes everywhere! Mummyyyyyyy!!! The least I expect is an anaconda. Because it is venomous.
Python is dangerous too.
I have snakes
df = pd.read_clipboard(sep='0')
We can use a stemmer (see Wikipedia), such as the PorterStemmer.
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
First, let's stem and lowercase the searched words:
searched_words = ['snakes','Venomous','anacondas']
searched_words = [stemmer.stem(w.lower()) for w in searched_words]
searched_words
> ['snake', 'venom', 'anaconda']
Now we can revamp the above to include stemming as well:
print(df['text'].apply(lambda text: [sent for sent in sent_tokenize(text)
                                     if any(True for w in word_tokenize(sent)
                                            if stemmer.stem(w.lower()) in searched_words)]))
0 [Snakes are venomous., Anaconda is venomous.]
1 [Anaconda lives in Amazon., It is venomous.]
2 [Snakes,snakes,snakes everywhere!, The least I expect is an anaconda., Because it is venomous.]
3 []
4 [I have snakes]
Name: text, dtype: object
If you only want substring matching, make sure searched_words contains singular forms, not plurals.
print(df['text'].apply(lambda text: [sent for sent in sent_tokenize(text)
                                     if any(w2.lower() in w.lower()
                                            for w in word_tokenize(sent)
                                            for w2 in searched_words)]))
By the way, this is the point where I'd probably write a function with regular for loops; this lambda with list comprehensions is getting out of hand.