Python(NLTK)-提取名词短语的更有效方法? [英] Python (NLTK) - more efficient way to extract noun phrases?
问题描述
我有一个涉及大量文本数据的机器学习任务.我想在训练文本中识别并提取名词短语,以便稍后在管道中将其用于特征构建. 我已经从文本中提取了我想要的名词短语的类型,但是我对NLTK还是很陌生,所以我以一种可以分解列表理解的每一步的方式来解决这个问题,如下所示.
I've got a machine learning task involving a large amount of text data. I want to identify, and extract, noun-phrases in the training text so I can use them for feature construction later on in the pipeline. I've extracted the type of noun-phrases I wanted from text but I'm fairly new to NLTK, so I approached this problem in a way where I can break down each step in list comprehensions like you can see below.
但是我真正的问题是,我在这里重塑车轮吗?有没有一种我看不到的更快的方法?
But my real question is, am I reinventing the wheel here? Is there a faster way to do this that I'm not seeing?
import nltk
import pandas as pd
myData = pd.read_excel("\User\train_.xlsx")
texts = myData['message']
# Defining a grammar & Parser
NP = "NP: {(<V\w+>|<NN\w?>)+.*<NN\w?>}"
chunkr = nltk.RegexpParser(NP)
tokens = [nltk.word_tokenize(i) for i in texts]
tag_list = [nltk.pos_tag(w) for w in tokens]
phrases = [chunkr.parse(sublist) for sublist in tag_list]
leaves = [[subtree.leaves() for subtree in tree.subtrees(filter = lambda t: t.label == 'NP')] for tree in phrases]
将我们最终得到的元组列表的列表放宽到 只是元组列表的列表
flatten the list of lists of lists of tuples that we've ended up with, into just a list of lists of tuples
leaves = [tupls for sublists in leaves for tupls in sublists]
将提取的术语加入一个二元组
Join the extracted terms into one bigram
nounphrases = [unigram[0][1]+' '+unigram[1][0] in leaves]
推荐答案
Take a look at Why is my NLTK function slow when processing the DataFrame?, there's no need to iterate through all rows multiple times if you don't need intermediate steps.
使用ne_chunk
和
[代码]:
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import RegexpParser
from nltk import Tree
import pandas as pd
def get_continuous_chunks(text, chunk_func=ne_chunk):
chunked = chunk_func(pos_tag(word_tokenize(text)))
continuous_chunk = []
current_chunk = []
for subtree in chunked:
if type(subtree) == Tree:
current_chunk.append(" ".join([token for token, pos in subtree.leaves()]))
elif current_chunk:
named_entity = " ".join(current_chunk)
if named_entity not in continuous_chunk:
continuous_chunk.append(named_entity)
current_chunk = []
else:
continue
return continuous_chunk
df = pd.DataFrame({'text':['This is a foo, bar sentence with New York city.',
'Another bar foo Washington DC thingy with Bruce Wayne.']})
df['text'].apply(lambda sent: get_continuous_chunks((sent)))
[输出]:
0 [New York]
1 [Washington, Bruce Wayne]
Name: text, dtype: object
要使用自定义RegexpParser
:
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import RegexpParser
from nltk import Tree
import pandas as pd
# Defining a grammar & Parser
NP = "NP: {(<V\w+>|<NN\w?>)+.*<NN\w?>}"
chunker = RegexpParser(NP)
def get_continuous_chunks(text, chunk_func=ne_chunk):
chunked = chunk_func(pos_tag(word_tokenize(text)))
continuous_chunk = []
current_chunk = []
for subtree in chunked:
if type(subtree) == Tree:
current_chunk.append(" ".join([token for token, pos in subtree.leaves()]))
elif current_chunk:
named_entity = " ".join(current_chunk)
if named_entity not in continuous_chunk:
continuous_chunk.append(named_entity)
current_chunk = []
else:
continue
return continuous_chunk
df = pd.DataFrame({'text':['This is a foo, bar sentence with New York city.',
'Another bar foo Washington DC thingy with Bruce Wayne.']})
df['text'].apply(lambda sent: get_continuous_chunks(sent, chunker.parse))
[输出]:
0 [bar sentence, New York city]
1 [bar foo Washington DC thingy, Bruce Wayne]
Name: text, dtype: object
这篇关于Python(NLTK)-提取名词短语的更有效方法?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!