Extracting Key-Phrases from text based on the Topic with Python


Problem Description


I have a large dataset with 3 columns; the columns are text, phrase and topic. I want to find a way to extract key-phrases (the phrase column) based on the topic. A key-phrase can be part of the text value or the whole text value.

import pandas as pd


text = ["great game with a lot of amazing goals from both teams",
        "goalkeepers from both teams made misteke",
        "he won all four grand slam championchips",
        "the best player from three-point line",
        "Novak Djokovic is the best player of all time",
        "amazing slam dunks from the best players",
        "he deserved yellow-card for this foul",
        "free throw points"]

phrase = ["goals", "goalkeepers", "grand slam championchips", "three-point line", "Novak Djokovic", "slam dunks", "yellow-card", "free throw points"]

topic = ["football", "football", "tennis", "basketball", "tennis", "basketball", "football", "basketball"]

df = pd.DataFrame({"text":text,
                   "phrase":phrase,
                   "topic":topic})

print(df.text)
print(df.phrase)

I'm having a lot of trouble finding a way to do something like this, because I have more than 50,000 rows in my dataset, around 48,000 unique phrase values, and 3 different topics.

I guess that building a dataset with all football, basketball and tennis topics is not really the best solution. So I was thinking about making some kind of ML model for this, but again that means I would have 2 features (text and topic) and one result (phrase), and my result would have more than 48,000 different classes, which is not a good approach.

I was thinking about using the text column as a feature and applying a classification model in order to find sentiment. After that I could use the predicted sentiment to extract key features, but I do not know how to extract them.

One more problem is that I only get 66% accuracy when I try to classify sentiment using CountVectorizer or TfidfTransformer with Random Forest, Decision Tree, or any other classification algorithm, and also 66% accuracy if I'm using TextBlob for sentiment analysis.

Any help?

Solution

It looks like a good approach here would be to use a Latent Dirichlet Allocation (LDA) model, which is an example of what are known as topic models.


LDA is an unsupervised model that finds similar groups among a set of observations, which you can then use to assign a topic to each of them. Here I'll go through what could be an approach to solve this by training a model using the sentences in the text column. If the phrases are representative enough and contain the necessary information to be captured by the model, they could also be a good (possibly better) candidate for training the model, though you'll be better placed to judge that yourself.

Before you train the model, you need to apply some preprocessing steps, including tokenizing the sentences, removing stopwords and lemmatizing the words. For that you can use nltk:

# Note: the nltk resources 'punkt', 'stopwords' and 'wordnet' may need to be
# downloaded first via nltk.download(...)
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import numpy as np  # used further below for np.array / np.argsort
import lda
from sklearn.feature_extraction.text import CountVectorizer

ignore = set(stopwords.words('english'))
stemmer = WordNetLemmatizer()
text = []
for sentence in df.text:
    words = word_tokenize(sentence)
    stemmed = []
    for word in words:
        if word not in ignore:
            stemmed.append(stemmer.lemmatize(word))
    text.append(' '.join(stemmed))

Now we have a more appropriate corpus to train the model:

print(text)

['great game lot amazing goal team',
 'goalkeeper team made misteke',
 'four grand slam championchips',
 'best player three-point line',
 'Novak Djokovic best player time',
 'amazing slam dunk best player',
 'deserved yellow-card foul',
 'free throw point']

We can then convert the text to a matrix of token counts through CountVectorizer, which is the input LDA will be expecting:

vec = CountVectorizer(analyzer='word', ngram_range=(1,1))
X = vec.fit_transform(text)

Note that you can use the ngram_range parameter to specify the n-gram range you want to consider when training the model. By setting ngram_range=(1,2), for instance, you'd end up with features containing all individual words as well as the 2-grams in each sentence; here's an example having trained CountVectorizer with ngram_range=(1,2):
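As a minimal sketch, the re-fit itself would look like this (shown only to illustrate the resulting feature names; the topics printed further below were produced with the unigram-only vectorizer defined earlier):

# Re-fit the vectorizer so that the features include unigrams and bigrams
vec = CountVectorizer(analyzer='word', ngram_range=(1, 2))
X = vec.fit_transform(text)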

vec.get_feature_names()
['amazing',
 'amazing goal',
 'amazing slam',
 'best',
 'best player',
 ....

The advantage of using n-grams is that you could then also find Key-Phrases other than just single words.

Then we can train the LDA with whatever number of topics you want; in this case I'll just be selecting 3 topics (note that this has nothing to do with the topics column), which you can consider to be the Key-Phrases (or words, in this case) that you mention. Here I'll be using lda, though there are several options such as gensim. Each topic will have an associated set of words from the vocabulary it has been trained on, with each word having a score measuring the relevance of the word in the topic.

model = lda.LDA(n_topics=3, random_state=1)
model.fit(X)
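
Since gensim was mentioned as an alternative, here's a rough sketch of the equivalent training step with it; this builds gensim's own dictionary and bag-of-words corpus from the preprocessed sentences instead of reusing the CountVectorizer output:

from gensim import corpora, models

# Tokenize the already-preprocessed sentences and build a bag-of-words corpus
tokenized = [sentence.split() for sentence in text]
dictionary = corpora.Dictionary(tokenized)
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]

# Train an LDA model with 3 topics, mirroring the lda.LDA call above
gensim_lda = models.LdaModel(corpus=corpus, id2word=dictionary,
                             num_topics=3, random_state=1)
print(gensim_lda.print_topics(num_words=3))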

Through topic_word_ we can now obtain the scores associated with each topic. We can use argsort to sort the vector of scores, and use it to index the vector of feature names, which we can obtain with vec.get_feature_names:

topic_word = model.topic_word_

vocab = vec.get_feature_names()
n_top_words = 3

for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))

Topic 0: best player point
Topic 1: amazing team slam
Topic 2: yellow novak card
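
As mentioned earlier, the fitted model can also be used to assign a topic to each observation; with the lda package that information is exposed through doc_topic_. A minimal sketch:

# doc_topic_ holds one topic distribution per training sentence;
# argmax picks the most likely topic for each of them
doc_topic = model.doc_topic_
for i, sentence in enumerate(text):
    print('{} -> Topic {}'.format(sentence, doc_topic[i].argmax()))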


The printed results don't really represent much in this case, since the model has been trained with the small sample from the question; however, you should see clearer and more meaningful topics by training with your entire corpus.

Also note that for this example I've used the whole vocabulary to train the model. However, it seems that in your case it would make more sense to split the text column into groups according to the different topics you already have, and train a separate model on each group. But hopefully this gives you a good idea of how to proceed.
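
As a rough illustration of that per-topic split (my own sketch, not part of the original answer; it reuses df, word_tokenize, stemmer and ignore from above, and the preprocess helper is just a convenience wrapper around the same preprocessing):

def preprocess(sentence):
    # Same preprocessing as above: tokenize, drop stopwords, lemmatize
    words = word_tokenize(sentence)
    return ' '.join(stemmer.lemmatize(w) for w in words if w not in ignore)

# Train one vectorizer + LDA model per topic already present in the data
models_by_topic = {}
for topic_name, group in df.groupby('topic'):
    processed = [preprocess(s) for s in group.text]
    vec_topic = CountVectorizer(analyzer='word', ngram_range=(1, 2))
    X_topic = vec_topic.fit_transform(processed)
    topic_model = lda.LDA(n_topics=3, random_state=1)
    topic_model.fit(X_topic)
    models_by_topic[topic_name] = (vec_topic, topic_model)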
