Classify a noun into abstract or concrete using NLTK or similar

Problem description

How can I categorize a list of nouns as abstract or concrete in Python?

For example:

"Have a seat in that chair."

In the sentence above, "chair" is a noun and can be categorized as concrete.

Recommended answer

I would suggest training a classifier on pretrained word vectors.

You need two libraries: spacy for tokenizing text and extracting word vectors, and scikit-learn for machine learning:

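# Requires the medium English model, which ships with word vectors:
#   python -m spacy download en_core_web_md
import numpy as np
import spacy
from sklearn.linear_model import LogisticRegression

nlp = spacy.load("en_core_web_md")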

Distinguishing concrete and abstract nouns is a simple task, so you can train a model with very few examples:

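classes = ['concrete', 'abstract']
# todo: add more examples
train_set = [
    ['apple', 'owl', 'house'],          # label 0: concrete
    ['agony', 'knowledge', 'process'],  # label 1: abstract
]
# One word vector per training word, in the same order as the labels in y
X = np.stack([nlp(w)[0].vector for part in train_set for w in part])
# Label each word with the index of the list it belongs to
y = [label for label, part in enumerate(train_set) for _ in part]
# A small C means strong regularization, which suits a tiny training set
classifier = LogisticRegression(C=0.1, class_weight='balanced').fit(X, y)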
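
The two comprehensions above rely on the position of each word list inside train_set to define its label. As a minimal sketch, using only the standard library and no word vectors, this is how the words and labels line up. The `words` variable is introduced here for illustration and is not part of the original answer:

```python
classes = ['concrete', 'abstract']
train_set = [
    ['apple', 'owl', 'house'],          # index 0 -> 'concrete'
    ['agony', 'knowledge', 'process'],  # index 1 -> 'abstract'
]

# Flatten the words in the same order that is used to build X.
words = [w for part in train_set for w in part]
# Each word receives the index of the sub-list it came from as its label.
y = [label for label, part in enumerate(train_set) for _ in part]

for word, label in zip(words, y):
    print(word, classes[label])
```

Because both comprehensions iterate over train_set in the same order, row i of X always corresponds to label y[i].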

When you have a trained model, you can apply it to any text:

for token in nlp("Have a seat in that chair with comfort and drink some juice to soothe your thirst."):
    if token.pos_ == 'NOUN':
        print(token, classes[classifier.predict([token.vector])[0]])

The results look satisfactory:

# seat concrete
# chair concrete
# comfort abstract
# juice concrete
# thirst abstract

You can improve the model by applying it to different nouns, spotting the errors, and adding them to the training set under the correct label.
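
One way to keep track of such corrections is sketched below. The helper `add_correction` is a hypothetical name, not part of the original answer or of any library, and the misclassified noun "idea" is an invented example rather than an observed model error:

```python
def add_correction(train_set, classes, noun, correct_class):
    """Add a misclassified noun to the word list of its correct class."""
    label = classes.index(correct_class)  # raises ValueError for an unknown class
    # Remove the noun from any other class so its labels never conflict.
    for other, part in enumerate(train_set):
        if other != label and noun in part:
            part.remove(noun)
    # Avoid duplicate entries in the target class.
    if noun not in train_set[label]:
        train_set[label].append(noun)
    return train_set


classes = ['concrete', 'abstract']
train_set = [
    ['apple', 'owl', 'house'],
    ['agony', 'knowledge', 'process'],
]

# Suppose the model had labelled "idea" as concrete (hypothetical error).
add_correction(train_set, classes, 'idea', 'abstract')
print(train_set[1])
```

After each batch of corrections, X and y would need to be rebuilt from the updated train_set and the classifier refitted, as in the training step.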
