Training a model to identify names appearing in a sentence

Question

I have a dataset containing the names of about 238583 people. A name can contain more than one word, for example: Willie Enriquez, James J Johnson, D.J. Khaled. My problem is identifying these names when they appear in a sentence. I am trying to create a machine learning model that can decide whether its input is a name or not, and my trouble is figuring out what the input and output of this model should be. Since I have a large list of names, I can train a model to recognise a name when the input is a name, but what about the other words in the sentence? The model should also be able to identify words that are not names. Assuming the sentences can contain any other words, what would be the ideal dataset for this purpose? Does it make sense to train the model on a random bunch of words tagged as NonNames?
(The full sentences in which the names appear are not available. The user can type absolutely anything he/she wants.)

Thanks.

Recommended answer

The specifics of the answer may vary according to which model you are using, but the general idea is more or less the following:

You are trying to solve a classification task, specifically a binary classification task in which you want to distinguish proper names (judging from your examples) from other expressions.

In the most general case, the input to your model is the set of features of the example you want to classify. You should decide which features you think are useful for distinguishing names (e.g., number of words, contains a capital letter, every word is capitalized, contains dotted letters, contains any word that already appears in your dataset, etc.). The output is the class: 0 for non-names and 1 for names.
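
To make the feature idea concrete, here is a minimal sketch in Python that turns a candidate string into the features listed above. The function name `extract_features`, the feature names, and the small `known` word set are assumptions made for illustration; they are not part of any particular library.

```python
import re

def extract_features(text, known_name_words):
    """Turn a candidate string into a small dictionary of numeric features."""
    words = text.split()
    return {
        # How many whitespace-separated words the candidate has.
        "num_words": len(words),
        # 1 if any character is upper case.
        "has_capital": int(any(ch.isupper() for ch in text)),
        # 1 if every word starts with an upper-case letter.
        "all_words_capitalized": int(bool(words) and all(w[0].isupper() for w in words)),
        # 1 if a single letter is followed by a dot, as in "D.J.".
        "has_dotted_letter": int(bool(re.search(r"\b[A-Za-z]\.", text))),
        # 1 if any word already occurs somewhere in the name dataset.
        "has_known_word": int(any(w.lower() in known_name_words for w in words)),
    }

# Hypothetical set of lower-cased words collected from the name dataset.
known = {"willie", "enriquez", "james", "johnson"}

print(extract_features("James J Johnson", known))
print(extract_features("D.J. Khaled", known))
print(extract_features("the weather today", known))
```

Note that capitalization features only help if users actually capitalize names; for free-form input they may be unreliable, so the word-lookup feature probably carries most of the signal.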

You then train your model with positive examples taken from the dataset you have and negative examples (i.e., non-names) built from random words.
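
A rough sketch of how such a training set could be assembled is shown below. The helper `build_dataset` and the tiny `vocabulary` list are hypothetical; in practice the vocabulary would come from some general word list. Giving each negative example the same word count as a real name is a design choice made here, not something the answer prescribes, so that length alone does not reveal the label.

```python
import random

def build_dataset(names, vocabulary, seed=0):
    """Return shuffled (text, label) pairs: 1 for names, 0 for random word sequences."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    name_set = {n.lower() for n in names}
    examples = [(name, 1) for name in names]
    for name in names:
        # Build a negative example with as many words as this name.
        n_words = len(name.split())
        candidate = " ".join(rng.choice(vocabulary) for _ in range(n_words))
        # Skip the rare case where random words happen to form a real name.
        if candidate.lower() not in name_set:
            examples.append((candidate, 0))
    rng.shuffle(examples)
    return examples

# Illustrative inputs only: three names from the question and a made-up word list.
names = ["Willie Enriquez", "James J Johnson", "D.J. Khaled"]
vocabulary = ["weather", "please", "book", "a", "flight", "tomorrow"]

dataset = build_dataset(names, vocabulary)
for text, label in dataset:
    print(label, text)
```

The resulting pairs can be converted to feature vectors and passed to any binary classifier.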

If the user can enter full sentences, you will need a preprocessing step in which you extract all word sequences of length N (word n-grams) and classify each of them individually with your previously trained model.
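
The n-gram step could look roughly like the following sketch. The names `word_ngrams`, `find_names`, and `is_name` are assumptions for illustration, and `is_name` is only a lookup standing in for the trained classifier so that the example runs on its own.

```python
def word_ngrams(sentence, max_n=3):
    """Return every run of 1..max_n consecutive words in the sentence."""
    words = sentence.split()
    ngrams = []
    for n in range(1, max_n + 1):
        for start in range(len(words) - n + 1):
            ngrams.append(" ".join(words[start:start + n]))
    return ngrams

def find_names(sentence, is_name, max_n=3):
    """Classify each n-gram separately and keep those flagged as names."""
    return [text for text in word_ngrams(sentence, max_n) if is_name(text)]

# Stand-in for the trained model: a plain lookup, NOT a real classifier.
known_names = {"james j johnson", "willie enriquez"}

def is_name(text):
    return text.lower() in known_names

print(find_names("please call James J Johnson tomorrow", is_name))
# prints ['James J Johnson']
```

Here `max_n` should be at least the word count of the longest name you expect; with a real classifier, overlapping n-grams may both be flagged, so some rule for choosing among them (for example, preferring the longest match) would likely be needed.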
