How to handle two entity extraction methods in NLP
Question
I am using two different entity extraction methods (https://rasa.com/docs/nlu/entities/) in the same NLP model, which I am building with the RASA framework for a chatbot. The bot should handle questions that contain custom entities as well as general ones such as location or organisation, so I use both the ner_spacy and ner_crf components to create the model. I then wrote a small Python helper script to evaluate the model's performance, and noticed that the model struggles to choose the correct entity.
For example, for a word 'X' it chose the pre-defined entity 'ORG' from SpaCy, but it should have been recognised as a custom entity that I defined in the training data.
If I use only the ner_crf extractor, I face huge problems identifying location entities such as capitals. Another of my biggest problems is single-answer entities.
Q: "What's your favourite animal?"
A: Dog
My model is not able to extract the entity 'animal' from this single-word answer. If I answer the question with two words, such as 'The Dog', the model has no problem extracting the animal entity with the value 'Dog'.
So my question is: is it sensible to use two different components to extract entities, one for custom entities and the other for pre-defined entities? If I use both, what mechanism in the model decides which extractor is used?
By the way, I am currently just testing things out, so my training set is not as large as it should be (fewer than 100 examples). Could the problem be solved with many more training examples?
Recommended answer
You are facing two problems here. Here are a few approaches that I found helpful.
1. Custom entity recognition: To solve this, you need to add more training sentences that cover all possible lengths of entities. ner_crf predicts better when there are identifiable markers around the entities (e.g. prepositions).
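As a rough illustration of that advice, the sketch below generates training sentences in which the same entity appears at different token lengths and inside different surrounding words. The animal values, the carrier phrases and the helper names are all made up for this example, and the "[value](entity)" annotation is assumed to match the markdown training-data format of the Rasa NLU version in use, so check it against your own setup.

```python
from itertools import product

# Hypothetical entity values with one, two and three tokens.
ANIMALS = ["dog", "polar bear", "great white shark"]

# Hypothetical carrier phrases; "{}" marks where the entity is placed.
# They range from the bare entity to sentences with words on both sides.
TEMPLATES = [
    "{}",
    "the {}",
    "my favourite animal is the {}",
    "i really like the {} a lot",
]


def annotate(value, entity="animal"):
    """Wrap a value in the assumed '[value](entity)' annotation syntax."""
    return "[{}]({})".format(value, entity)


def build_examples(values, templates, entity="animal"):
    """Combine every template with every value into one training line each."""
    return [
        "- " + template.format(annotate(value, entity))
        for template, value in product(templates, values)
    ]


examples = build_examples(ANIMALS, TEMPLATES)
for line in examples:
    print(line)
```

The generated lines would still need to be placed under the right intent in the training file, and hand-written sentences are likely to be more natural than templated ones.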
2. Extracting entities from a single-word answer: As a workaround, I suggest doing the following manipulation on the client end.
When you send a question like "What's your favorite animal?", add a marker to the question to indicate to the client that a single answer is expected. For example, you can send "##SINGLE## What's your favorite animal?" to the client.
The client can remove the ##SINGLE## marker from the question and show it to the user. But when the client sends the user's response to the server, it doesn't send "Dog"; it sends something like "User responded with single answer as Dog".
You can train your model to extract entities from such an answer.
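The client-side part of this workaround could look roughly like the sketch below. The function names are hypothetical, and it assumes the marker is placed at the very start of the question, as in the example above; how the client actually talks to the server is left out.

```python
SINGLE_MARKER = "##SINGLE##"
# Carrier sentence the model is trained on (wording taken from the answer above).
SINGLE_PREFIX = "User responded with single answer as "


def prepare_question(question):
    """Strip the marker, if present, and report whether a single answer is expected.

    Returns a (text_to_display, expects_single) tuple.
    """
    if question.startswith(SINGLE_MARKER):
        return question[len(SINGLE_MARKER):].strip(), True
    return question, False


def prepare_reply(reply, expects_single):
    """Wrap the user's reply in the carrier sentence when a single answer was expected."""
    if expects_single:
        return SINGLE_PREFIX + reply.strip()
    return reply


# Example round trip on the client.
text, expects_single = prepare_question("##SINGLE## What's your favorite animal?")
print(text)
print(prepare_reply("Dog", expects_single))
```

The training data then needs examples of the carrier sentence with the entity annotated, so that the extractor sees the value with context around it.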