How to break conversation data into pairs of (Context, Response)

Problem description

I'm using the Gensim Doc2Vec model, trying to cluster portions of customer support conversations. My goal is to give the support team auto-response suggestions.

Figure 1: shows a sample conversation where the user question is answered in the next conversation line, making it easy to extract the data:

during the conversation "Hi, our office is located in NY" should be suggested

Figure 2: describes a conversation where the questions and answers are not in sync

during the conversation "Hi, our office is located in NY" should be suggested

Figure 3: describes a conversation where the context for the answer is built up over time, and for classification purposes (I'm assuming) some of the lines are redundant.

during the conversation "here is a link for the free trial account" should be suggested

I have the following data per conversation line (simplified):
who wrote the line (user or agent), text, time stamp
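
For concreteness, a single conversation line could be represented like this (the exact schema is an assumption, not the actual data format):

# (speaker, text, timestamp) -- field order and labels are illustrative
line = ('user', 'where are you based?', '2017-06-01T10:15:00Z')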

I'm using the following code to train my model:

from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedLineDocument
import datetime

print('Creating documents', datetime.datetime.now().time())
# TaggedLineDocument yields one document per line of the file,
# tagged with its line number
context = TaggedLineDocument('./test_data/context.csv')

print('Building model and training', datetime.datetime.now().time())
# Passing the corpus to the constructor builds the vocabulary and trains
# the model in one go; the old manual loop over model.train() is a
# deprecated pre-1.0 gensim idiom and is broken in newer releases, where
# the epochs parameter replaces it (and vector_size replaces size)
model = Doc2Vec(context, vector_size=200, window=10, min_count=10,
                workers=4, epochs=10)

model.save('./test_data/model')

Q: How should I structure my training data and what heuristics could be applied in order to extract it from the raw data?

Recommended answer

To train a model I would start by concatenating consecutive sequences of messages. Using the timestamps, I would concatenate all consecutive messages from one entity that have no message from the other entity in between.

For example:

Hello
I have a problem
I cannot install software X
                                       Hi
                                       What error do you get?

should become:

Hello I have a problem I cannot install software X
                                       Hi What error do you get?
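
A minimal sketch of that grouping step (assuming each conversation line is a (speaker, text, timestamp) tuple already sorted by timestamp; the names and labels are illustrative):

# Merge consecutive lines written by the same entity into one message.
# Assumes `lines` is sorted by timestamp.
def merge_turns(lines):
    turns = []
    for speaker, text, _ts in lines:
        if turns and turns[-1][0] == speaker:
            # same entity kept talking: append to the previous turn
            turns[-1] = (speaker, turns[-1][1] + ' ' + text)
        else:
            turns.append((speaker, text))
    return turns

lines = [
    ('user', 'Hello', 1),
    ('user', 'I have a problem', 2),
    ('user', 'I cannot install software X', 3),
    ('agent', 'Hi', 4),
    ('agent', 'What error do you get?', 5),
]
print(merge_turns(lines))
# [('user', 'Hello I have a problem I cannot install software X'),
#  ('agent', 'Hi What error do you get?')]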

Then I would train a model with sentences in that format. I would do that because I am assuming that the conversations stay on a "single topic" between interactions from the two entities. In that scenario, suggesting a single message such as Hi What error do you get? would be totally fine.
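
Building on that, a sketch of feeding the merged turns to Doc2Vec (gensim 4.x parameter names; min_count=1 only because the toy data above is tiny):

from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

# One TaggedDocument per merged turn; the tag is just a running id
turns = merge_turns(lines)
docs = [TaggedDocument(words=text.lower().split(), tags=[i])
        for i, (_speaker, text) in enumerate(turns)]

model = Doc2Vec(docs, vector_size=200, window=10, min_count=1,
                workers=4, epochs=10)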

Also, take a look at the data. If the questions from the users are usually single-sentenced (as in the examples), sentence detection could help a lot. In that case I would apply sentence detection to the concatenated strings (nltk could be an option) and use only single-sentence questions for training. That way you can avoid the out-of-sync problem when training the model, at the price of reducing the size of the dataset.
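
A sketch of that filter with nltk, reusing merge_turns from above; it keeps only single-sentence user questions that are directly followed by an agent reply (the 'user'/'agent' labels are assumptions):

import nltk
from nltk.tokenize import sent_tokenize

# one-time download of the sentence tokenizer
# (newer nltk releases may ask for 'punkt_tab' instead)
nltk.download('punkt')

# Keep (question, answer) pairs where the user turn is a single sentence
def single_sentence_pairs(turns):
    pairs = []
    for (speaker, text), (next_speaker, next_text) in zip(turns, turns[1:]):
        if (speaker == 'user' and next_speaker == 'agent'
                and len(sent_tokenize(text)) == 1):
            pairs.append((text, next_text))
    return pairs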

On the other hand, I would really consider starting with a very simple method. For example, you could score questions by tf-idf and, to get a suggestion, take the most similar question in your dataset with respect to some metric (e.g. cosine similarity) and suggest the answer to that question. That will perform very badly on sentences that rely on context (e.g. how do you do it?) but can perform well on sentences like where are you based?
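
A sketch of that baseline with scikit-learn (the question/answer pairs here are placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Historical (question, answer) pairs; placeholder data for illustration
questions = ['where are you based?', 'how do I get a free trial?']
answers = ['Hi, our office is located in NY',
           'here is a link for the free trial account']

vectorizer = TfidfVectorizer()
question_matrix = vectorizer.fit_transform(questions)

def suggest(new_question):
    # score the new question against every known question and return
    # the answer attached to the most similar one
    sims = cosine_similarity(vectorizer.transform([new_question]),
                             question_matrix)[0]
    return answers[sims.argmax()]

print(suggest('where is your office?'))
# -> 'Hi, our office is located in NY'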

I make this last suggestion because traditional methods often perform even better than complex NN methods when the dataset is small. How big is your dataset?

How you train a NN method is also crucial: there are a lot of hyper-parameters, and tuning them properly can be difficult, which is why having a baseline from a simple method can help you a lot in checking how well you are doing. In this other paper they compare the different hyper-parameters for doc2vec; maybe you will find it useful.

Edit: a completely different option would be to train a model to "link" questions with answers. But for that you would have to manually tag each question with the corresponding answer and then train a supervised learning model on that data. That could potentially generalize better, but it comes with the added effort of manually labelling the sentences, and it still doesn't look like an easy problem to me.
