How to train a model that will result in the similarity score between two news titles?


Problem Description

I am trying to build a fake-news classifier and I am quite new to this field. I have a column "title_1_en" which holds the title of a fake news story and another column called "title_2_en". There are 3 target labels: "agreed", "disagreed", and "unrelated", depending on whether the news title in column "title_2_en" agrees with, disagrees with, or is unrelated to the title in the first column.

I have tried calculating basic cosine similarity between the two titles after converting the words of the sentences into vectors. This has resulted in a cosine similarity score, but it needs a lot of improvement since synonyms and semantic relationships have not been considered at all.

import numpy as np

def L2(vector):
    # Euclidean (L2) norm of a vector
    norm_value = np.linalg.norm(vector)
    return norm_value

def Cosine(fr1, fr2):
    # Cosine similarity: dot product scaled by the product of the norms
    cos = np.dot(fr1, fr2) / (L2(fr1) * L2(fr2))
    return cos

Recommended Answer

The most important thing here is how you convert the two sentences into vectors. There are multiple ways to do that and the most naive way is:

  • Convert each word into a vector - this can be done with standard pre-trained vectors such as word2vec or GloVe.
  • Now each sentence is just a bag of word vectors. This needs to be turned into a single vector, i.e. the full sentence text is mapped to one vector. There are many ways to do this as well; to start, simply take the average of the word vectors in the sentence.
  • Compute the cosine similarity between the two sentence vectors (a minimal sketch follows this list).
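
A minimal sketch of this averaging approach, assuming `word_vectors` is a dict-like lookup of pre-trained embeddings (e.g. loaded from a GloVe file; the name and the 300-dimensional default are placeholders) and reusing the Cosine() helper from the question:

import numpy as np

def sentence_vector(sentence, word_vectors, dim=300):
    # Average the pre-trained vectors of the words found in the sentence;
    # words missing from the vocabulary are simply skipped.
    vectors = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# Hypothetical usage with the Cosine() function defined above:
# v1 = sentence_vector(row["title_1_en"], word_vectors)
# v2 = sentence_vector(row["title_2_en"], word_vectors)
# score = Cosine(v1, v2)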

spaCy's similarity is a good place to start, as it uses this averaging technique. From the docs:

By default, spaCy uses an average-of-vectors algorithm, using pre-trained vectors if available (e.g. the en_core_web_lg model). If not, the doc.tensor attribute is used, which is produced by the tagger, parser and entity recognizer. This is how the en_core_web_sm model provides similarities. Usually the .tensor-based similarities will be more structural, while the word vector similarities will be more topical. You can also customize the .similarity() method, to provide your own similarity function, which can be trained using supervised techniques.
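
A minimal example of the spaCy route, assuming the en_core_web_lg model (which ships with word vectors) has been downloaded; the two headline strings are placeholders:

import spacy

# Requires: python -m spacy download en_core_web_lg
nlp = spacy.load("en_core_web_lg")

doc1 = nlp("Fake news headline from column title_1_en")
doc2 = nlp("Another headline from column title_2_en")

# Doc.similarity() compares the averaged word vectors of the two documents
print(doc1.similarity(doc2))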
