Learning to tag sentences with keywords based on examples


Problem description

I have a set (~50k elements) of small text fragments (usually one or two sentences), each tagged with a set of keywords chosen from a list of ~5k words.

How would I go about implementing a system that, learning from these examples, can then tag new sentences with the same set of keywords? I don't need code; I'm just looking for some pointers and methods/papers/possible ideas on how to implement this.

Recommended answer

If I understood you well, what you need is a measure of similarity for a pair of documents. I have recently been using TF-IDF for clustering documents and it worked quite well. I think you can use TF-IDF values here and compute the cosine similarity between the corresponding TF-IDF vectors of the documents.

1. TF-IDF calculation

TF-IDF stands for Term Frequency - Inverse Document Frequency. Here is a definition of how it can be calculated:

Compute TF-IDF values for all words in all documents                                    
    - TF-IDF score of a word W in document D is

             TF-IDF(W, D) = TF(W, D) * IDF(W) 

      where TF(W, D) is frequency of word W in document D
            IDF(W) = log(N/(2 + #W))
            N  - number of documents
            #W - number of documents that contain word W

    - words contained in the title will count twice (means more important)
    - normalize TF-IDF values: sum of all TF-IDF(W, D)^2 in a document should be 1.

Depending on the technology you use, this can be achieved in different ways. I implemented it in Python using a nested dictionary: the document name D is the outer key, and for each document D there is a nested dictionary with the words W as keys, each mapped to the numeric value of the calculated TF-IDF.
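
To make that concrete, here is a minimal Python sketch of this nested-dictionary approach. The whitespace tokenization, the function name compute_tf_idf, and the use of the Euclidean norm for the normalization step are my assumptions, not part of the original answer:

    import math
    from collections import Counter

    def compute_tf_idf(documents):
        """documents: dict mapping document name D -> list of words in D.
        Returns a nested dict: tfidf[D][W] = normalized TF-IDF of word W in D."""
        n_docs = len(documents)

        # #W: number of documents that contain word W
        doc_freq = Counter()
        for words in documents.values():
            doc_freq.update(set(words))

        tfidf = {}
        for doc, words in documents.items():
            tf = Counter(words)  # TF(W, D): frequency of word W in document D
            scores = {w: tf[w] * math.log(n_docs / (2 + doc_freq[w])) for w in tf}

            # normalize so that the sum of squared TF-IDF values in a document is 1
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            tfidf[doc] = {w: v / norm for w, v in scores.items()}
        return tfidf

    # hypothetical usage with two tiny fragments
    fragments = {
        "doc1": "the cat sat on the mat".split(),
        "doc2": "the dog chased the cat".split(),
    }
    tfidf = compute_tf_idf(fragments)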

2. Similarity calculation

Let's say you have already calculated the TF-IDF values and you want to compare two documents W1 and W2 to see how similar they are. For that we need to use some similarity metric. There are many choices, each with its pros and cons. In this case, IMO, Jaccard similarity and cosine similarity would both work well. Either function would take the TF-IDF values and the names of the two documents W1 and W2 as arguments and return a numeric value indicating how similar the two documents are.
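
Assuming the nested tfidf dictionary produced by the sketch above, both metrics could look roughly like this; the function names and signatures are illustrative, not from the original answer:

    def cosine_similarity(tfidf, d1, d2):
        """Cosine similarity of documents d1 and d2, given the nested tfidf dict.
        Each document's TF-IDF vector was normalized to unit length, so the dot
        product over the shared words is already the cosine of the angle."""
        common = set(tfidf[d1]) & set(tfidf[d2])
        return sum(tfidf[d1][w] * tfidf[d2][w] for w in common)

    def jaccard_similarity(tfidf, d1, d2):
        """Jaccard similarity over the sets of words occurring in d1 and d2."""
        w1, w2 = set(tfidf[d1]), set(tfidf[d2])
        return len(w1 & w2) / len(w1 | w2) if (w1 | w2) else 0.0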

After computing the similarity between two documents you will obtain a numeric value. The greater the value, the more similar the two documents W1 and W2 are. Now, depending on what you want to achieve, there are two scenarios.

  • If you want to assign to a new document only the tags of the single most similar document, compare it with all other documents and copy over the tags of the most similar one.
  • You can set a threshold and assign all tags of the documents whose similarity with the document in question is greater than the threshold. If you set threshold = 0.7, then a document W will receive the tags of every already-tagged document V for which similarity(W, V) > 0.7 (a sketch covering both scenarios follows this list).
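
A rough sketch of both scenarios, again assuming the tfidf dictionary and cosine_similarity function from the earlier snippets; the helper name tag_new_document and the tags mapping are hypothetical:

    def tag_new_document(new_doc, tfidf, tags, threshold=None):
        """Assign keywords to new_doc by comparing it with all tagged documents.

        tfidf must already contain TF-IDF values for new_doc (e.g. recomputed
        over the corpus plus the new fragment); tags maps document name -> set
        of keywords. With threshold=None the tags of the single most similar
        document are copied; otherwise the tags of every document whose
        similarity exceeds the threshold are collected."""
        scores = {d: cosine_similarity(tfidf, new_doc, d) for d in tags}
        if threshold is None:
            best = max(scores, key=scores.get)
            return set(tags[best])
        return {t for d, s in scores.items() if s > threshold for t in tags[d]}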

Hope this helps.

Good luck :)
