Learning to tag sentences with keywords based on examples
Problem Description
I have a set (~50k elements) of small text fragments (usually one or two sentences), each tagged with a set of keywords chosen from a list of ~5k words.
How would I go about implementing a system that, learning from these examples, can then tag new sentences with the same set of keywords? I don't need code; I'm just looking for some pointers and methods/papers/possible ideas on how to implement this.
Answer
If I understood you correctly, what you need is a measure of similarity for a pair of documents. I have recently been using TF-IDF for clustering documents and it worked quite well. I think you can use TF-IDF values here and compute the cosine similarity between the corresponding TF-IDF values of the documents.
- TF-IDF Calculation

TF-IDF stands for Term Frequency - Inverse Document Frequency. Here is a definition of how it can be calculated:
Compute TF-IDF values for all words in all documents
- TF-IDF score of a word W in document D is
TF-IDF(W, D) = TF(W, D) * IDF(W)
where TF(W, D) is frequency of word W in document D
IDF(W) = log(N/(2 + #W))
N - number of documents
#W - number of documents that contain word W
- words contained in the title will count twice (means more important)
- normalize TF-IDF values: sum of all TF-IDF(W, D)^2 in a document should be 1.
Depending on the technology you use, this may be achieved in different ways. I implemented it in Python using a nested dictionary: the document name D is the outer key, and for each document D there is a nested dictionary with word W as key, where each word W maps to its calculated TF-IDF value.
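A minimal sketch of that nested-dictionary approach in Python (the function name and whitespace tokenization are my assumptions; the formulas follow the definition above). Note that with the quoted IDF, log(N/(2 + #W)), words in very small corpora can receive zero or negative scores:

```python
import math
from collections import Counter

def tf_idf(documents):
    """Compute normalized TF-IDF for a {doc_name: text} mapping.

    Returns a nested dict {doc_name: {word: tfidf_value}}, matching
    the structure described in the answer. Tokenization here is a
    simple lowercase whitespace split (an assumption, not part of
    the original answer).
    """
    n = len(documents)
    tokenized = {name: text.lower().split() for name, text in documents.items()}

    # Document frequency: in how many documents each word appears.
    df = Counter()
    for words in tokenized.values():
        df.update(set(words))

    scores = {}
    for name, words in tokenized.items():
        counts = Counter(words)
        # TF-IDF(W, D) = TF(W, D) * log(N / (2 + #W)), as quoted above.
        doc_scores = {w: tf * math.log(n / (2 + df[w])) for w, tf in counts.items()}
        # Normalize so the sum of squared TF-IDF values per document is 1.
        norm = math.sqrt(sum(v * v for v in doc_scores.values())) or 1.0
        scores[name] = {w: v / norm for w, v in doc_scores.items()}
    return scores
```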
- Similarity Calculation

Say you have already calculated the TF-IDF values and you want to compare how similar two documents W1 and W2 are. For that we need to use some similarity metric. There are many choices, each with pros and cons. In this case, IMO, Jaccard similarity and cosine similarity would both work well. Both functions would take the TF-IDF values and the names of the two documents W1 and W2 as arguments and return a numeric value indicating how similar the two documents are.
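As a sketch, cosine similarity over the nested TF-IDF dictionary described earlier could look like this (the function name and dict layout are assumptions; if the vectors were already normalized to unit length, the dot product alone would suffice):

```python
import math

def cosine_similarity(tfidf, d1, d2):
    """Cosine similarity between two documents, given a nested
    TF-IDF dict of the form {doc_name: {word: value}}.
    Returns a value in [-1, 1]; higher means more similar.
    """
    v1, v2 = tfidf[d1], tfidf[d2]
    # Only words present in both documents contribute to the dot product.
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    if n1 == 0 or n2 == 0:
        return 0.0
    return dot / (n1 * n2)
```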
After computing the similarity between two documents you obtain a numeric value. The greater the value, the more similar the two documents W1 and W2 are. Now, depending on what you want to achieve, there are two scenarios.
- If you want to assign a document only the tags of the most similar document, compare it with all other documents and assign it the tags of the most similar one.
- Alternatively, you can set a threshold and assign all tags of the documents whose similarity to the document in question is greater than the threshold. If you set threshold = 0.7, then a document W will receive the tags of every already-tagged document V for which similarity(W, V) > 0.7.
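The two scenarios above could be sketched like this (the function and parameter names are illustrative, not from the original answer; the fallback to the single most similar document covers the first scenario when nothing passes the threshold):

```python
def tag_new_document(new_doc, tagged_docs, similarity, threshold=0.7):
    """Assign tags to new_doc using already-tagged documents.

    tagged_docs: {doc_name: set_of_tags}
    similarity:  a function (doc_a, doc_b) -> float, e.g. cosine
                 similarity over TF-IDF vectors.

    Returns the union of tags from every tagged document whose
    similarity to new_doc exceeds the threshold; if none passes,
    falls back to the tags of the single most similar document.
    """
    sims = {name: similarity(new_doc, name) for name in tagged_docs}
    tags = set()
    for name, score in sims.items():
        if score > threshold:
            tags |= tagged_docs[name]
    if not tags and sims:
        best = max(sims, key=sims.get)
        tags = set(tagged_docs[best])
    return tags
```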
Hope that helps.
Good luck :)