How to automatically label a cluster of words using semantics?


Problem description

The context is: I already have clusters of words (phrases, actually) resulting from k-means applied to internet search queries, using common URLs in the search engine results as a distance measure (co-occurrence of URLs rather than words, if I simplify a lot).
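For readers who want a concrete picture of this setup, here is a minimal sketch, not the asker's actual pipeline: it assumes each query has already been mapped to the set of URLs the search engine returned for it (the `query_urls` dictionary below is invented for illustration), encodes each query as a binary vector over those URLs, and runs scikit-learn's KMeans on that representation as a rough stand-in for a URL co-occurrence distance.

```python
# Sketch only: cluster queries by the URLs their search results share.
# Assumption: a query -> result-URL mapping is already available.
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: each query mapped to the URLs returned for it.
query_urls = {
    "my husband attacked me": {"legalaid.org", "dvhelp.org"},
    "free lawyer": {"legalaid.org", "lawyers.com"},
    "he was arrested by the police": {"dvhelp.org", "police.gov"},
    "shelter for abused women": {"dvhelp.org", "shelters.org"},
}

queries = list(query_urls)
urls = sorted(set().union(*query_urls.values()))

# Binary query x URL matrix: 1 if the URL appears in the query's results.
X = np.array([[1 if u in query_urls[q] else 0 for u in urls] for q in queries])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for q, c in zip(queries, labels):
    print(c, q)
```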

I would like to automatically label the clusters using semantics; in other words, I'd like to extract the main concept surrounding a group of phrases considered together.

For example - sorry for the subject of my example - if I have the following bunch of queries: ['my husband attacked me', 'he was arrested by the police', 'the trial is still going on', 'my husband can go to jail for harassing me ?', 'free lawyer']. My study deals with domestic violence, but clearly this cluster is focused on the legal aspect of the problem, so the label could be "legal", for example.

I am new to NLP, but I should point out that I don't want to extract words using POS tagging (or at least that is not the expected final outcome, though it may be a necessary preliminary step).

I read about WordNet for word sense disambiguation and I think that might be a good track, but I don't want to calculate similarity between two queries (since the clusters are the input), nor obtain the definition of one selected word from the context provided by the whole bunch of words (which word would I select in that case?). I want to use the whole bunch of words to provide a context (maybe using synsets, or categorization via the XML structure of WordNet) and then summarize that context in one or a few words.
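One rough way to turn that idea into code, sketched here under several assumptions (whitespace tokenization, naively taking the first noun synset of each word, and an arbitrary depth threshold to discard overly generic concepts), is to use NLTK's WordNet interface, collect noun synsets for the words of all phrases in the cluster, and count the lowest common hypernyms across pairs; frequent, reasonably specific hypernyms become candidate labels. The phrase list is the example from the question; everything else is illustrative.

```python
# Sketch: candidate cluster labels via shared WordNet hypernyms.
# Assumes the WordNet corpus has been downloaded (nltk.download('wordnet')).
from collections import Counter
from itertools import combinations
from nltk.corpus import wordnet as wn

cluster = ['my husband attacked me', 'he was arrested by the police',
           'the trial is still going on',
           'my husband can go to jail for harassing me ?', 'free lawyer']

# Take the first noun synset of each word that has one (crude sense choice).
synsets = []
for phrase in cluster:
    for word in phrase.split():
        noun_senses = wn.synsets(word, pos=wn.NOUN)
        if noun_senses:
            synsets.append(noun_senses[0])

# Count lowest common hypernyms over all pairs of synsets,
# skipping very shallow (too generic) concepts such as entity.n.01.
hypernym_counts = Counter()
for a, b in combinations(synsets, 2):
    for h in a.lowest_common_hypernyms(b):
        if h.min_depth() >= 4:  # arbitrary depth threshold (assumption)
            hypernym_counts[h] += 1

# The most frequent shared hypernyms are candidate labels for the cluster.
for syn, count in hypernym_counts.most_common(5):
    print(count, syn.name(), syn.lemma_names())
```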

Any ideas? I can use R or Python; I have read a little about NLTK but I can't find a way to use it in my context.

Recommended answer

Your best bet is probably to label the clusters manually, especially if there are few of them. This is a difficult problem even for humans to solve, because you might need a domain expert. Anyone claiming they could do that automatically and reliably (except in some very limited domains) is probably running a startup and trying to get your business.

Also, going through the clusters yourself will have benefits. 1) You may discover you had the wrong number of clusters (the k parameter), or that there was too much junk in the input to begin with. 2) You will gain qualitative insight into what is being talked about and what topics there are in the data (which you probably can't know before looking at the data). Therefore, label manually if qualitative insight is what you are after. If you need quantitative results too, you could then train a classifier on the manually labelled topics to 1) predict topics for the rest of the clusters, or 2) for future use, if you repeat the clustering, get new data, ...
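If you do go the route of training a classifier on the manually labelled clusters, a minimal sketch with scikit-learn might look like the following; the label names and queries are invented for illustration, and any reasonable text classifier would do in place of this one.

```python
# Sketch: train a text classifier on manually labelled queries, then
# predict topics for new or unlabelled ones. Labels and data are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ['he was arrested by the police', 'free lawyer',
               'the trial is still going on',
               'shelter for abused women', 'emergency hotline number']
train_labels = ['legal', 'legal', 'legal', 'support', 'support']

# TF-IDF features plus a linear classifier: a simple, common baseline.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(train_texts, train_labels)

print(clf.predict(['can my husband go to jail',
                   'where can I find help tonight']))
```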
