Correlating word proximity

Problem description

Let's say I have a text transcript of a dialogue lasting approximately 1 hour. I want to know which words occur in close proximity to one another. What type of statistical technique would I use to determine which words are clustered together and how close they are to one another?

I suspect some sort of cluster analysis or PCA.

Solution

To determine word proximity, you will have to build a graph:

  1. each word is a vertex (or "node"), and
  2. each word is joined by an edge to the word on its left and the word on its right

So "I like dogs" would have 2 edges and 3 vertices.
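
As a rough illustration, the graph described above could be built in Python along these lines. This is only a sketch: the function name `build_proximity_graph`, the simple regex tokenizer, and the choice to treat edges as undirected and to ignore sentence boundaries are all my assumptions, not something the approach requires.

```python
from collections import Counter
import re


def build_proximity_graph(text):
    """Return (vertex_counts, edge_counts) for a piece of text.

    Vertices are lowercased words; an undirected edge is counted each
    time two different words appear next to each other.
    """
    # Assumption: a very simple tokenizer (letters and apostrophes only).
    words = re.findall(r"[a-z']+", text.lower())
    vertices = Counter(words)
    edges = Counter()
    for left, right in zip(words, words[1:]):
        if left != right:
            # Sort the pair so (a, b) and (b, a) count as the same edge.
            edges[tuple(sorted((left, right)))] += 1
    return vertices, edges


vertices, edges = build_proximity_graph("I like dogs")
print(len(vertices), "vertices,", len(edges), "edges")
```

Keeping the counts (rather than just the set of edges) matters for the later steps, because "closeness" is decided from how often an edge occurs relative to how often the words occur.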

The next step is to decide, based on this model, what your definition of "close" is.

This is where the statistics come in.

To determine "groups" of correlated words:

  1. MCL (Markov Cluster) clustering - This will give you a number of clusters whose words have a high probability of being seen together.

  2. K-means clustering - This will give you "k" groups of words, where you choose k in advance.

  3. Thresholding - this is the most reliable and intuitive method. Take a small subset of data that you understand (for example, a paragraph from a news clip or article you have read), run your method on it to generate a graph of all the relationships, and visualize the graph using a tool such as Graphviz or Cytoscape. Once you can see the relatedness, count how many edges are generally found between words that clearly cluster together. You might find, for example, that two words that cluster together share one edge for every 5 instances. Use this as a cutoff and write your own graph analysis script that outputs word pairs having at least 1 edge for every 5 instances of the word in your graph.

    1. Evaluating 3 with ROC curves. You can raise the cutoff step by step until you have very few "clusters". If you then run your algorithm against a paragraph with known, expected results (created by a human who already knows which words should be reported as correlated), you can evaluate how well your algorithm performs using a receiver operating characteristic (ROC) curve, which compares the correlated-words output at each cutoff to a precalculated gold standard.
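
To make the thresholding and evaluation steps a bit more concrete, here is a minimal sketch in Python. Several things in it are my own assumptions: the helper names `edge_ratios` and `roc_points`, the sample text, the hand-written gold standard, and in particular the reading of "1 edge for every 5 instances" as the edge count divided by the count of the more frequent word in the pair (so a cutoff of 0.2 corresponds to 1 in 5). Treat it as a starting point rather than the definitive method.

```python
from collections import Counter
from itertools import combinations
import re


def edge_ratios(text):
    """Return ({pair: edges per instance}, sorted vocabulary)."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    edges = Counter(
        tuple(sorted(pair))
        for pair in zip(words, words[1:])
        if pair[0] != pair[1]
    )
    # Assumption: normalize by the more frequent word of the pair.
    ratios = {
        pair: n / max(counts[pair[0]], counts[pair[1]])
        for pair, n in edges.items()
    }
    return ratios, sorted(counts)


def roc_points(ratios, vocabulary, gold_pairs, cutoffs):
    """Return (cutoff, false positive rate, true positive rate) tuples.

    Assumes the gold standard is non-empty and does not cover every pair.
    """
    all_pairs = set(combinations(vocabulary, 2))  # sorted tuples
    positives = {tuple(sorted(p)) for p in gold_pairs}
    negatives = all_pairs - positives
    points = []
    for cutoff in cutoffs:
        predicted = {p for p, r in ratios.items() if r >= cutoff}
        tpr = len(predicted & positives) / len(positives)
        fpr = len(predicted & negatives) / len(negatives)
        points.append((cutoff, fpr, tpr))
    return points


# Hypothetical sample text and gold standard, for illustration only.
text = "New York is big. I love New York because New York never sleeps."
gold = [("new", "york")]

ratios, vocabulary = edge_ratios(text)
points = roc_points(ratios, vocabulary, gold, cutoffs=[0.2, 0.5, 1.0])
for cutoff, fpr, tpr in points:
    print(f"cutoff={cutoff:.1f}  FPR={fpr:.3f}  TPR={tpr:.3f}")
```

Plotting the false positive rate against the true positive rate for each cutoff gives the ROC curve. One thing this sketch makes visible is that words occurring only once always score highly with their neighbours, so on real data you would probably also want a minimum word count before trusting a pair.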
