Word Clustering Project Basics

Question

I'm thinking about making word clustering my graduation project, but I don't know what knowledge I should have before starting.
I'm currently taking a pattern recognition and classification course at college.
Any help?
Thanks in advance.

Recommended answer

Why do so many people ask about "starting"? There are many ways to start, and many of them are good enough; which one you choose is not critical.
For example, you can start by learning the relevant part of mathematics: https://en.wikipedia.org/wiki/Cluster_analysis.

First, a very elementary mathematical point: to do clustering, the elements of the set you want to cluster should live in some space, and that space should have a correctly defined norm, making it a normed space: https://en.wikipedia.org/wiki/Normed_vector_space.

In simple words, you should have (and supply to the clustering algorithm) a function that calculates the distance between two objects (words, in your case). Different functions are possible, but the choice is not fully arbitrary: the function should satisfy the axioms of a metric space (non-negativity, zero distance only between identical objects, symmetry, and the triangle inequality). Only then can you apply cluster analysis to your set. The article referenced above is really just the starting point.
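To make the requirement on the distance function concrete, here is a minimal sketch in Python. It is not part of the original answer. It assumes Levenshtein edit distance as a stand-in distance between words (it compares spelling, not meaning), and the helper names `edit_distance` and `satisfies_metric_axioms`, as well as the sample word list, are hypothetical. The check runs over a finite sample, so it is a spot check rather than a proof.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def satisfies_metric_axioms(items, dist) -> bool:
    """Spot-check the metric space axioms for dist on a finite sample."""
    for x, y in product(items, repeat=2):
        d = dist(x, y)
        # Non-negativity, and zero distance only between identical objects.
        if d < 0 or (d == 0) != (x == y):
            return False
        # Symmetry.
        if d != dist(y, x):
            return False
    for x, y, z in product(items, repeat=3):
        # Triangle inequality.
        if dist(x, z) > dist(x, y) + dist(y, z):
            return False
    return True


# Hypothetical sample of words, chosen only for illustration.
words = ["cat", "cart", "care", "dog", "dig"]
print(satisfies_metric_axioms(words, edit_distance))

# A function that is NOT a metric on words: different words of equal
# length get distance 0, which violates the identity axiom.
length_gap = lambda a, b: abs(len(a) - len(b))
print(satisfies_metric_axioms(words, length_gap))
```

The first call passes the check and the second fails it, because "cat" and "dog" are different words with a length gap of zero. Any candidate distance for a clustering project could be run through the same kind of check before it is handed to a clustering algorithm.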

—SA


Mujeeba Haj Najeeb asked:



First of all, I need to understand why I should go for this at all.
I also need to know some real-life applications to get a more concrete picture.
I need to know where I'm going; I need a general perspective to decide whether or not to enter this field.

Thank you for your clarification; fair enough.

It's hard to cover every application of this kind of analysis, but, generally, it is used in some fields of linguistics, in particular natural language processing:
http://en.wikipedia.org/wiki/Natural_language_processing.

It can be considered one of the many parts of computational linguistics:
http://en.wikipedia.org/wiki/Computational_linguistics.

See also:
http://en.wikipedia.org/wiki/Word-sense_induction,
http://en.wikipedia.org/wiki/Ambiguity.

Note that in the examples mentioned above, the norm (distance) itself (I discussed the role of the norm in Solution 1) is extremely complex: it should reflect semantic similarity, a very complex notion which is itself very hard to formalize. The distance values for some word set (thesaurus) can come from extensive statistical analysis, expert systems, and the like. This has nothing to do with the string comparison algorithms implemented in most libraries.
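As a rough illustration of distance values that come from statistical analysis rather than string comparison, here is a minimal Python sketch. It is not the answer's own method. It assumes a simple distributional approach: each word is represented by counts of the words it shares a sentence with, and the distance is the Euclidean distance between length-normalized count vectors. The toy corpus and the names `context_vectors` and `distributional_distance` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus; a real project would use a large text collection.
corpus = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "the bank approved the loan",
    "the bank raised the interest rate",
]


def context_vectors(sentences):
    """Map each word to a Counter of the other words in its sentences."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            context = tokens[:i] + tokens[i + 1:]
            vectors.setdefault(word, Counter()).update(context)
    return vectors


def distributional_distance(u: Counter, v: Counter) -> float:
    """Euclidean distance between length-normalized count vectors.

    Assumes both vectors are non-empty (every word has some context).
    """
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    keys = set(u) | set(v)
    # Counter returns 0 for missing keys, so absent context words count as 0.
    return math.sqrt(sum((u[k] / norm_u - v[k] / norm_v) ** 2 for k in keys))


vectors = context_vectors(corpus)
d_cat_dog = distributional_distance(vectors["cat"], vectors["dog"])
d_cat_loan = distributional_distance(vectors["cat"], vectors["loan"])
print(d_cat_dog < d_cat_loan)
```

In this toy corpus "cat" comes out closer to "dog" than to "loan", because "cat" and "dog" share more context words. Note that two different words with identical contexts would get distance zero, so on words this is only a pseudo-metric, and it is far cruder than the semantic similarity the answer has in mind.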

See also: http://en.wikipedia.org/wiki/Word-sense_disambiguation.

Maybe there are other applications which I have never heard of and could only speculate about. For example, one of my friends specialized in computational linguistics and defended his dissertation on inferring individual characteristics of a writer based exclusively on statistics of the words found in text samples.

I must say that computational linguistics is a developing branch of science which is not yet really close to, say, serious commercial use. I feel that the major breakthrough works lie in the future, which might look attractive to newcomers. I believe every serious work in this field is on the cutting edge of both linguistics and applied mathematics (and maybe even "fundamental" mathematics). If you want to go for it (and it's good that you asked this question), you need to have a solid mathematical background and seriously go into linguistics, which is really hard to do. I don't think that being "just a programmer", a member of the technical staff, can be practical or reasonable here. Getting into real science is the only thing that makes sense, but this is not for everyone, so try to be realistic. I hope you won't take my words as discouragement. I would be more than happy to know that you took this route and were successful.

—SA

