How to implement category-based text tagging using WordNet or WordNet-related resources?


Problem description

How can I tag text by each word's category using WordNet (with Java as the interface)?

Example

Consider the sentences:

1) Computers need a keyboard, monitor, and CPU to work.
2) An automobile uses gears and a clutch.

My objective is for the example sentences to be tagged as follows:

  • 1st sentence

    computer/electronic
    keyboard/electronic
    CPU/electronic

  • 2nd sentence

    automobile/mechanical
    gears/mechanical
    clutch/mechanical

Some extra examples:

"Clutch and gear is monitored using microchip" -> clutch/mechanical, gear/mechanical, microchip/electronic

"software used here to monitor hydrogen levels" -> software/computer, hydrogen/chemistry

I want to implement the above objective in Java, that is, to tag nouns with their related category, such as technical, mechanical, electrical, etc.

How can I do this using WordNet?

My previous work

To achieve my objective, I created an index of terms in a text file for each category and matched it against a title. If the title contains a word from a category's text file, the title is classified under that category.

For example:

Automobile.txt has car, gear, wheel, clutch.
networking.txt has server, IP Address, TCP, RIP.

This is the Algorithm:

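// Classifies a title by checking it against each category's term file.
// compareWordsFrom(file, title) is a helper (not shown) that returns true
// when the title contains a word listed in the given file.
String classify(String title)
{
    String area = null; // stays null when no category matches
    // No else branches: when several files match, the last match wins.
    if (compareWordsFrom("Automobile.txt", title)) area = "Auto";
    if (compareWordsFrom("networking.txt", title)) area = "Networking";
    if (compareWordsFrom("metels.txt", title)) area = "Metallurgy";
    return area;
}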
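The term-index approach described above can be sketched in a self-contained form as follows. This is only an illustration under stated assumptions: the class name TermIndexSketch and the in-memory INDEX map are hypothetical stand-ins for the text files, only single-word terms are handled (so "IP Address" is left out), and the first matching word decides the category.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory version of the per-category term files.
class TermIndexSketch {
    // category name -> lower-cased single-word terms (assumed sample data)
    static final Map<String, Set<String>> INDEX = new LinkedHashMap<>();
    static {
        INDEX.put("Auto", new HashSet<>(Arrays.asList("car", "gear", "wheel", "clutch")));
        INDEX.put("Networking", new HashSet<>(Arrays.asList("server", "tcp", "rip")));
    }

    // Returns the category of the first title word found in the index,
    // or null when no word of the title is indexed.
    static String classify(String title) {
        for (String token : title.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            for (Map.Entry<String, Set<String>> entry : INDEX.entrySet()) {
                if (entry.getValue().contains(token)) {
                    return entry.getKey();
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(classify("Clutch and gear problems"));
        System.out.println(classify("TCP server setup"));
        System.out.println(classify("Cooking pasta"));
    }
}
```

The sketch makes the limitation visible: every term has to be entered by hand, which is exactly the effort the question is trying to avoid.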

It is very difficult to find related words to build the index. The automobile field alone has 1000s of related terms, which are difficult to collect.

To be precise, building an index of terms manually is a heart-breaking process.

I have already tried Stanford NLP and OpenNLP, but they only do POS tagging, which does not satisfy my need.

My need
I need an automated way to do this. Can natural language processing techniques do it?

Some have suggested using a WordNet library, but how can I use it, given that it works like a dictionary? What I want is something like:

mechanical = {gear, turbine, engine, ...}
electronic = {microchip, RAM, ROM, ...}

Is there any word database available with the structure shown above?

Or is there a ready-made library available?

Solution

You need to categorize a bunch of nouns (e.g. "car", "gear") into predefined categories (e.g. "automobile"). Although named-entity recognition is the proper way of getting this done, it has its issues, the main one being gathering enough annotated data for training the system properly.

WordNet can help by establishing semantic similarity between nouns, thereby helping you select categories based on similarity scores. There are several ways of establishing similarity scores; two prominent ones are the path score and Wu & Palmer's score.

The basic idea is that similar terms are grouped under similar categories by an ontology (such as WordNet). Therefore, the distance between their categories in the category tree of the ontology will be shorter if they are closely related, and longer otherwise. Perhaps the simplest such score is the path-score:

PathScore(s1, s2) = 1/pathLength(s1, s2)

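where pathLength is the length of the shortest path between s1 and s2 in the aforementioned category tree, counted in nodes, so that two terms in the same synset have a pathLength of 1.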
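A minimal sketch of the path score on a toy tree may make the formula concrete. Everything here is an assumption for illustration: the class PathScoreSketch, the PARENT map, and the hand-made tree are hypothetical and are not real WordNet data, and pathLength is counted in nodes so that identical terms score 1.0.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration only: a hand-made hypernym tree, NOT real WordNet data.
class PathScoreSketch {
    // child -> parent (hypernym); the root "entity" has no entry.
    static final Map<String, String> PARENT = new HashMap<>();
    static {
        PARENT.put("artifact", "entity");
        PARENT.put("vehicle", "artifact");
        PARENT.put("car", "vehicle");
        PARENT.put("truck", "vehicle");
        PARENT.put("device", "artifact");
        PARENT.put("chip", "device");
    }

    // Nodes from a term up to the root, starting with the term itself.
    static List<String> ancestors(String term) {
        List<String> chain = new ArrayList<>();
        for (String t = term; t != null; t = PARENT.get(t)) {
            chain.add(t);
        }
        return chain;
    }

    // Number of nodes on the path between a and b (1 when a equals b).
    static int pathLength(String a, String b) {
        List<String> up = ancestors(a);
        List<String> other = ancestors(b);
        for (int i = 0; i < up.size(); i++) {
            int j = other.indexOf(up.get(i));
            if (j >= 0) {
                return i + j + 1; // edges on both sides, plus one to count nodes
            }
        }
        throw new IllegalArgumentException("no common ancestor");
    }

    // PathScore(s1, s2) = 1 / pathLength(s1, s2)
    static double pathScore(String a, String b) {
        return 1.0 / pathLength(a, b);
    }

    public static void main(String[] args) {
        System.out.println(pathScore("car", "car"));
        System.out.println(pathScore("car", "truck"));
        System.out.println(pathScore("car", "chip"));
    }
}
```

In this toy tree, car and truck are siblings under vehicle, while car and chip only meet at artifact, so the second pair gets the lower score.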

To illustrate:

PathScore(car, automobile) = 1.0;     // path score is always between 0 and 1
WuPalmerScore(car, automobile) = 1.0; // Wu & Palmer's score is always between 0 and 1

PathScore(engine, automobile) = 0.25;
WuPalmerScore(engine, automobile) = 0.88;

PathScore(microprocessor, automobile) = 0.09;
WuPalmerScore(microprocessor, automobile) = 0.58;

So, as you can see, terms that you want in the same category will usually have higher similarity scores. A good library for doing this is WordNet Similarity for Java (WS4J), which offers several similarity metrics for you to experiment with. It also has an online demo.
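Selecting a category by highest similarity score could look roughly like the sketch below. It is an illustration under assumptions, not the library's API: the class CategoryTaggerSketch, the toy tree, and the anchor word chosen for each category are all hypothetical, and the Wu & Palmer score is computed with its usual definition, 2 * depth(lcs) / (depth(s1) + depth(s2)), with the root at depth 1. A real implementation would obtain the scores from a WordNet similarity library instead.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy illustration only: the tree and the category anchor words are made up.
class CategoryTaggerSketch {
    // child -> parent (hypernym); the root "entity" has no entry.
    static final Map<String, String> PARENT = new HashMap<>();
    // category label -> anchor word in the tree (assumed mapping)
    static final Map<String, String> ANCHORS = new LinkedHashMap<>();
    static {
        PARENT.put("machine", "entity");
        PARENT.put("gear", "machine");
        PARENT.put("clutch", "machine");
        PARENT.put("electronics", "entity");
        PARENT.put("microchip", "electronics");
        PARENT.put("keyboard", "electronics");
        ANCHORS.put("mechanical", "machine");
        ANCHORS.put("electronic", "electronics");
    }

    // Nodes from a term up to the root; the list size is the term's depth.
    static List<String> ancestors(String term) {
        List<String> chain = new ArrayList<>();
        for (String t = term; t != null; t = PARENT.get(t)) {
            chain.add(t);
        }
        return chain;
    }

    // Wu & Palmer: 2 * depth(lcs) / (depth(a) + depth(b)), root depth = 1.
    static double wuPalmer(String a, String b) {
        List<String> up = ancestors(a);
        List<String> other = ancestors(b);
        for (String node : up) {
            if (other.contains(node)) { // lowest common ancestor
                int lcsDepth = ancestors(node).size();
                return 2.0 * lcsDepth / (up.size() + other.size());
            }
        }
        return 0.0; // no shared ancestor in the toy tree
    }

    // Picks the category whose anchor is most similar to the noun.
    static String tag(String noun) {
        String best = "unknown";
        double bestScore = 0.0;
        for (Map.Entry<String, String> e : ANCHORS.entrySet()) {
            double score = wuPalmer(noun, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        for (String noun : new String[] {"clutch", "gear", "microchip"}) {
            System.out.println(noun + "/" + tag(noun));
        }
    }
}
```

A word that is absent from the tree shares no ancestor with any anchor, so the sketch tags it as "unknown" rather than guessing a category.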

Caveat: WordNet will not perform well if you are trying to label proper nouns. For example, if you want Hyundai to be in the automobile category and Samsung in the electronics category, this won't help at all, simply because WordNet does not categorize these nouns. There are other ontologies built on top of WordNet that may help you in this scenario:

  • One such well-known ontology is Yago.
  • Using Wikipedia categories is another successful approach.
