Get WordNet's domain name for the specified word
Question
I know WordNet has a domain hierarchy, e.g. sport->football.
1) Is it possible to list all words related, for example, to the 'sport->football' sub-domain?
Expected response: goalkeeper, forward, penalty, ball, field, stadium, referee and so on.
2) Is it possible to get the domain name for a given word, e.g. 'goalkeeper'?
I need something like [sport->football; sport->hockey] or [football; hockey] or just 'football'.
This is for a document classification task.
Recommended answer
WordNet has a hypernym / hyponym hierarchy but that is not what you want here, as you can see when you look up goalkeeper:
from nltk.corpus import wordnet
s = wordnet.synsets('goalkeeper')[0]
s.hypernym_paths()
One of the results is:
[Synset('entity.n.01'),
Synset('physical_entity.n.01'),
Synset('causal_agent.n.01'),
Synset('person.n.01'),
Synset('contestant.n.01'),
Synset('athlete.n.01'),
Synset('soccer_player.n.01'),
Synset('goalkeeper.n.01')]
There are two methods called usage_domains() and topic_domains(), but they return an empty list for most words:
s = wordnet.synsets('football')[0]
s.topic_domains()
>>> []
s.usage_domains()
>>> []
The WordNet Domains project however could be what you are looking for. It offers a text file that contains the mapping between Princeton WordNet 2.0 synsets and their corresponding domains. You have to register your email address to get access to the data.
Then you can read in the file that corresponds to your WordNet version (they offer 2.0 and 3.2), for example with the anydbm module (Python 2; in Python 3 the module is named dbm):
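import anydbm

# Build a persistent offset -> domain lookup from the WordNet Domains file.
dbdomains = anydbm.open('dbdomains', 'c')
with open('wn-domains-2.0-20050210', 'r') as fh:
    for line in fh:
        offset, domain = line.split('\t')
        # Drop the trailing part-of-speech suffix (e.g. "-n") from the key.
        # The stored value keeps the line's trailing newline.
        dbdomains[offset[:-2]] = domain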
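For the first part of the question (listing everything that belongs to a domain), the same file can also be inverted. The following is only a minimal Python 3 sketch, assuming each line has the form "offset-pos<TAB>domain1 domain2 ...". The function names parse_domains and invert are hypothetical, and the sample lines are invented for illustration, not copied from the real file:

```python
from collections import defaultdict


def parse_domains(lines):
    """Map each 'offset-pos' key to its list of domain labels."""
    synset_to_domains = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        key, _, domains = line.partition("\t")
        # Assumption: several domains on one line are separated by spaces.
        synset_to_domains[key] = domains.split()
    return synset_to_domains


def invert(synset_to_domains):
    """Map each domain label to the synset keys assigned to it."""
    domain_to_synsets = defaultdict(list)
    for key, domains in synset_to_domains.items():
        for domain in domains:
            domain_to_synsets[domain].append(key)
    return dict(domain_to_synsets)


# Invented sample lines in the assumed format (not real WordNet Domains data).
SAMPLE = [
    "00000001-n\tfootball sport\n",
    "00000002-n\tlinguistics\n",
    "00000003-v\tfootball\n",
]

mapping = parse_domains(SAMPLE)
by_domain = invert(mapping)
print(by_domain["football"])
```

With the real file you would pass the open file handle instead of SAMPLE, and then turn the returned synset keys back into words through your WordNet interface.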
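You can then use the offset attribute of a synset to find out its domain (in NLTK 3 and later, offset is a method, so call offset() instead). The synset offset is an integer, while the keys taken from the file are zero-padded strings, so you may have to add a zero at the beginning: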
dbdomains.get('0' + str(wordnet.synsets('travel_guidebook')[0].offset))
>>> 'linguistics\n'
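Rather than prepending a fixed '0', the padding can be done for any offset length. This is a small sketch, assuming the keys are eight-digit zero-padded offsets as in the WordNet data files. The helper name domain_key is hypothetical, and the numbers below are arbitrary, not real synset offsets:

```python
def domain_key(offset, pos=None):
    """Format an integer synset offset as an eight-digit, zero-padded key.

    If pos is given (e.g. "n"), append it as "-n" to match keys that
    still carry the part-of-speech suffix.
    """
    key = f"{int(offset):08d}"
    return f"{key}-{pos}" if pos else key


# Arbitrary example numbers, for illustration only.
print(domain_key(6421301))
print(domain_key(1740, "n"))
```

The first form matches keys stored without the suffix, as in the loop above that strips the last two characters. The second matches the unmodified keys from the file.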