Why does FastText not handle finding multi-word phrases?

Question

The FastText pre-trained model works great for finding similar words:

from pyfasttext import FastText
model = FastText('cc.en.300.bin')
model.nearest_neighbors('dog', k=2000)

[('dogs', 0.8463464975357056),
 ('puppy', 0.7873005270957947),
 ('pup', 0.7692237496376038),
 ('canine', 0.7435278296470642),
 ...

However, it seems to fail for multi-word phrases, e.g.:

model.nearest_neighbors('Gone with the Wind', k=2000)

[('DEky4M0BSpUOTPnSpkuL5I0GTSnRI4jMepcaFAoxIoFnX5kmJQk1aYvr2odGBAAIfkECQoABAAsCQAAABAAEgAACGcAARAYSLCgQQEABBokkFAhAQEQHQ4EMKCiQogRCVKsOOAiRocbLQ7EmJEhR4cfEWoUOTFhRIUNE44kGZOjSIQfG9rsyDCnzp0AaMYMyfNjS6JFZWpEKlDiUqALJ0KNatKmU4NDBwYEACH5BAUKAAQALAkAAAAQABIAAAhpAAEQGEiQIICDBAUgLEgAwICHAgkImBhxoMOHAyJOpGgQY8aBGxV2hJgwZMWLFTcCUIjwoEuLBym69PgxJMuDNAUqVDkz50qZLi',
  0.71047443151474),

model.nearest_neighbors('Star Wars', k=2000)
[('clockHauser', 0.5432934761047363),
 ('CrônicasEsdrasNeemiasEsterJóSalmosProvérbiosEclesiastesCânticosIsaíasJeremiasLamentaçõesEzequielDanielOséiasJoelAmósObadiasJonasMiquéiasNaumHabacuqueSofoniasAgeuZacariasMalaquiasNovo',
  0.5197194218635559),

Is this a limitation of the FastText pre-trained models?

Recommended answer

As described in the documentation, the official fastText unsupervised embeddings are built after a tokenization phase, in which the words are separated.

If you look at your model's vocabulary (model.words in the official Python binding), you won't find multi-word phrases containing spaces.
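One way to check this is to scan the vocabulary for entries that contain whitespace. The following is a minimal sketch, not part of the original answer: the helper name phrases_with_spaces is hypothetical, and the toy list only stands in for model.words. The commented-out lines assume the official fasttext Python binding is installed and that cc.en.300.bin is available locally.

```python
def phrases_with_spaces(words):
    """Return the vocabulary entries that contain any whitespace character."""
    return [w for w in words if any(ch.isspace() for ch in w)]

# Toy vocabulary standing in for model.words (illustrative only).
toy_vocab = ["dog", "dogs", "Star", "Wars", "Gone", "with", "the", "Wind"]
print(phrases_with_spaces(toy_vocab))

# With the real model (assumption: official binding installed, file present):
# import fasttext
# model = fasttext.load_model("cc.en.300.bin")
# print(phrases_with_spaces(model.words))
```

If the answer's description holds, running the helper on the real vocabulary should return an empty list, because tokenization split every phrase into single words before training.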

Therefore, as pointed out by gojomo, the vectors generated for such phrases are synthetic, artificial and noisy; you can infer this from the results of your queries.
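To see where such a synthetic vector comes from, it helps to look at the character n-grams a query string breaks into. For a string that is not in the vocabulary, fastText wraps it in < and > and averages the vectors of its character n-grams. The sketch below is a simplified pure-Python illustration, not the library's code: the function name char_ngrams is hypothetical, it skips fastText's hashing of n-grams into buckets and its UTF-8 byte handling, and the n-gram length of 5 is an assumption taken from the settings published for the cc.*.300 models.

```python
def char_ngrams(text, minn=5, maxn=5):
    """List the character n-grams of text after adding '<' and '>' boundary markers."""
    wrapped = f"<{text}>"
    return [
        wrapped[i:i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(wrapped) - n + 1)
    ]

for gram in char_ngrams("Star Wars"):
    # An n-gram containing a space cannot come from a whitespace-tokenized word.
    marker = "  <- contains a space" if " " in gram else ""
    print(repr(gram) + marker)
```

Under these assumptions, most of the n-grams of 'Star Wars' span the space, so no tokenized training word could have produced them. Whatever their hash buckets hold was presumably shaped by unrelated n-grams or left near its initial value, which would be consistent with the junk neighbors in the question.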

In essence, the official fastText embeddings are not suitable for this task. In my experience, this does not depend on the version or wrapper used.
