Why does FastText not handle finding multi-word phrases?
Question
The FastText pre-trained model works great for finding similar words:
from pyfasttext import FastText
model = FastText('cc.en.300.bin')
model.nearest_neighbors('dog', k=2000)
[('dogs', 0.8463464975357056),
('puppy', 0.7873005270957947),
('pup', 0.7692237496376038),
('canine', 0.7435278296470642),
...
However, it seems to fail for multi-word phrases, e.g.:
model.nearest_neighbors('Gone with the Wind', k=2000)
[('DEky4M0BSpUOTPnSpkuL5I0GTSnRI4jMepcaFAoxIoFnX5kmJQk1aYvr2odGBAAIfkECQoABAAsCQAAABAAEgAACGcAARAYSLCgQQEABBokkFAhAQEQHQ4EMKCiQogRCVKsOOAiRocbLQ7EmJEhR4cfEWoUOTFhRIUNE44kGZOjSIQfG9rsyDCnzp0AaMYMyfNjS6JFZWpEKlDiUqALJ0KNatKmU4NDBwYEACH5BAUKAAQALAkAAAAQABIAAAhpAAEQGEiQIICDBAUgLEgAwICHAgkImBhxoMOHAyJOpGgQY8aBGxV2hJgwZMWLFTcCUIjwoEuLBym69PgxJMuDNAUqVDkz50qZLi',
0.71047443151474),
or
model.nearest_neighbors('Star Wars', k=2000)
[('clockHauser', 0.5432934761047363),
('CrônicasEsdrasNeemiasEsterJóSalmosProvérbiosEclesiastesCânticosIsaíasJeremiasLamentaçõesEzequielDanielOséiasJoelAmósObadiasJonasMiquéiasNaumHabacuqueSofoniasAgeuZacariasMalaquiasNovo',
0.5197194218635559),
Is this a limitation of the FastText pre-trained models?
Recommended answer
As described in the documentation, the official fastText unsupervised embeddings are built after a tokenization phase in which the words are separated.
If you look at your model vocabulary (model.words in the official Python binding), you won't find multi-word phrases containing spaces.
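The vocabulary check described above can be sketched as follows. This is only an illustration: the helper name entries_with_spaces and the toy vocabulary are hypothetical stand-ins, so that the snippet runs without downloading the model. With the official binding, one would presumably pass model.words instead, after loading the model with fasttext.load_model('cc.en.300.bin').

```python
def entries_with_spaces(vocabulary):
    """Return the vocabulary entries that contain at least one space."""
    return [word for word in vocabulary if " " in word]


# Hypothetical toy vocabulary standing in for model.words.
# A tokenized training corpus yields single tokens only, never phrases.
toy_vocabulary = ["dog", "dogs", "puppy", "Gone", "with", "the", "Wind", "Star", "Wars"]

# No entry contains a space, so the list of multi-word entries is empty.
print(entries_with_spaces(toy_vocabulary))

# The phrase itself is therefore out of vocabulary, even though each
# of its individual words is present.
print("Star Wars" in set(toy_vocabulary))
```

If the same check on the real vocabulary also returns an empty list, any phrase containing a space is necessarily an out-of-vocabulary query.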
Therefore, as gojomo pointed out, the vector generated for a multi-word query is synthetic, artificial and noisy; you can infer this from the results of your queries.
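To illustrate why such a vector is synthetic, here is a rough sketch of how an out-of-vocabulary string can be broken into character n-grams, whose vectors fastText then combines into a single vector. It assumes that the whole phrase reaches the model as one token and that the model uses n-grams of length 5, as stated for the cc.*.300 models. The function char_ngrams is a hypothetical simplification, not the library's code: the real implementation works on UTF-8 bytes and hashes each n-gram into a bucket.

```python
def char_ngrams(token, minn=5, maxn=5):
    """Simplified character n-gram extraction with '<' and '>' boundary markers.

    Assumption: minn = maxn = 5 mirrors the setting documented for the
    pre-trained cc.*.300 models; other models may use different values.
    """
    padded = f"<{token}>"
    ngrams = []
    for start in range(len(padded)):
        for n in range(minn, maxn + 1):
            end = start + n
            if end > len(padded):
                break  # longer n-grams would run past the end of the token
            ngrams.append(padded[start:end])
    return ngrams


# The phrase is treated as one unseen token, so its vector can only be
# built from fragments like these, several of which straddle the space.
for ngram in char_ngrams("Star Wars"):
    print(repr(ngram))
```

Fragments such as 'tar W' or 'r War' are unlikely to have been seen often inside real tokens, which is consistent with the noisy neighbours returned in the question.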
In essence, the official fastText embeddings are not suitable for this task. In my experience, this does not depend on the version or wrapper used.