Bytes vs Characters vs Words - which granularity for n-grams?

Question

At least three types of n-grams can be considered for representing text documents (a short sketch follows the list):

  • Byte-level n-grams
  • Character-level n-grams
  • Word-level n-grams
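To make the three granularities concrete, here is a minimal sketch of my own (not part of the original question); the `ngrams` helper and the choice of trigrams/bigrams are arbitrary:

```python
# Minimal sketch of the three n-gram granularities over the same string.
# The ngrams() helper is illustrative, not from the original post.

def ngrams(seq, n):
    """Contiguous n-grams over any sliceable sequence (str, bytes, list)."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

text = "Mary loves dogs"

byte_trigrams = ngrams(text.encode("utf-8"), 3)  # byte-level
char_trigrams = ngrams(text, 3)                  # character-level
word_bigrams  = ngrams(text.split(), 2)          # word-level

print(byte_trigrams[:3])  # [b'Mar', b'ary', b'ry ']
print(char_trigrams[:3])  # ['Mar', 'ary', 'ry ']
print(word_bigrams)       # [['Mary', 'loves'], ['loves', 'dogs']]
```

For pure-ASCII text the byte-level and character-level trigrams coincide; they only diverge once the text contains multi-byte UTF-8 characters, which is the distinction the answer below draws.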

It's unclear to me which one should be used for a given task (clustering, classification, etc.). I read somewhere that character-level n-grams are preferred to word-level n-grams when the text contains typos, so that "Mary loves dogs" remains similar to "Mary lpves dogs".
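As a quick check on that claim (my own sketch, not from the post), the character-trigram sets of the two sentences overlap heavily, whereas at the word level the tokens "loves" and "lpves" simply don't match:

```python
# Compare the two sentences from the question via the Jaccard overlap of
# their character trigram sets (illustrative helpers, not from the post).

def char_ngram_set(text, n=3):
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

correct = char_ngram_set("Mary loves dogs")
typoed  = char_ngram_set("Mary lpves dogs")

print(jaccard(correct, typoed))  # 0.625 -- only the trigrams touching the typo differ
```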

Are there other criteria to consider when choosing the "right" representation?

Answer

Evaluate. The criterion for choosing the representation is whatever works.

Indeed, the character level (not the same as bytes, unless you only care about English) is probably the most common representation, because it is robust to spelling differences (which need not be errors; spelling changes if you look at history). So for spelling-correction purposes, this works well.
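A small illustration of the "not the same as bytes" point (my own assumed example, not from the answer): a non-ASCII character occupies more than one UTF-8 byte, so byte-level and character-level n-grams stop coinciding:

```python
# Byte-level and character-level bigrams diverge as soon as the text
# contains multi-byte UTF-8 characters (illustrative example).

word = "naïve"                    # 5 characters
raw  = word.encode("utf-8")       # 6 bytes: 'ï' is encoded as two bytes

char_bigrams = [word[i:i + 2] for i in range(len(word) - 1)]
byte_bigrams = [raw[i:i + 2] for i in range(len(raw) - 1)]

print(char_bigrams)  # ['na', 'aï', 'ïv', 've']
print(byte_bigrams)  # [b'na', b'a\xc3', b'\xc3\xaf', b'\xafv', b've']
```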

On the other hand, the Google Books n-gram viewer uses word-level n-grams on its books corpus, because the goal there is not to analyze spelling but term usage over time; e.g. "child care", where the individual words are not as interesting as their combination. This was also shown to be very useful in machine translation, often referred to as the "refrigerator magnet model".
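To illustrate the word-level side (a toy example I made up, not Google's actual pipeline), counting word bigrams over a tiny corpus surfaces multi-word terms such as "child care" that unigram counts would miss:

```python
# Word-level bigram counts over a toy corpus (made-up sentences).
from collections import Counter

corpus = [
    "affordable child care remains scarce",
    "child care centers expanded last year",
    "the care of the child is shared",
]

bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    bigram_counts.update(zip(tokens, tokens[1:]))

print(bigram_counts.most_common(2))
# [(('child', 'care'), 2), ...] -- the combination carries the signal,
# even though "child" and "care" also occur far apart in the last sentence.
```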

If you are not processing international text, bytes may be meaningful, too.
