How to treat numbers inside text strings when vectorizing words?


Question

If I have a text string to be vectorized, how should I handle the numbers inside it? Or, if I feed a neural network with both numbers and words, how can I keep the numbers as numbers?

I am planning to build a dictionary of all my words (as suggested here). In that case every string becomes an array of numbers. How should I handle characters that are themselves numbers? How can I output a vector that does not mix up word indices with the numeric characters?

Does converting numbers to strings weaken the information I feed to the network?

Recommended answer

Expanding on your discussion with @user1735003, let's consider two ways of representing numbers:

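  1. Treat the number as a string, that is, as just another word, and assign it an ID when building the dictionary, or
  2. Convert the number to an actual word: "1" becomes "one", "2" becomes "two", and so on.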
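The two options above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`tokenize`, `build_vocab`, `spell_out_digits`), the regular expression, and the single-digit lookup table are hypothetical and are not part of the original answer.

```python
import re

# Hypothetical lookup table for option 2; it only covers single digits.
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def tokenize(text):
    """Lowercase the text and split it into word tokens and number tokens."""
    return re.findall(r"[a-z]+|\d+", text.lower())

def build_vocab(token_lists):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def spell_out_digits(tokens):
    """Option 2: replace single-digit tokens with their English word."""
    return [DIGIT_WORDS.get(tok, tok) for tok in tokens]

tokens = tokenize("I bought 2 apples and two pears")

# Option 1: the number stays a string token and gets its own ID.
vocab_1 = build_vocab([tokens])
ids_1 = [vocab_1[tok] for tok in tokens]

# Option 2: "2" is rewritten to "two" before the dictionary is built,
# so both occurrences now share one ID.
tokens_2 = spell_out_digits(tokens)
vocab_2 = build_vocab([tokens_2])
ids_2 = [vocab_2[tok] for tok in tokens_2]
```

With option 1, "2" and "two" keep separate dictionary entries; with option 2 they collapse into a single entry, which is exactly the change in representation under discussion.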

Does the second option change the context in any way? To check, we can compute the similarity of the two representations using word2vec. The score will be high if the two tokens are used in similar contexts.

For example, 1 and one have a similarity score of 0.17, and 2 and two have a similarity score of 0.23. These low scores suggest that the contexts in which the digit and the spelled-out word are used are quite different.
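Similarity between word vectors is typically measured as cosine similarity. The sketch below shows that computation in plain Python; the function name `cosine_similarity` and the toy vectors are assumptions for illustration and do not reproduce the scores quoted above, which would require vectors from a trained word2vec model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Similarity is undefined for a zero vector; return 0.0 by convention here.
        return 0.0
    return dot / (norm_u * norm_v)

# Toy vectors, made up for illustration only (not real embeddings).
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

In practice the two vectors would be the embeddings of tokens such as "1" and "one". With gensim, for instance, a call along the lines of `model.wv.similarity("1", "one")` computes the same quantity, assuming both tokens are in the model's vocabulary.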

By treating a number as just another word, you do not change its context, whereas with any other transformation of the numbers you cannot guarantee that the result is better. So it is better to leave numbers untouched and treat them as ordinary words.

Note: Both word2vec and GloVe were trained by treating numbers as strings (case 1).
