Convert nl string to vector or some numeric equivalent


Question

I'm trying to convert a string to a numeric equivalent so I can train a neural network to classify strings. I tried summing the ASCII values, but that only yields larger numbers for longer strings and smaller numbers for shorter ones.
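To make the limitation concrete, here is a minimal sketch of the ASCII-sum approach described above. The function name `asciiSum` is made up for illustration; it is not from the original post.

```javascript
// Hypothetical helper: adds up the character codes of a string.
function asciiSum(str) {
  let sum = 0;
  for (let i = 0; i < str.length; i++) {
    sum += str.charCodeAt(i);
  }
  return sum;
}

// Different strings can share a sum, and the sum mostly tracks length.
const sumAd = asciiSum("ad");   // 97 + 100
const sumBc = asciiSum("bc");   // 98 + 99
const sumShort = asciiSum("aa");
const sumLong = asciiSum("aaa");
console.log(sumAd === sumBc, sumShort < sumLong);
```

The sum throws away letter order and letter identity, so what remains is dominated by string length.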

For example, a short German string might be put into the English class because the English words the network was trained on are short and numerically small.

I was looking into Google's word2vec, which seems like it should work, but I want to do this on the client side. I found a node.js implementation, but it just runs the command-line tool.

How can I convert a string to something numeric, perhaps a vector, in JS?

Recommended answer

I'm sure you've considered assigning an integer to each new word you encounter. You'll have to keep track of the mapping somewhere, but that's one option.
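A minimal sketch of that bookkeeping, assuming whitespace-separated words and case-insensitive matching (both are my assumptions, and `createVocabulary` is a made-up name):

```javascript
// Hypothetical helper: gives each distinct word the next unused integer.
function createVocabulary() {
  const ids = new Map(); // word -> integer id, in order of first appearance
  return {
    // Return the id for a word, assigning a new one on first sight.
    idFor(word) {
      if (!ids.has(word)) ids.set(word, ids.size);
      return ids.get(word);
    },
    // Split text on whitespace and map every word to its id.
    encode(text) {
      return text
        .toLowerCase()
        .split(/\s+/)
        .filter(Boolean)
        .map((w) => this.idFor(w));
    },
    get size() {
      return ids.size;
    },
  };
}

const vocab = createVocabulary();
const encoded = vocab.encode("the cat saw the dog");
console.log(encoded); // repeated words reuse the same id
```

The ids carry no meaning beyond identity, so the same `vocab` object has to be kept around and reused for both training and classification.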

You could also use whatever built-in hash method JS has.
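In case no suitable built-in turns up, a small hand-rolled hash works too. The sketch below uses 32-bit FNV-1a, which is my choice and not something the answer specifies. It hashes UTF-16 code units, so it matches the standard byte-oriented FNV-1a only for ASCII input.

```javascript
// Hypothetical helper: 32-bit FNV-1a over the string's UTF-16 code units.
function fnv1a32(str) {
  let hash = 0x811c9dc5; // FNV offset basis (32-bit)
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // multiply by the FNV prime, wrapping at 32 bits
  }
  return hash >>> 0; // convert to an unsigned 32-bit integer
}

console.log(fnv1a32("hello"), fnv1a32("hallo"));
```

Unlike a vocabulary table, a hash needs no stored mapping, at the cost of occasional collisions.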

If you don't mind a few hash collisions, and the size of the resulting integers doesn't matter, may I recommend a trick I've used a few times before.

  • Assign each letter a prime number based on its frequency:

So, e = 2, t = 3, a = 5, etc., which gives us:

2       e
3       t
5       a
7       o
11      i
13      n
17      s
19      h
23      r
29      d
31      l
37      c
41      u
43      m
47      w
53      f
59      g
61      y
67      p
71      b
73      v   
79      k
83      j
89      x
97      q
101     z

  • Multiply together the values corresponding to each letter in the word.

    So, "value" is 73*5*31*41*2. "corresponding" is 37*7*23*23.... Each distinct multiset of letters gives a distinct product, because prime factorizations are unique. It collides for anagrams, so we've accidentally built an anagram detector.

    There isn't really a linguistically sound way to do this, though. I suspect word2vec just assigns arbitrary integers to strings.
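The prime-product trick above can be sketched roughly as follows. The letter order comes from the table in the answer; the use of `BigInt` (to avoid overflow on long words), the decision to skip non-letter characters, and the name `primeProduct` are my assumptions.

```javascript
// Letters in the frequency order used in the table, paired with the first 26 primes.
const LETTERS_BY_FREQUENCY = "etaoinshrdlcumwfgypbvkjxqz";
const PRIMES = [
  2n, 3n, 5n, 7n, 11n, 13n, 17n, 19n, 23n, 29n, 31n, 37n, 41n,
  43n, 47n, 53n, 59n, 61n, 67n, 71n, 73n, 79n, 83n, 89n, 97n, 101n,
];

// letter -> prime lookup table
const PRIME_FOR_LETTER = new Map(
  [...LETTERS_BY_FREQUENCY].map((letter, i) => [letter, PRIMES[i]])
);

// Hypothetical helper: multiplies the primes of every letter in the word.
// Characters outside a-z are ignored (an assumption, not part of the answer).
function primeProduct(word) {
  let product = 1n;
  for (const ch of word.toLowerCase()) {
    const prime = PRIME_FOR_LETTER.get(ch);
    if (prime !== undefined) product *= prime;
  }
  return product;
}

console.log(primeProduct("value")); // 73 * 5 * 31 * 41 * 2
console.log(primeProduct("listen") === primeProduct("silent")); // anagrams collide
```

The products grow very quickly with word length, so they would likely need scaling or some other encoding before being fed to a network.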
