Tensorflow.js tokenizer


Question

I'm new to Machine Learning and Tensorflow. Since I don't know Python, I decided to use the JavaScript version (which may be more like a wrapper).

The problem is that I am trying to build a model that processes natural language. The first step is to tokenize the text in order to feed the data to the model. I did a lot of research, but most of what I found uses the Python version of Tensorflow, with a method like tf.keras.preprocessing.text.Tokenizer, for which I can't find an equivalent in tensorflow.js. I'm stuck at this step and don't know how to convert text into vectors that can be fed to the model. Please help :)

Recommended answer

There are many ways to transform text into vectors, and the right one depends on the use case. The most intuitive one uses term frequency: given the vocabulary of the corpus (all the possible words), each text document is represented as a vector in which each entry is the number of occurrences of the corresponding vocabulary word in that document.

Given this vocabulary:

["machine", "learning", "is", "a", "new", "field", "in", "computer", "science"]

the following text:

["machine", "is", "a", "field", "machine", "is", "is"] 

will be transformed into this vector:

[2, 0, 3, 1, 0, 1, 0, 0, 0] 
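
The example above starts from text that has already been split into words. As a minimal sketch of how raw strings might be turned into such word lists and into a vocabulary, the helpers below (tokenize and buildVocabulary are hypothetical names, not part of tensorflow.js) assume that lowercasing and splitting on non-alphanumeric ASCII characters is good enough; real text, especially non-English text, usually needs something more careful.

```javascript
// Hypothetical helpers for illustration; they are not part of tensorflow.js.

// Lowercase the input and split it on anything that is not an
// ASCII letter, digit or apostrophe, dropping empty pieces.
const tokenize = (str) =>
  str.toLowerCase().split(/[^a-z0-9']+/).filter((token) => token.length > 0);

// Collect the distinct tokens of all documents, in order of first appearance.
const buildVocabulary = (documents) => [...new Set(documents.flatMap(tokenize))];

const documents = [
  "Machine learning is a new field in computer science",
  "Machine is a field, machine is is",
];

const vocabulary = buildVocabulary(documents);
console.log(vocabulary);
console.log(tokenize(documents[1]));
```

With these two sample sentences, the vocabulary and the tokens of the second sentence come out the same as the hand-written arrays used in this answer.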

One disadvantage of this technique is that the vector, which has the same size as the vocabulary of the corpus, might contain lots of zeros. That is why there are other techniques. However, the bag-of-words approach is still commonly referred to, and there is a slightly different version of it that uses tf-idf.

// Count how many times each vocabulary word occurs in the tokenized text.
const vocabulary = ["machine", "learning", "is", "a", "new", "field", "in", "computer", "science"];
const text = ["machine", "is", "a", "field", "machine", "is", "is"];
const parse = (t) => vocabulary.map((w) => t.reduce((count, token) => (token === w ? count + 1 : count), 0));
console.log(parse(text)); // [2, 0, 3, 1, 0, 1, 0, 0, 0]
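
For the tf-idf variant mentioned earlier, here is a rough sketch rather than a definitive implementation. The function names are hypothetical, and the idf formula used, ln((1 + N) / (1 + df)) + 1, is only one common smoothed variant; other definitions exist and give different numbers. The second document in the corpus is made up for the example.

```javascript
// Hypothetical helpers for illustration; they are not part of tensorflow.js.

// Term frequency: occurrences of each vocabulary word in one tokenized document.
const termFrequency = (vocabulary, tokens) =>
  vocabulary.map((word) => tokens.filter((token) => token === word).length);

// Smoothed inverse document frequency (an assumed variant):
// ln((1 + N) / (1 + df)) + 1, where N is the number of documents
// and df is the number of documents that contain the word.
const inverseDocumentFrequency = (vocabulary, corpus) =>
  vocabulary.map((word) => {
    const df = corpus.filter((tokens) => tokens.includes(word)).length;
    return Math.log((1 + corpus.length) / (1 + df)) + 1;
  });

// One tf-idf vector per document in the corpus.
const tfidf = (vocabulary, corpus) => {
  const idf = inverseDocumentFrequency(vocabulary, corpus);
  return corpus.map((tokens) =>
    termFrequency(vocabulary, tokens).map((tf, i) => tf * idf[i])
  );
};

const vocabulary = ["machine", "learning", "is", "a", "new", "field", "in", "computer", "science"];
const corpus = [
  ["machine", "is", "a", "field", "machine", "is", "is"],
  ["machine", "learning", "is", "a", "new", "field", "in", "computer", "science"],
];

console.log(tfidf(vocabulary, corpus));
```

Words that appear in every document keep their raw count, while words that appear in fewer documents are weighted up. Each row is a plain array of numbers, so, assuming the tensorflow.js package is loaded as tf, the result could be wrapped with tf.tensor2d before being fed to a model.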

There is also the following module that might help you achieve what you want
