Spark: StringIndexer on sentences

Question

I am trying to do something like StringIndexer on a column of sentences, i.e. transform a list of words into a list of integers.

For example:

Input dataset:

  (1, ["I", "like", "Spark"])
  (2, ["I", "hate", "Spark"])

I expect the output after StringIndexer to look like:

  (1, [0, 2, 1])
  (2, [0, 3, 1])

Ideally, I would like to make such a transformation part of a Pipeline, so that I can chain a couple of transformers together and serialize the model for online serving.

Is this something Spark supports natively?

Thanks!

Answer

The standard Transformers used for converting text to features are CountVectorizer:

CountVectorizer and CountVectorizerModel aim to help convert a collection of text documents to vectors of token counts.
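As a quick, minimal sketch (assuming a DataFrame df with an array column words, like the one defined in the example further down; the vocabSize and minDF values are arbitrary):

import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}

val cv = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setVocabSize(1000) // keep at most 1000 distinct terms
  .setMinDF(1.0)      // a term must appear in at least 1 document

// CountVectorizer is an Estimator: the vocabulary is learned with fit
val cvModel: CountVectorizerModel = cv.fit(df)
cvModel.transform(df).show(false)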

HashingTF:

Maps a sequence of terms to their term frequencies using the hashing trick. Currently we use Austin Appleby's MurmurHash 3 algorithm (MurmurHash3_x86_32) to calculate the hash code value for the term object. Since a simple modulo is used to transform the hash function to a column index, it is advisable to use a power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to the columns.
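A corresponding sketch, with numFeatures set to a power of two per the recommendation above:

import org.apache.spark.ml.feature.HashingTF

// HashingTF is a plain Transformer: no fit step, no stored vocabulary
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("features")
  .setNumFeatures(1 << 10) // 1024, a power of two

hashingTF.transform(df).show(false)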

Both have a binary option that can be used to switch from counts to a binary vector.
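For example, assuming the same words column as above:

// With binary = true each feature is 1.0 if the term occurs at all,
// rather than its raw count
val binaryCv = new CountVectorizer()
  .setInputCol("words").setOutputCol("features").setBinary(true)
val binaryTf = new HashingTF()
  .setInputCol("words").setOutputCol("features").setBinary(true)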

There is no built-in Transformer that can give the exact result you want (it wouldn't be useful for ML algorithms anyway), but you can explode, apply StringIndexer, and aggregate with collect_list / collect_set:

import org.apache.spark.ml.feature._
import org.apache.spark.ml.Pipeline
// Required for the implicit toDF conversion on local collections
import spark.implicits._

val df = Seq(
  (1, Array("I", "like", "Spark")), (2, Array("I", "hate", "Spark"))
).toDF("id", "words")

val pipeline = new Pipeline().setStages(Array(
  // Flatten each sentence into one row per (id, word) pair
  new SQLTransformer()
    .setStatement("SELECT id, explode(words) AS word FROM __THIS__"),
  // Assign a numeric index to every distinct word
  new StringIndexer().setInputCol("word").setOutputCol("index"),
  // Re-assemble the indices for each sentence
  new SQLTransformer()
    .setStatement("""SELECT id, COLLECT_SET(index) AS values 
                     FROM __THIS__ GROUP BY id""")
))

pipeline.fit(df).transform(df).show

// +---+---------------+                      
// | id|         values|
// +---+---------------+
// |  1|[0.0, 1.0, 3.0]|
// |  2|[2.0, 0.0, 1.0]|
// +---+---------------+
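Since the question asks about serialization for online serving: every stage in this pipeline is writable, so the fitted model can be saved and reloaded. A minimal sketch, using a hypothetical path:

import org.apache.spark.ml.PipelineModel

val model = pipeline.fit(df)
model.write.overwrite().save("/tmp/indexer-pipeline") // hypothetical path
val reloaded = PipelineModel.load("/tmp/indexer-pipeline")
reloaded.transform(df).show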

With CountVectorizer and a udf:

import org.apache.spark.ml.linalg._

// UDF that returns the indices of the non-zero entries of a Vector
spark.udf.register("indices", (v: Vector) => v.toSparse.indices)

val pipeline = new Pipeline().setStages(Array(
  new CountVectorizer().setInputCol("words").setOutputCol("vector"),
  new SQLTransformer()
    .setStatement("SELECT *, indices(vector) FROM __THIS__")
))

pipeline.fit(df).transform(df).show

// +---+----------------+--------------------+-------------------+
// | id|           words|              vector|UDF:indices(vector)|
// +---+----------------+--------------------+-------------------+
// |  1|[I, like, Spark]|(4,[0,1,3],[1.0,1...|          [0, 1, 3]|
// |  2|[I, hate, Spark]|(4,[0,1,2],[1.0,1...|          [0, 1, 2]|
// +---+----------------+--------------------+-------------------+
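One caveat for serving: SQLTransformer persists only the SQL statement, so if this pipeline is saved and then loaded in a fresh session, the indices UDF must be registered again before transform is called. A sketch, again with a hypothetical path:

import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.linalg._

// Re-register the UDF in the new session before using the reloaded model
spark.udf.register("indices", (v: Vector) => v.toSparse.indices)

val reloaded = PipelineModel.load("/tmp/cv-indices-pipeline") // hypothetical path
reloaded.transform(df).show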
