Spark: StringIndexer on sentences


Question

I am trying to do something like StringIndexer on a column of sentences, i.e. transforming a list of words into a list of integers.

For example:

Input dataset:

  (1, ["I", "like", "Spark"])
  (2, ["I", "hate", "Spark"])

I expect the output after StringIndexer to look like this:

  (1, [0, 2, 1])
  (2, [0, 3, 1])

Ideally, I would like to make such a transformation part of a Pipeline, so that I can chain a couple of transformers together and serialize the result for online serving.

Is this something Spark supports natively?

Thanks!

Answer

The standard Transformers used for converting text to features are CountVectorizer:

CountVectorizer and CountVectorizerModel aim to help convert a collection of text documents to vectors of token counts.
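
As a quick illustration, here is a minimal sketch on the question's data (the docs variable name is mine, and toDF assumes a spark-shell session or a prior import spark.implicits._):

import org.apache.spark.ml.feature.CountVectorizer

val docs = Seq(
  (1, Array("I", "like", "Spark")), (2, Array("I", "hate", "Spark"))
).toDF("id", "words")

// Fit a vocabulary, then turn each word list into a vector of token counts
val cvModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("counts")
  .fit(docs)

cvModel.transform(docs).show(false)
// cvModel.vocabulary maps each vector index back to its token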

and HashingTF:

Maps a sequence of terms to their term frequencies using the hashing trick. Currently we use Austin Appleby's MurmurHash 3 algorithm (MurmurHash3_x86_32) to calculate the hash code value for the term object. Since a simple modulo is used to transform the hash function to a column index, it is advisable to use a power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to the columns.
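
A similar sketch for HashingTF (reusing docs from above; numFeatures = 16 is an arbitrary power of two, chosen only to follow the recommendation in the quote):

import org.apache.spark.ml.feature.HashingTF

// HashingTF is a plain Transformer: there is no fit step, a word's
// index is simply its hash modulo numFeatures
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("tf")
  .setNumFeatures(16) // power of two, as recommended above

hashingTF.transform(docs).show(false)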

Both have a binary option which can be used to switch from counts to binary vectors.
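
For example (a minimal sketch; setBinary is available on both classes):

new CountVectorizer().setInputCol("words").setOutputCol("counts").setBinary(true)
new HashingTF().setInputCol("words").setOutputCol("tf").setBinary(true)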

There is no built-in Transformer that gives exactly the result you want (it wouldn't be useful for ML algorithms), but you can explode, apply StringIndexer, and then collect_list / collect_set:

import org.apache.spark.ml.feature._
import org.apache.spark.ml.Pipeline

// In a standalone application you also need: import spark.implicits._
val df = Seq(
  (1, Array("I", "like", "Spark")), (2, Array("I", "hate", "Spark"))
).toDF("id", "words")

val pipeline = new Pipeline().setStages(Array(
  // 1. One row per (id, word) pair
  new SQLTransformer()
    .setStatement("SELECT id, explode(words) as word FROM __THIS__"),
  // 2. Index each distinct word by frequency
  new StringIndexer().setInputCol("word").setOutputCol("index"),
  // 3. Re-assemble the indices per sentence
  new SQLTransformer()
    .setStatement("""SELECT id, COLLECT_SET(index) AS values
                     FROM __THIS__ GROUP BY id""")
))

pipeline.fit(df).transform(df).show

// +---+---------------+                      
// | id|         values|
// +---+---------------+
// |  1|[0.0, 1.0, 3.0]|
// |  2|[2.0, 0.0, 1.0]|
// +---+---------------+
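
Note that COLLECT_SET guarantees neither the original word order nor a stable element order across runs. If a deterministic (sorted) result is acceptable, the last stage can wrap the aggregate in Spark SQL's sort_array, as in this sketch:

new SQLTransformer()
  .setStatement("""SELECT id, SORT_ARRAY(COLLECT_SET(index)) AS values
                   FROM __THIS__ GROUP BY id""")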

With CountVectorizer and a udf:

import org.apache.spark.ml.linalg._

// Expose a SQL-callable udf that extracts the non-zero positions of a vector
spark.udf.register("indices", (v: Vector) => v.toSparse.indices)

val pipeline = new Pipeline().setStages(Array(
  // Learn the vocabulary and encode each word list as a count vector
  new CountVectorizer().setInputCol("words").setOutputCol("vector"),
  // Keep only the vector's active indices, i.e. the word ids
  new SQLTransformer()
    .setStatement("SELECT *, indices(vector) FROM __THIS__")
))

pipeline.fit(df).transform(df).show

// +---+----------------+--------------------+-------------------+
// | id|           words|              vector|UDF:indices(vector)|
// +---+----------------+--------------------+-------------------+
// |  1|[I, like, Spark]|(4,[0,1,3],[1.0,1...|          [0, 1, 3]|
// |  2|[I, hate, Spark]|(4,[0,1,2],[1.0,1...|          [0, 1, 2]|
// +---+----------------+--------------------+-------------------+
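
Since the question asks about serialization for online serving: the fitted pipeline can be persisted with the standard ML persistence API, as in the sketch below (the save path is hypothetical). One caveat: SQLTransformer stores only the SQL statement, so the indices udf must be registered again in whatever SparkSession later loads the model:

import org.apache.spark.ml.PipelineModel

val model = pipeline.fit(df)
model.write.overwrite().save("/tmp/indexer-pipeline")

// In the serving session, re-register the udf before using the model:
// spark.udf.register("indices", (v: Vector) => v.toSparse.indices)
val loaded = PipelineModel.load("/tmp/indexer-pipeline")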
