How to combine n-grams into one vocabulary in Spark?

Views: 145 Posted: 2020/5/18 0:47:58 Tags: python apache-spark nlp pyspark apache-spark-ml

Question

I am wondering whether there is a built-in Spark feature to combine 1-, 2-, ..., n-gram features into a single vocabulary. Setting n=2 in NGram and then invoking CountVectorizer results in a dictionary containing only 2-grams. What I really want is to combine all frequent 1-grams, 2-grams, etc. into one dictionary for my corpus.

Recommended answer

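You can train a separate NGram and CountVectorizer model for each n-gram order and merge their outputs using VectorAssembler. The result is one concatenated feature vector whose index space spans all of the per-order vocabularies.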

from pyspark.ml.feature import NGram, CountVectorizer, VectorAssembler
from pyspark.ml import Pipeline


def build_ngrams(inputCol="tokens", n=3):
    # One NGram stage per order 1..n, each reading the same token column.
    ngrams = [
        NGram(n=i, inputCol=inputCol, outputCol="{0}_grams".format(i))
        for i in range(1, n + 1)
    ]

    # One CountVectorizer per order, so each order learns its own vocabulary.
    vectorizers = [
        CountVectorizer(inputCol="{0}_grams".format(i),
            outputCol="{0}_counts".format(i))
        for i in range(1, n + 1)
    ]

    # Concatenate the per-order count vectors into a single feature vector.
    assembler = [VectorAssembler(
        inputCols=["{0}_counts".format(i) for i in range(1, n + 1)],
        outputCol="features"
    )]

    return Pipeline(stages=ngrams + vectorizers + assembler)

Example usage:

df = spark.createDataFrame([
  (1, ["a", "b", "c", "d"]),
  (2, ["d", "e", "d"])
], ("id", "tokens"))

build_ngrams().fit(df).transform(df)
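
The pipeline above yields a combined feature vector, but the vocabulary itself stays split across the fitted CountVectorizerModel stages. The following is a minimal sketch of how those pieces might be stitched into a single term list. The helper name combined_vocabulary is hypothetical (it is not part of the original answer or of Spark), and it assumes the vectorizer stages appear in the same order as the assembler's inputCols, which holds for build_ngrams. Stand-in objects are used so the sketch runs without a Spark session.

```python
from types import SimpleNamespace


def combined_vocabulary(stages):
    """Concatenate the vocabularies of every stage that exposes one.

    Stages without a `vocabulary` attribute (for example NGram or
    VectorAssembler) are skipped. This is a hypothetical helper.
    """
    vocab = []
    for stage in stages:
        terms = getattr(stage, "vocabulary", None)
        if terms is not None:
            vocab.extend(terms)
    return vocab


# Stand-in stages with made-up data, mimicking a fitted pipeline's stage list.
fake_stages = [
    SimpleNamespace(),                           # plays the role of an NGram stage
    SimpleNamespace(vocabulary=["d", "a"]),      # plays a 1-gram CountVectorizerModel
    SimpleNamespace(vocabulary=["a b", "d e"]),  # plays a 2-gram CountVectorizerModel
]
print(combined_vocabulary(fake_stages))
```

With a real fitted pipeline, the same helper would presumably be called as combined_vocabulary(build_ngrams().fit(df).stages); position k in the returned list should then correspond to index k of the "features" vector, since VectorAssembler concatenates its inputs in the order given.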
