Explode column with dense vectors in multiple rows

Views: 71 Published: 2020/11/2 3:17:59 Tags: python apache-spark vector pyspark explode


Problem description

I have a DataFrame with two columns: BrandWatchErwaehnungID and word_counts. The word_counts column is the output of CountVectorizer (a sparse vector). After dropping the empty rows, I created two new columns, one with the indices of the sparse vector and one with their values.

help0 = countedwords_text['BrandWatchErwaehnungID','word_counts'].rdd\
    .filter(lambda x : x[1].indices.size!=0)\
    .map(lambda x : (x[0],x[1],DenseVector(x[1].indices) , DenseVector(x[1].values))).toDF()\
    .withColumnRenamed("_1", "BrandWatchErwaehnungID").withColumnRenamed("_2", "word_counts")\
    .withColumnRenamed("_3", "word_indices").withColumnRenamed("_4", "single_word_counts")

I needed to convert them to dense vectors before adding them to my DataFrame because Spark did not accept numpy.ndarray. My problem is that I now want to explode that DataFrame on the word_indices column, but the explode method from pyspark.sql.functions only supports arrays or maps as input.

I tried:

help1 = help0.withColumn('b' , explode(help0.word_indices))

and got the following error:

cannot resolve 'explode(`word_indices`)' due to data type mismatch: input to function explode should be array or map type

Then I tried:

help1 = help0.withColumn('b' , explode(help0.word_indices.toArray()))

Which also did not work. Any suggestions?

Recommended answer

You have to use a udf:

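from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, explode
from pyspark.ml.linalg import DenseVector, SparseVector

spark = SparkSession.builder.getOrCreate()

@udf("array<integer>")
def indices(v):
    # A dense vector stores every position, so its indices are 0..len(v)-1.
    if isinstance(v, DenseVector):
        return list(range(len(v)))
    # A sparse vector stores only the populated positions.
    if isinstance(v, SparseVector):
        return v.indices.tolist()

df = spark.createDataFrame([
    (1, DenseVector([1, 2, 3])), (2, SparseVector(5, {4: 42}))],
    ("id", "v"))

# The udf returns an array column, which explode accepts.
df.select("id", explode(indices("v"))).show()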

# +---+---+
# | id|col|
# +---+---+
# |  1|  0|
# |  1|  1|
# |  1|  2|
# |  2|  4|
# +---+---+
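
The question asks for both the indices and their counts, so the same udf idea can be extended to return the two together. The following is only a sketch under stated assumptions: the helper names index_value_pairs and explode_vector and the output column names word_index and single_word_count are illustrative and not part of the original answer, and the vector type is detected by duck typing (a SparseVector exposes an indices attribute) rather than with isinstance.

```python
def index_value_pairs(v):
    """Return [(index, value), ...] as plain Python ints and floats."""
    if v is None:
        return None
    if hasattr(v, "indices"):
        # Sparse vector: only the stored entries are listed.
        return [(int(i), float(x)) for i, x in zip(v.indices, v.values)]
    # Dense vector: every position is listed.
    return [(i, float(x)) for i, x in enumerate(v.toArray())]


def explode_vector(df, id_col, vector_col):
    """Hypothetical helper: one output row per (id, index, value)."""
    # Imported here so index_value_pairs can be tried without Spark installed.
    from pyspark.sql.functions import col, explode, udf

    # Each tuple returned by the Python function becomes one struct.
    pairs = udf(index_value_pairs,
                "array<struct<word_index:int,single_word_count:double>>")
    return (df
            .select(id_col, explode(pairs(col(vector_col))).alias("pair"))
            .select(id_col, "pair.word_index", "pair.single_word_count"))
```

Assuming the df defined in the answer, explode_vector(df, "id", "v").show() would be one way to call it. Converting to int and float inside the helper matters because Spark udfs generally expect plain Python values rather than NumPy scalars.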
