How to convert numpy array elements to Spark RDD column values


Question

I'm getting ready to use the built-in CSV writing facility of the Spark DataFrame (not pandas). I have an IndexedRowMatrix already built, so there are sparse array columns in it. The conversion to a dense array is performed by the x.vector.toArray() map call (below). I cannot seem to figure out how to get the dense numpy array elements into individual columns of a Spark DataFrame. (No pandas, please.) How do I get this RDD into a 7-column DataFrame, consisting of one string column and six integer columns? My code so far:

from pyspark.mllib.linalg.distributed import CoordinateMatrix

X = CoordinateMatrix(sc.parallelize(entries))  # entries: (row, col, value) tuples
Xirm = X.toIndexedRowMatrix()
colnames = "username," + ','.join(str(cell) for cell in itemids.keys())  # Make CSV header line
# Might need this for speed: Arrow:  https://bryancutler.github.io/createDataFrame/  See above conf=...
XX = Xirm.rows.map(lambda x: (lu[x.index], x.vector.toArray()))  # (username, dense numpy array)
print(XX.take(2))
df = XX.toDF()  # TypeError: Can not infer schema for type: <class 'numpy.ndarray'>
#df.write.csv(header=colnames, path=out_filename)

Here is the take(2) output showing an example of the data:

[('kygiacomo', array([ 0.,  1.,  0.,  0.,  0.,  0.])), ('namohysip', array([ 1.,  0.,  0.,  0.,  0.,  0.]))]

See, the problem is that the RDD tuple has 2 columns, but I need 7 columns in the DataFrame. The number of columns is determined dynamically, and I have the column names in the colnames variable, but I don't know how to feed those in either. Again, the goal is to output a CSV file "equivalent" (many partial files is OK) by using Spark's built-in CSV writing function on the DataFrame. (We're running Spark 2.3.0.) Ideally, the reals would be converted to ints, with no quotes surrounding any data values. But the 2-to-7 column conversion is the really tough problem here at the moment. Thanks for any tips.
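
For reference, the TypeError in the code above comes from Spark's schema inference, which understands plain Python types but not numpy ones; .tolist() bridges the gap. A minimal illustration (no Spark required):

import numpy as np

a = np.array([0., 1., 0., 0., 0., 0.])
print(type(a))              # <class 'numpy.ndarray'> -- triggers "Can not infer schema"
print(a.tolist())           # [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
print(type(a.tolist()[0]))  # <class 'float'> -- a plain type Spark can infer a schema from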

Answer

A simple conversion to plain Python types and unpacking should do the trick:

Xirm.rows.map(lambda x: (lu[x.index], *x.vector.toArray().tolist()))  # tuple unpacking (Python 3.5+)

# or, equivalently, with list concatenation:
Xirm.rows.map(lambda x: [lu[x.index]] + x.vector.toArray().tolist())
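
From there, a hedged end-to-end sketch of the remaining steps (the names lu, Xirm, colnames, and out_filename come from the question; the int cast and the CSV write options are assumptions about the desired output, not part of the original answer):

from pyspark.sql.functions import col

XX = Xirm.rows.map(lambda x: [lu[x.index]] + x.vector.toArray().tolist())

col_list = colnames.split(',')   # ['username', '<itemid>', ...] -- dynamic column names
df = XX.toDF(col_list)           # schema inference now succeeds on plain Python floats

# Cast the value columns from double to int, per "the reals converted to ints"
df = df.select(col_list[0], *[col(c).cast('int') for c in col_list[1:]])

# header=True writes the column names; Spark only quotes fields that require it
df.write.csv(path=out_filename, header=True)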

