PySpark: create dataframe from random uniform distribution
Question
I am trying to create a dataframe using a random uniform distribution in Spark. I couldn't find anything on how to create a dataframe directly, but when I read the documentation I found that pyspark.mllib.random has a RandomRDDs object with a uniformRDD method which can create RDDs from a random uniform distribution.
But the problem is that it doesn't create two-dimensional RDDs. Is there a way I can create a two-dimensional RDD or (preferably) a dataframe?
I can create a few RDDs and use them to create a dataframe, but the dataset I am using has many fields (100+), and creating 100+ RDDs and then zipping them doesn't seem efficient.
Answer
You can generate a uniform vector RDD and convert it to a DataFrame:
from pyspark.mllib.linalg import DenseVector
from pyspark.mllib.random import RandomRDDs

# numpy.ndarray is not supported by toDF, so wrap each row in a DenseVector
data = RandomRDDs.uniformVectorRDD(sc, 10, 10) \
    .map(lambda a: DenseVector(a)) \
    .map(lambda a: (a,)) \
    .toDF(['features'])
data.show()
# +--------------------+
# | features|
# +--------------------+
# |[0.97051622217872...|
# |[0.39165143210012...|
# |[0.70067295066813...|
# |[0.59568555130484...|
# |[0.16572531686478...|
# |[0.92494190257048...|
# |[0.43691499080129...|
# |[0.28320336307013...|
# |[0.85420768678698...|
# |[0.65923297006740...|
# +--------------------+
For more info, you can always check the official documentation.
(check the comments)
If you want to have each value in a separate column, you don't need to convert your vectors into a DenseVector; convert them into a list instead:
data = RandomRDDs.uniformVectorRDD(sc, 10, 10).map(lambda a: a.tolist()).toDF()
data.show()
# +-------------------+-------------------+--------------------+-------------------+-------------------+-------------------+-------------------+--------------------+------------------+-------------------+
# | _1| _2| _3| _4| _5| _6| _7| _8| _9| _10|
# +-------------------+-------------------+--------------------+-------------------+-------------------+-------------------+-------------------+--------------------+------------------+-------------------+
# |0.14585743784778926| 0.803498096310468| 0.31605227000909253| 0.8119612820220813|0.41447836235778723| 0.2676439013488928| 0.8524783652359866| 0.5701076199786781|0.6693708605568874| 0.8111256775283068|
# | 0.1827511073189425| 0.3350517687462683| 0.7400032940623857| 0.7869460532004358| 0.6448914199353433| 0.9805601228284964|0.20020913675524243| 0.7922294214683878|0.9374972404332362| 0.6765087842364208|
# |0.38625776221583874|0.04229839224493681| 0.7734933051852422| 0.0274813429089541| 0.311445753826302|0.25698473390480325| 0.9437646814604557| 0.4741747733429049| 0.290710728473321| 0.677912271088622|
# | 0.7896873370148003| 0.1858840420861243| 0.3197437373418126|0.10097010041540833|0.10289933172316801| 0.5449368374946228| 0.4030450125686461| 0.21948568405399982|0.8930079107298496| 0.7519921983394425|
# | 0.815811790931526| 0.3634760983908547| 0.42601575700182837|0.13606388717010864| 0.5861222009300258| 0.3340860113942531| 0.2557956812340677| 0.43528056172400853|0.3922245296661778| 0.8912435252335149|
# |0.30392495415210397| 0.7925870450504611| 0.9030779298622288| 0.8727793109267047| 0.8158542803828924| 0.7931830841520005| 0.6282396202128951| 0.1420886768888291|0.8614276809589785|0.17436606175314684|
# | 0.9382134044434042| 0.6749506191750686|0.015443852959660331|0.12038319457909019| 0.417781126294975|0.07393488977646023|0.31885174813644857| 0.728226037613587|0.9952269580720621|0.07007086773721505|
# |0.13783951066912703| 0.7119354308993141| 0.42197923155036043|0.29716042608097326| 0.9738408655296322| 0.9868052613269893| 0.6935287164137466|0.037473358201903895|0.3495081198619411| 0.8435628173797828|
# | 0.1587632683889939| 0.7360623327266481| 0.42321853435929413| 0.9677124294019807| 0.63138909800576|0.09938015379429832| 0.5399110874035429| 0.7668582384258967|0.7925729040215128| 0.1764801807830343|
# | 0.2588173671258266| 0.5196258205360417| 0.47988935453823345| 0.6699354533063644| 0.8233338127383266| 0.8249394954169588|0.32268906006759734| 0.2768177979947253|0.9951067081655113| 0.5263299321371093|
# +-------------------+-------------------+--------------------+-------------------+-------------------+-------------------+-------------------+--------------------+------------------+-------------------+
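For a wide schema (the 100+ fields mentioned in the question), another option is to generate the random matrix locally with NumPy and hand it to spark.createDataFrame. A minimal sketch, assuming an active SparkSession named spark and illustrative column names f0, f1, ...:

```python
import numpy as np

# Illustrative dimensions; the question mentions 100+ fields
n_rows, n_cols = 10, 10
mat = np.random.uniform(size=(n_rows, n_cols))  # values drawn from U[0, 1)

# With an assumed active SparkSession `spark`, each row of the matrix
# becomes a DataFrame row, with one column per field:
# df = spark.createDataFrame(mat.tolist(), schema=[f"f{i}" for i in range(n_cols)])
# df.show()
```

Note that this builds the whole matrix on the driver, so it only suits data that fits in memory; RandomRDDs keeps the generation distributed across the cluster.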