Dividing complex rows of dataframe to simple rows in Pyspark

Views: 170 Published: 2016/5/22 15:19:00 Tags: python apache-spark pyspark apache-spark-sql spark-dataframe

Problem description

I have this code:

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext()
sqlContext = SQLContext(sc)
documents = sqlContext.createDataFrame([
    Row(id=1, title=[Row(value=u'cars', max_dist=1000)]),
    Row(id=2, title=[Row(value=u'horse bus',max_dist=50), Row(value=u'normal bus',max_dist=100)]),
    Row(id=3, title=[Row(value=u'Airplane', max_dist=5000)]),
    Row(id=4, title=[Row(value=u'Bicycles', max_dist=20),Row(value=u'Motorbikes', max_dist=80)]),
    Row(id=5, title=[Row(value=u'Trams', max_dist=15)])])

documents.show(truncate=False)
#+---+----------------------------------+
#|id |title                             |
#+---+----------------------------------+
#|1  |[[1000,cars]]                     |
#|2  |[[50,horse bus], [100,normal bus]]|
#|3  |[[5000,Airplane]]                 |
#|4  |[[20,Bicycles], [80,Motorbikes]]  |
#|5  |[[15,Trams]]                      |
#+---+----------------------------------+

I need to split all compound rows (e.g. ids 2 and 4) into multiple rows while retaining the 'id', to get a result like this:

#+---+----------------------------------+
#|id |title                             |
#+---+----------------------------------+
#|1  |[1000,cars]                       |
#|2  |[50,horse bus]                    |
#|2  |[100,normal bus]                  |
#|3  |[5000,Airplane]                   |
#|4  |[20,Bicycles]                     |
#|4  |[80,Motorbikes]                   |
#|5  |[15,Trams]                        |
#+---+----------------------------------+

Solution

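Just explode it. explode returns one row per element of the array column and repeats the other columns (here id) on each of them:

from pyspark.sql.functions import explode

documents.withColumn("title", explode("title")).show()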
## +---+----------------+
## | id|           title|
## +---+----------------+
## |  1|     [1000,cars]|
## |  2|  [50,horse bus]|
## |  2|[100,normal bus]|
## |  3| [5000,Airplane]|
## |  4|   [20,Bicycles]|
## |  4| [80,Motorbikes]|
## |  5|      [15,Trams]|
## +---+----------------+
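
To see why the ids 2 and 4 each appear twice in that output, the row-level behaviour can be modelled in plain Python without a Spark installation. This is only a sketch: explode_rows is a hypothetical helper written for this illustration, not part of PySpark, and the rows are plain dicts and tuples standing in for Row objects. It also mirrors one detail worth knowing: Spark's explode produces no output row when the array is empty or null (newer Spark versions offer explode_outer to keep such rows).

```python
# Plain-Python model of exploding an array column (illustration only).
# explode_rows is a hypothetical helper, not a PySpark function.

def explode_rows(rows, column):
    """Yield one output row per element of row[column], copying the other fields."""
    for row in rows:
        # A missing (None) or empty array contributes no output rows.
        for element in row[column] or []:
            exploded = dict(row)        # shallow copy keeps id and any other columns
            exploded[column] = element  # replace the array with a single element
            yield exploded

# Same data as the question, with tuples standing in for the nested Row objects.
documents = [
    {"id": 1, "title": [(1000, "cars")]},
    {"id": 2, "title": [(50, "horse bus"), (100, "normal bus")]},
    {"id": 3, "title": [(5000, "Airplane")]},
    {"id": 4, "title": [(20, "Bicycles"), (80, "Motorbikes")]},
    {"id": 5, "title": [(15, "Trams")]},
]

result = list(explode_rows(documents, "title"))
for row in result:
    print(row["id"], row["title"])
```

Five input rows become seven output rows, in the same order as the table above.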
