PySpark groupByKey returning pyspark.resultiterable.ResultIterable


Problem Description

I am trying to figure out why my groupByKey is returning the following:

[(0, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a210>), (1, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a4d0>), (2, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a390>), (3, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a290>), (4, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a450>), (5, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a350>), (6, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a1d0>), (7, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a490>), (8, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a050>), (9, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a650>)]

I have flatMapped values that look like this:

[(0, u'D'), (0, u'D'), (0, u'D'), (0, u'D'), (0, u'D'), (0, u'D'), (0, u'D'), (0, u'D'), (0, u'D'), (0, u'D')]

I'm doing just a simple:

groupRDD = columnRDD.groupByKey()

Solution

What you're getting back is a lazy object that lets you iterate over the grouped values. You can turn the result of groupByKey into concrete lists by calling list() on the values, e.g.

example = sc.parallelize([(0, u'D'), (0, u'D'), (1, u'E'), (2, u'F')])

example.groupByKey().collect()
# Gives [(0, <pyspark.resultiterable.ResultIterable object ......]

example.groupByKey().map(lambda x : (x[0], list(x[1]))).collect()
# Gives [(0, [u'D', u'D']), (1, [u'E']), (2, [u'F'])]
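If you don't have a Spark cluster handy, the transformation above can be sketched in plain Python (this is an illustrative analogue of what groupByKey followed by list() produces, not PySpark's actual distributed implementation):

```python
from collections import defaultdict

def group_by_key(pairs):
    """Plain-Python analogue of RDD.groupByKey() + list():
    collect all values sharing a key into one list per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Sort by key for a deterministic result; Spark makes no such ordering guarantee.
    return sorted(groups.items())

pairs = [(0, u'D'), (0, u'D'), (1, u'E'), (2, u'F')]
print(group_by_key(pairs))
# [(0, ['D', 'D']), (1, ['E']), (2, ['F'])]
```

As a side note, PySpark's mapValues(list) expresses the same conversion a little more directly than indexing into the tuple: example.groupByKey().mapValues(list).collect().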
