List (or iterator) of tuples returned by map (PySpark)
Question
I have a mapper method:
def mapper(value):
    ...
    for key, value in some_list:
        yield key, value
What I need is not really far from the ordinary wordcount example, actually. I already have a working script, but only if the mapper method looks like this:
def mapper(value):
    ...
    return key, value
This is what the call looks like:
sc.textFile(sys.argv[2], 1).map(mapper).reduceByKey(reducer).collect()
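For context, reduceByKey expects the upstream RDD to contain (key, value) pairs and folds the values of each key with the given reducer. A plain-Python sketch of that contract (a toy model for illustration, not Spark's actual implementation):

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(reducer, pairs):
    # Group (key, value) pairs by key, then fold each group's values
    # with the reducer -- a toy model of RDD.reduceByKey.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return [(key, reduce(reducer, values)) for key, values in grouped.items()]

pairs = [("spark", 1), ("fast", 1), ("spark", 1)]
print(reduce_by_key(lambda x, y: x + y, pairs))
# [('spark', 2), ('fast', 1)]
```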
I spent 2 hours trying to write code that would support generators in the mapper, but couldn't. I would even settle for just returning a list:
def mapper(value):
    ...
    result_list = []
    for key, value in some_list:
        result_list.append((key, value))
    return result_list
Here: https://groups.google.com/forum/#!searchin/spark-users/flatmap$20multiple/spark-users/1WqVhRBaJsU/-D5QRbenlUgJ I found that I should use flatMap, but it didn't do the trick - my reducer then started to get inputs like (key1, value1, key2, value2, value3, ...) - but it should be [(key1, value1), (key2, value2, value3), ...]. In other words, the reducer started receiving only single pieces, without knowing whether each piece is a value or a key, and if it's a value, which key it belongs to.
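That symptom is consistent with how flatMap works: it is map followed by flattening one level, so if the mapper returns a single (key, value) tuple, flatMap iterates over that tuple and emits the key and the value as separate elements. A plain-Python sketch of the two behaviors (a toy model for illustration, not Spark's actual implementation):

```python
from itertools import chain

def flat_map(f, items):
    # Apply f to each item and flatten the results one level --
    # a toy model of RDD.flatMap.
    return list(chain.from_iterable(f(item) for item in items))

def bad_mapper(line):
    # Returns a single tuple; flat_map iterates over it,
    # splitting the pair into bare elements.
    return (line.split()[0], 1)

def good_mapper(line):
    # Returns a list of tuples; flat_map emits whole pairs.
    return [(word, 1) for word in line.split()]

lines = ["spark is fast"]
print(flat_map(bad_mapper, lines))   # ['spark', 1]
print(flat_map(good_mapper, lines))  # [('spark', 1), ('is', 1), ('fast', 1)]
```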
So how do I use mappers that return iterators or lists?

Thanks!
Answer
You can use flatMap if you want a map function that returns multiple outputs. The function passed to flatMap can return an iterable:
>>> words = sc.textFile("README.md")
>>> def mapper(line):
...     return ((word, 1) for word in line.split())
...
>>> words.flatMap(mapper).take(4)
[(u'#', 1), (u'Apache', 1), (u'Spark', 1), (u'Lightning-Fast', 1)]
>>> counts = words.flatMap(mapper).reduceByKey(lambda x, y: x + y)
>>> counts.take(5)
[(u'all', 1), (u'help', 1), (u'webpage', 1), (u'when', 1), (u'Hadoop', 12)]
It can also be a generator function:
>>> words = sc.textFile("README.md")
>>> def mapper(line):
...     for word in line.split():
...         yield (word, 1)
...
>>> words.flatMap(mapper).take(4)
[(u'#', 1), (u'Apache', 1), (u'Spark', 1), (u'Lightning-Fast', 1)]
>>> counts = words.flatMap(mapper).reduceByKey(lambda x, y: x + y)
>>> counts.take(5)
[(u'all', 1), (u'help', 1), (u'webpage', 1), (u'when', 1), (u'Hadoop', 12)]
You mentioned that you tried flatMap but it flattened everything down to a list [key, value, key, value, ...] instead of a list of key-value pairs [(key, value), (key, value), ...]. I suspect this is a problem in your map function. If you're still experiencing this problem, could you post a more complete version of your map function?