Spark reading python3 pickle as input


Question


My data are available as sets of Python 3 pickle files. Most of them are serialized pandas data frames.


I'd like to start using Spark because I need more memory and CPU than one computer can have. Also, I'll use HDFS for distributed storage.


As a beginner, I haven't found relevant information explaining how to use pickle files as input files.


Does support for this exist? If not, is there any workaround?

Thanks a lot.

Answer


A lot depends on the data itself. Generally speaking, Spark doesn't perform particularly well when it has to read large, non-splittable files. Nevertheless, you can try the binaryFiles method and combine it with the standard Python tools. Let's start with some dummy data:

import tempfile
import pandas as pd
import numpy as np

outdir = tempfile.mkdtemp()

# Write five small pickled data frames into a temporary directory
for i in range(5):
    pd.DataFrame(
        np.random.randn(10, 2), columns=['foo', 'bar']
    ).to_pickle(tempfile.mkstemp(dir=outdir)[1])

Next, we can read the files with the binaryFiles method:

rdd = sc.binaryFiles(outdir)

and deserialize the individual objects:

import pickle
from io import BytesIO

# values() drops the file paths; each value is the raw byte content of one file
dfs = rdd.values().map(lambda p: pickle.load(BytesIO(p)))
dfs.first()[:3]

##         foo       bar
## 0 -0.162584 -2.179106
## 1  0.269399 -0.433037
## 2 -0.295244  0.119195
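At this point each RDD element is a whole pandas frame. If you want one distributed dataset instead, a common trick is to explode each frame into plain per-row records before handing them to Spark. A minimal sketch (the `frame_to_records` helper and the `spark` session name are assumptions, not part of the original answer):

```python
import pandas as pd
import numpy as np

def frame_to_records(df):
    # One dict per row -- a shape spark.createDataFrame can consume
    return df.to_dict('records')

# Locally, this is just a list of plain dicts:
df = pd.DataFrame(np.random.randn(3, 2), columns=['foo', 'bar'])
records = frame_to_records(df)
print(len(records), sorted(records[0]))  # 3 ['bar', 'foo']

# On the cluster (hypothetical, assuming `dfs` from the snippet above):
# spark_df = spark.createDataFrame(dfs.flatMap(frame_to_records))
```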

One important note is that this approach typically requires significantly more memory than a simple method like textFile.


Another approach is to parallelize only the paths and use libraries which can read directly from a distributed file system, such as hdfs3. This typically means lower memory requirements at the price of significantly worse data locality.
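As a sketch of that idea, using local files in place of HDFS (`load_frame` and the file layout are illustrative assumptions; on a real cluster the open call would go through hdfs3 and the final loop would become a `sc.parallelize(paths).map(load_frame)`):

```python
import os
import tempfile
import pandas as pd
import numpy as np

# Local stand-ins for pickle files that would live on HDFS
outdir = tempfile.mkdtemp()
paths = []
for i in range(3):
    path = os.path.join(outdir, 'part-{}.pkl'.format(i))
    pd.DataFrame(np.random.randn(5, 2), columns=['foo', 'bar']).to_pickle(path)
    paths.append(path)

def load_frame(path):
    # On HDFS this could open the file through hdfs3 instead
    return pd.read_pickle(path)

frames = [load_frame(p) for p in paths]
print(len(frames), frames[0].shape)  # 3 (5, 2)

# Cluster version (hypothetical):
# frames_rdd = sc.parallelize(paths).map(load_frame)
```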


Considering these two facts, it is typically better to serialize your data in a format which can be loaded with higher granularity.
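For example, line-delimited JSON is both splittable and readable one record at a time, so converting the pickles once makes later Spark jobs much easier. A sketch of the conversion (integer data is used to keep the round trip exact; the file names are illustrative):

```python
import os
import tempfile
import pandas as pd

outdir = tempfile.mkdtemp()
df = pd.DataFrame({'foo': range(5), 'bar': range(5, 10)})

# One JSON object per line -- a splittable, record-level format
path = os.path.join(outdir, 'data.jsonl')
df.to_json(path, orient='records', lines=True)

back = pd.read_json(path, orient='records', lines=True)
print((back[['foo', 'bar']].values == df[['foo', 'bar']].values).all())  # True

# Spark can then read and split it natively (hypothetical):
# spark.read.json(outdir)
```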

Note


SparkContext provides a pickleFile method, but the name can be misleading. It reads SequenceFiles containing pickled objects, not plain Python pickles.
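One way to see the difference locally: a plain pickle written by pandas starts with the pickle protocol opcode, while a Hadoop SequenceFile (which is what saveAsPickleFile produces and pickleFile reads back) starts with the magic bytes b'SEQ'. A quick check:

```python
import pickle

# Plain Python pickle: starts with the PROTO opcode (0x80 for protocol >= 2),
# not with the b'SEQ' magic that Hadoop SequenceFiles begin with.
raw = pickle.dumps({'foo': 1}, protocol=2)
print(raw[:1])            # b'\x80'
print(raw[:3] == b'SEQ')  # False

# The pair that actually belongs together (hypothetical cluster usage):
# rdd.saveAsPickleFile(path)   # writes SequenceFiles of pickled batches
# sc.pickleFile(path)          # reads them back
```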

