Loading a defaultdict in Hadoop using pickle and sys.stdin


Problem description

I posted a similar question about an hour ago, but have since deleted it after realising I was asking the wrong question. I have the following pickled defaultdict:

ccollections
defaultdict
p0
(c__builtin__
list
p1
tp2
Rp3
V"I love that"
p4
(lp5
S'05-Aug-13 10:17'
p6
aS'05-Aug-13 10:17'

When using Hadoop, the input is always read in using:

for line in sys.stdin:

I tried reading the pickled defaultdict using this:

myDict = pickle.load(sys.stdin)
for text, date in myDict.iteritems():

But to no avail. The rest of the code works, as I tested it locally using .load('filename.txt'). Am I doing this wrong? How can I load the information?
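For reference, here is a minimal sketch of the local test described above (the dictionary contents are assumed from the pickle dump shown earlier, and the file name is taken from the question). One thing worth noting: pickle.load expects an open file object rather than a filename string, so the local call presumably looked more like pickle.load(open('filename.txt')):

    import pickle
    from collections import defaultdict

    # Build a small defaultdict like the one in the dump above.
    d = defaultdict(list)
    d["I love that"].append("05-Aug-13 10:17")

    # Round-trip it through a file object -- pickle.load takes an
    # open file, not a filename string.
    with open("filename.txt", "wb") as f:
        pickle.dump(d, f)
    with open("filename.txt", "rb") as f:
        myDict = pickle.load(f)

    for text, dates in myDict.iteritems():
        print text, dates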

Update:

After following an online tutorial, I have amended my code to this:

def read_input(file):
    for line in file:
        print line

def main(separator='\t'):
    myDict = read_input(sys.stdin)

This prints out each line, showing it is successfully reading the file - however, no semblance of the defaultdict structure is kept, and the output is:

p769    

aS'05-Aug-13 10:19' 

p770    

aS'05-Aug-13 15:19' 

p771    

as"I love that" 

Obviously this is no good. Does anybody have any suggestions?

Solution

Why is your input data in the pickle format? Where does your input data come from? One of the goals of Hadoop/MapReduce is to process data that's too large to fit into the memory of a single machine. Thus, reading the whole input data and then trying to deserialize it runs contrary to the MR design paradigm and most likely won't even work with production-scale data sets.

The solution is to format your input data as, for example, a TSV text file with exactly one tuple of your dictionary per row. You can then process each tuple on its own, e.g.:

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")  # one dictionary tuple per row
    key, value = process(fields)  # process() and emit() are placeholders
    emit(key, value)
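To make this concrete, here is a hedged sketch of that pipeline, assuming the pickled defaultdict maps a text string to a list of date strings (as in the dump above). The file names and the emitted key/value layout are assumptions for illustration:

    # flatten.py -- one-off local preprocessing, run before the Hadoop job:
    # turn the pickled defaultdict into one "text<TAB>date" line per entry.
    import pickle

    with open("filename.txt", "rb") as f:
        myDict = pickle.load(f)

    with open("input.tsv", "w") as out:
        for text, dates in myDict.iteritems():
            for date in dates:
                out.write("%s\t%s\n" % (text, date))

    # mapper.py -- streaming mapper; consumes one tuple per line,
    # never needing the whole dictionary in memory.
    import sys

    for line in sys.stdin:
        text, date = line.rstrip("\n").split("\t", 1)
        # Emit whatever key/value pair the job needs; here we just echo.
        print "%s\t%s" % (text, date)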
