Loading a defaultdict in Hadoop using pickle and sys.stdin
I posted a similar question about an hour ago, but have since deleted it after realising I was asking the wrong question. I have the following pickled defaultdict:
ccollections
defaultdict
p0
(c__builtin__
list
p1
tp2
Rp3
V"I love that"
p4
(lp5
S'05-Aug-13 10:17'
p6
aS'05-Aug-13 10:17'
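The listing above is pickle protocol 0, which serializes as ASCII opcodes, starting with a reference to the object's class. A quick check (the sample data here is hypothetical, chosen to mirror the structure of the dump) reproduces the opening opcodes:

```python
import pickle
from collections import defaultdict

# Hypothetical data matching the structure shown in the dump.
d = defaultdict(list)
d["I love that"].append("05-Aug-13 10:17")

# Protocol 0 output begins with the GLOBAL opcode ('c') naming the
# class: ccollections\ndefaultdict\n...
dump = pickle.dumps(d, protocol=0)
print(dump[:25])
```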
When using Hadoop, the input is always read in using:
for line in sys.stdin:
I tried reading the pickled defaultdict
using this:
myDict = pickle.load(sys.stdin)
for text, date in myDict.iteritems():
But to no avail; the rest of the code works, as I tested it locally using .load('filename.txt'). Am I doing this wrong? How can I load the information?
Update:
After following an online tutorial, I amended my code to this:
def read_input(file):
    for line in file:
        print line

def main(separator='\t'):
    myDict = read_input(sys.stdin)
This prints out each line, showing it is successfully reading the file. However, no semblance of the defaultdict structure is kept, and the output is:
p769
aS'05-Aug-13 10:19'
p770
aS'05-Aug-13 15:19'
p771
as"I love that"
Obviously this is no good. Does anybody have any suggestions?
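Printing line by line cannot recover the structure, because a pickle is a single serialized object rather than a line-oriented format. If the entire pickle were guaranteed to reach one process, a sketch of a fix would be to consume the stream in full before unpickling. This still loads everything into memory; an in-memory stream stands in here for sys.stdin.buffer (Python 3) or sys.stdin (Python 2), and the sample data is hypothetical:

```python
import io
import pickle
from collections import defaultdict

def read_pickled(stream):
    # Consume the stream in full: pickle.loads needs the complete
    # byte sequence of the serialized object, not one line of it.
    return pickle.loads(stream.read())

# In-memory stand-in for the mapper's input stream.
d = defaultdict(list)
d["I love that"].append("05-Aug-13 10:17")
restored = read_pickled(io.BytesIO(pickle.dumps(d)))

# The defaultdict structure survives the round trip.
for text, dates in restored.items():
    for date in dates:
        print("%s\t%s" % (text, date))
```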
Why is your input data in the pickle format? Where does your input data come from? One of the goals of Hadoop/MapReduce is to process data that's too large to fit into the memory of a single machine. Thus, reading the whole input data and then trying to deserialize it runs contrary to the MR design paradigm and most likely won't even work with production-scale data sets.
The solution is to format your input data as, for example, a TSV text file with exactly one tuple of your dictionary per row. You can then process each tuple on its own, e.g.:
for line in sys.stdin:
    tuple = line.split("\t")
    key, value = process(tuple)
    emit(key, value)
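The suggested preprocessing step could be sketched like this (dict_to_tsv is an illustrative name and the sample data is hypothetical; in a real job the output would go to a file in HDFS rather than a buffer):

```python
import io
from collections import defaultdict

def dict_to_tsv(d, out):
    # Emit one key<TAB>value line per list entry, so every row is an
    # independent record that Hadoop can split across mappers freely.
    for key, values in d.items():
        for value in values:
            out.write("%s\t%s\n" % (key, value))

d = defaultdict(list)
d["I love that"].append("05-Aug-13 10:17")
d["I love that"].append("05-Aug-13 10:19")

buf = io.StringIO()
dict_to_tsv(d, buf)
print(buf.getvalue())
```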