reading json file in pyspark
Question
I'm new to PySpark. Below is my JSON file format from Kafka.
{
  "header": {
    "platform": "atm",
    "version": "2.0"
  }
  "details": [
    {
      "abc": "3",
      "def": "4"
    },
    {
      "abc": "5",
      "def": "6"
    },
    {
      "abc": "7",
      "def": "8"
    }
  ]
}
How can I read the values of all the "abc" and "def" fields in details and add them to a new list like [(1,2),(3,4),(5,6),(7,8)]? The new list will be used to create a Spark data frame. How can I do this in PySpark? I tried the code below.
parsed = messages.map(lambda (k,v): json.loads(v))
list = []
summed = parsed.map(lambda detail:list.append((String(['mcc']), String(['mid']), String(['dsrc']))))
output = summed.collect()
print output
It produces the error "too many values to unpack" below, at the statement summed.collect():
16/09/12 12:46:10 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
16/09/12 12:46:10 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
16/09/12 12:46:10 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
16/09/12 12:46:10 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/hdp/2.3.4.0-3485/spark/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/usr/hdp/2.3.4.0-3485/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/hdp/2.3.4.0-3485/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "", line 1, in
ValueError: too many values to unpack
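For what it's worth, the extraction logic can be checked locally without Spark. The sketch below stands in a plain list of hypothetical (key, value) pairs for the Kafka stream, and indexes into each tuple instead of unpacking it in the lambda signature, since `lambda (k, v): ...` is no longer valid syntax in Python 3 and unpacking raises "too many values to unpack" whenever an element is not a 2-tuple:

```python
import json

# Hypothetical Kafka-style (key, value) pairs; in the question these come
# from `messages`, a stream/RDD of 2-tuples. The comma after "header" is
# included here so the JSON parses.
messages = [
    (None, '{"header":{"platform":"atm","version":"2.0"},'
           '"details":[{"abc":"3","def":"4"},'
           '{"abc":"5","def":"6"},{"abc":"7","def":"8"}]}'),
]

# Index into the tuple rather than unpacking in the lambda/comprehension head.
parsed = [json.loads(kv[1]) for kv in messages]

# Flatten the "details" arrays and convert the string fields to ints.
pairs = [(int(d["abc"]), int(d["def"]))
         for record in parsed
         for d in record["details"]]

print(pairs)  # [(3, 4), (5, 6), (7, 8)]
```

The same indexing fix applies directly to the RDD version, e.g. `messages.map(lambda kv: json.loads(kv[1]))`.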
Answer
First of all, the JSON is invalid: after the header block a , is missing.
That being said, let's take this JSON:
{"header":{"platform":"atm","version":"2.0"},"details":[{"abc":"3","def":"4"},{"abc":"5","def":"6"},{"abc":"7","def":"8"}]}
This can be processed by:
>>> df = sqlContext.jsonFile('test.json')
>>> df.first()
Row(details=[Row(abc='3', def='4'), Row(abc='5', def='6'), Row(abc='7', def='8')], header=Row(platform='atm', version='2.0'))
>>> df = df.flatMap(lambda row: row['details'])
PythonRDD[38] at RDD at PythonRDD.scala:43
>>> df.collect()
[Row(abc='3', def='4'), Row(abc='5', def='6'), Row(abc='7', def='8')]
>>> df.map(lambda entry: (int(entry['abc']), int(entry['def']))).collect()
[(3, 4), (5, 6), (7, 8)]
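The flatMap/map chain above can be reproduced in plain Python (no Spark session needed) to confirm the logic; here `itertools.chain.from_iterable` plays the role of flatMap:

```python
import json
from itertools import chain

# The corrected one-line JSON from the answer.
raw = ('{"header":{"platform":"atm","version":"2.0"},'
       '"details":[{"abc":"3","def":"4"},'
       '{"abc":"5","def":"6"},{"abc":"7","def":"8"}]}')

rows = [json.loads(raw)]                                       # stands in for the loaded rows
details = chain.from_iterable(row["details"] for row in rows)  # ~ flatMap
pairs = [(int(e["abc"]), int(e["def"])) for e in details]      # ~ map

print(pairs)  # [(3, 4), (5, 6), (7, 8)]
```

Note that the answer uses the old Spark 1.x API: on recent versions, `sqlContext.jsonFile` is replaced by `spark.read.json`, and the same flattening is usually done on the DataFrame itself with `pyspark.sql.functions.explode` rather than by dropping to RDD flatMap.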
Hope this helps!