reading json file in pyspark

Problem description


I'm new to PySpark. Below is the format of my JSON file from Kafka.

{
        "header": {
        "platform":"atm",
        "version":"2.0"
       }
        "details":[
       {
        "abc":"3",
        "def":"4"
       },
       {
        "abc":"5",
        "def":"6"
       },
       {
        "abc":"7",
        "def":"8"
       }    
      ]
    }

How can I read through the values of all "abc" and "def" in details and add them to a new list like this: [(1,2),(3,4),(5,6),(7,8)]? The new list will be used to create a Spark data frame. How can I do this in PySpark? I tried the code below.

parsed = messages.map(lambda (k,v): json.loads(v))
list = []
summed = parsed.map(lambda detail:list.append((String(['mcc']), String(['mid']), String(['dsrc']))))
output = summed.collect()
print output

It produces the error 'too many values to unpack'.

The error message below appears at the statement summed.collect():

16/09/12 12:46:10 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
16/09/12 12:46:10 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
16/09/12 12:46:10 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
16/09/12 12:46:10 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/hdp/2.3.4.0-3485/spark/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/usr/hdp/2.3.4.0-3485/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/hdp/2.3.4.0-3485/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "", line 1, in 
ValueError: too many values to unpack
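That error usually means a record could not be unpacked into exactly two parts, for example because messages already holds plain JSON strings rather than (key, value) pairs; the lambda (k, v) tuple-unpacking syntax is also removed in Python 3, and appending to a driver-side list inside map() has no effect because the closure runs on the executors. A rough sketch that avoids these pitfalls, assuming messages is an RDD (as the collect() call above implies) whose records are either JSON strings or (key, value) pairs carrying the JSON string as the value:

import json

# Sketch only: messages is assumed to be an RDD whose records are either
# plain JSON strings or (key, value) pairs with the JSON string as the value.
def parse_record(record):
    value = record[1] if isinstance(record, tuple) else record
    return json.loads(value)

parsed = messages.map(parse_record)

# Build the (abc, def) pairs on the executors instead of appending to a
# driver-side list; list.append inside map() is silently lost.
pairs = parsed.flatMap(
    lambda doc: [(int(d['abc']), int(d['def'])) for d in doc['details']]
)
print(pairs.collect())  # e.g. [(3, 4), (5, 6), (7, 8)]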

Solution

First of all, the JSON is invalid: a , is missing after the header object.

That being said, let's take this JSON:

{"header":{"platform":"atm","version":"2.0"},"details":[{"abc":"3","def":"4"},{"abc":"5","def":"6"},{"abc":"7","def":"8"}]}

This can be processed by:

>>> df = sqlContext.jsonFile('test.json')
>>> df.first()
Row(details=[Row(abc='3', def='4'), Row(abc='5', def='6'), Row(abc='7', def='8')], header=Row(platform='atm', version='2.0'))

>>> df = df.flatMap(lambda row: row['details'])
PythonRDD[38] at RDD at PythonRDD.scala:43

>>> df.collect()
[Row(abc='3', def='4'), Row(abc='5', def='6'), Row(abc='7', def='8')]

>>> df.map(lambda entry: (int(entry['abc']), int(entry['def']))).collect()
[(3, 4), (5, 6), (7, 8)]
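
Since the stated goal is to build a Spark data frame from those pairs, they can also be passed to createDataFrame directly instead of being collected into a Python list first. A minimal sketch continuing from the transcript above, with illustrative column names:

# Continuing from the transcript: df is now an RDD of Row(abc=..., def=...).
pairs = df.map(lambda entry: (int(entry['abc']), int(entry['def'])))
# Build a DataFrame straight from the RDD of tuples; no driver-side list is needed.
pairs_df = sqlContext.createDataFrame(pairs, ['abc', 'def'])
pairs_df.show()
# Expected output, roughly:
# +---+---+
# |abc|def|
# +---+---+
# |  3|  4|
# |  5|  6|
# |  7|  8|
# +---+---+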

Hope this helps!
