Converting a dataframe into JSON (in pyspark) and then selecting desired fields

Question

I'm new to Spark. I have a DataFrame that contains the results of some analysis. I converted that DataFrame to JSON so I could display it in a Flask app:

results = result.toJSON().collect()

An example entry from the JSON output is below. I then tried to run a for loop to get specific fields:

{"userId":"1","systemId":"30","title":"interest"}

for i in results:
    print i["userId"]

This doesn't work at all, and I get errors such as: Python (json): TypeError: expected string or buffer

I tried json.dumps and json.loads and still got nowhere. I keep getting errors such as "string indices must be integers", as well as the error above.

Then I tried this:

  print i[0]

This gave me the first character of the JSON, "{", instead of the first row. I don't really know what to do. Can anyone tell me where I'm going wrong?

Thanks a lot.

Recommended answer

Each element produced by result.toJSON() is a JSON-encoded string, so use json.loads() to convert it to a dict. Also note that when you iterate over a dict with a for loop, you are given the keys of the dict. In your for loop you're treating each item as if it were a dict, when in fact it is just a string. Try this:

import json

# toJSON() turns each row of the DataFrame into a JSON string;
# calling first() on the result fetches the first row.
results = json.loads(result.toJSON().first())

# Iterating over a dict yields its keys, so look each value up by key.
for key in results:
    print(results[key])

# To decode the entire DataFrame, iterate over the result of toJSON().

def print_rows(row):
    data = json.loads(row)
    for key in data:
        print("{key}:{value}".format(key=key, value=data[key]))


results = result.toJSON()
# foreach() runs on the executors, so on a cluster the output goes to
# the executor logs rather than the driver console.
results.foreach(print_rows)
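
For the collect() call from the question, the following is a minimal sketch of why i[0] printed "{" and how decoding each element fixes the loop. It does not need Spark: the hand-written list results is only a stand-in for what result.toJSON().collect() is assumed to return, using the sample row from the question.

```python
import json

# Stand-in for result.toJSON().collect(): a plain list of JSON strings.
# The single entry is the sample row shown in the question.
results = ['{"userId":"1","systemId":"30","title":"interest"}']

first = results[0]
# Indexing a string yields a single character, which is why i[0] gave "{".
first_char = first[0]

# Decode each string into a dict before looking fields up by name.
rows = [json.loads(s) for s in results]
user_ids = [row["userId"] for row in rows]

print(first_char)
print(user_ids)
```

Once every element has been passed through json.loads(), the original loop body (indexing by "userId") works as intended.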

EDIT: The issue is that collect() returns a list, not a dict. I've updated the code. Always read the docs.

collect(): Return a list that contains all of the elements in this RDD.

Note: This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.
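
Given that warning, one option is to avoid collecting everything and decode rows lazily, keeping only the fields you need. The sketch below is an illustration under assumptions, not part of the original answer: pick_fields is a hypothetical helper, and the sample list stands in for Spark output so the snippet runs on its own. With a real DataFrame, the input could come from RDD.take(n) or RDD.toLocalIterator(), as shown in the comments.

```python
import json


def pick_fields(json_rows, fields):
    """Hypothetical helper: lazily decode JSON strings, keeping only `fields`."""
    for s in json_rows:
        row = json.loads(s)
        # row.get() returns None for a field that is missing from the row.
        yield {name: row.get(name) for name in fields}


# Assumed usage with Spark (not executed here):
#   small = list(pick_fields(result.toJSON().take(10), ["userId", "title"]))
#   lazy = pick_fields(result.toJSON().toLocalIterator(), ["userId", "title"])

# Stand-in data so the sketch runs without Spark.
sample = ['{"userId":"1","systemId":"30","title":"interest"}']
selected = list(pick_fields(sample, ["userId", "title"]))
print(selected)
```

take(n) brings only the first n rows to the driver, while toLocalIterator() fetches one partition at a time, so neither needs the whole dataset in driver memory at once.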

EDIT2: I can't emphasize this enough: always read the docs.

