Pyspark - converting json string to DataFrame

Question
I have a test2.json file that contains simple json:
{ "Name": "something", "Url": "https://stackoverflow.com", "Author": "jangcy", "BlogEntries": 100, "Caller": "jangcy"}
I have uploaded my file to blob storage and I create a DataFrame from it:
df = spark.read.json("/example/data/test2.json")
Then I can see it without any problems:
df.show()
+------+-----------+------+---------+--------------------+
|Author|BlogEntries|Caller| Name| Url|
+------+-----------+------+---------+--------------------+
|jangcy| 100|jangcy|something|https://stackover...|
+------+-----------+------+---------+--------------------+
Second scenario: I declare the same json string within my notebook:
newJson = '{ "Name": "something", "Url": "https://stackoverflow.com", "Author": "jangcy", "BlogEntries": 100, "Caller": "jangcy"}'
I can print it, etc. But now if I'd like to create a DataFrame from it:
df = spark.read.json(newJson)
I get a 'Relative path in absolute URI' error:
'java.net.URISyntaxException: Relative path in absolute URI: { "Name":%20%22something%22,%20%20%22Url%22:%20%22https:/stackoverflow.com%22,%20%20%22Author%22:%20%22jangcy%22,%20%20%22BlogEntries%22:%20100,%20%20%22Caller%22:%20%22jangcy%22%7D'
Traceback (most recent call last):
File "/usr/hdp/current/spark2-client/python/pyspark/sql/readwriter.py", line 249, in json
return self._df(self._jreader.json(self._spark._sc._jvm.PythonUtils.toSeq(path)))
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: 'java.net.URISyntaxException: Relative path in absolute URI: { "Name":%20%22something%22,%20%20%22Url%22:%20%22https:/stackoverflow.com%22,%20%20%22Author%22:%20%22jangcy%22,%20%20%22BlogEntries%22:%20100,%20%20%22Caller%22:%20%22jangcy%22%7D'
Should I apply additional transformations to the newJson string? If yes, what should they be? Please forgive me if this is too trivial, as I am very new to Python and Spark.
I am using a Jupyter notebook with the PySpark3 kernel.

Thanks.
Answer

The problem is that spark.read.json() interprets its string argument as a file path, not as JSON content, which is why the raw string is rejected as a malformed URI. You can instead wrap the string in an RDD:
newJson = '{"Name":"something","Url":"https://stackoverflow.com","Author":"jangcy","BlogEntries":100,"Caller":"jangcy"}'
df = spark.read.json(sc.parallelize([newJson]))
df.show(truncate=False)
which should give:
+------+-----------+------+---------+-------------------------+
|Author|BlogEntries|Caller|Name |Url |
+------+-----------+------+---------+-------------------------+
|jangcy|100 |jangcy|something|https://stackoverflow.com|
+------+-----------+------+---------+-------------------------+
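As a sanity check, note that the string itself is perfectly valid JSON; Python's standard json module parses it without complaint. The sketch below (plain Python, no Spark required) confirms this, and the trailing comment shows one alternative way to build the DataFrame from the parsed dict, assuming an active SparkSession named spark:

```python
import json

newJson = '{"Name":"something","Url":"https://stackoverflow.com","Author":"jangcy","BlogEntries":100,"Caller":"jangcy"}'

# Parse the JSON string into a Python dict to verify it is well-formed.
parsed = json.loads(newJson)
print(parsed["Name"])         # something
print(parsed["BlogEntries"])  # 100

# With a live SparkSession, the parsed dict can also be turned into a
# single-row DataFrame directly (alternative to the RDD approach above):
# df = spark.createDataFrame([parsed])
```

Both routes produce the same one-row DataFrame; the RDD approach has the advantage of letting Spark infer the schema with the same code path used for files.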