Zeppelin: Scala DataFrame to Python
Problem description
If I have a Scala paragraph with a DataFrame, can I share it with Python and use it there? (As I understand it, pyspark uses py4j.)
I tried this:
Scala paragraph:
x.printSchema
z.put("xtable", x )
Python paragraph:
%pyspark
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
the_data = z.get("xtable")
print the_data
sns.set()
g = sns.PairGrid(data=the_data,
x_vars=dependent_var,
y_vars=sensor_measure_columns_names + operational_settings_columns_names,
hue="UnitNumber", size=3, aspect=2.5)
g = g.map(plt.plot, alpha=0.5)
g = g.set(xlim=(300,0))
g = g.add_legend()
Error:
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark.py", line 222, in <module>
eval(compiledCode)
File "<string>", line 15, in <module>
File "/usr/local/lib/python2.7/dist-packages/seaborn/axisgrid.py", line 1223, in __init__
hue_names = utils.categorical_order(data[hue], hue_order)
TypeError: 'JavaObject' object has no attribute '__getitem__'
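The traceback comes from seaborn evaluating `data[hue]`: `z.get` hands back a raw py4j `JavaObject` handle rather than a pandas or PySpark DataFrame, and that handle cannot be indexed. A minimal sketch of the failure mode follows. It uses a hypothetical stand-in class (`OpaqueHandle` and `can_index` are illustrative names, not part of py4j or Zeppelin) so that it runs without Spark:

```python
class OpaqueHandle(object):
    """Hypothetical stand-in for a py4j JavaObject: it holds a reference
    but defines no __getitem__, so it cannot be indexed like a DataFrame."""

    def __init__(self, name):
        self.name = name


def can_index(obj, key):
    """Return True if obj[key] works, False if obj is not subscriptable."""
    try:
        obj[key]
        return True
    except TypeError:
        return False


handle = OpaqueHandle("xtable")
print(can_index(handle, "UnitNumber"))                  # False
print(can_index({"UnitNumber": [1, 2]}, "UnitNumber"))  # True
```

Anything passed to seaborn as `data` has to pass this kind of check, which is why the handle must first be turned into a pandas DataFrame.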
Solution:
%pyspark
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import StringIO
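def show(p):
    # Render the figure to SVG in memory and emit it through Zeppelin's %html display
    img = StringIO.StringIO()
    p.savefig(img, format='svg')
    img.seek(0)
    print "%html <div style='width:600px'>" + img.buf + "</div>"

# Read the temporary table registered from the Scala paragraph
df = sqlContext.table("fd")
df.printSchema()
# Collect to a local pandas DataFrame so seaborn can index it
pdf = df.toPandas()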
g = sns.pairplot(data=pdf,
x_vars=["setting1","setting2"],
y_vars=["s4", "s3",
"s9", "s8",
"s13", "s6"],
hue="id", aspect=2)
show(g)
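The `show` helper above is written for Python 2 (the `StringIO` module and the `print` statement). On a Python 3 interpreter the same idea could be sketched as below. `figure_to_html` and its `width_px` parameter are hypothetical names introduced here, and the function only assumes that its argument has a `savefig(file, format=...)` method, as matplotlib figures and seaborn grids do:

```python
import io


def figure_to_html(fig, width_px=600):
    """Render a figure-like object to inline SVG wrapped for Zeppelin's
    %html display system. `fig` only needs a savefig(file, format=...) method."""
    buf = io.StringIO()
    fig.savefig(buf, format="svg")
    # getvalue() returns everything written, regardless of stream position
    return "%html <div style='width:{}px'>{}</div>".format(width_px, buf.getvalue())
```

In a `%pyspark` paragraph this would be used as `print(figure_to_html(g))`.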
You can register the DataFrame as a temporary table in Scala:
// registerTempTable in Spark 1.x
df.createTempView("df")
and read it in Python with SQLContext.table:
df = sqlContext.table("df")
If you really want to use put / get, you'll have to build the Python DataFrame from scratch:
z.put("df", df: org.apache.spark.sql.DataFrame)
from pyspark.sql import DataFrame
df = DataFrame(z.get("df"), sqlContext)
To plot with matplotlib, you'll have to convert the DataFrame to a local Python object with either collect or toPandas:
pdf = df.toPandas()
Please note that this fetches the data to the driver, so the whole result has to fit in the driver's memory.
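Because the collected data lands on the driver, it can help to cap the row count on the cluster before collecting. A minimal sketch, assuming `df` is a PySpark DataFrame; `to_local_pandas` is a hypothetical helper and the `max_rows` default is purely illustrative, not a recommended value:

```python
def to_local_pandas(df, max_rows=10000):
    """Limit the row count on the cluster, then collect to the driver.

    Uses DataFrame.limit and DataFrame.toPandas from PySpark; `max_rows`
    is an illustrative cap chosen for this sketch."""
    return df.limit(max_rows).toPandas()
```

For example, `pdf = to_local_pandas(sqlContext.table("df"), max_rows=5000)` would hand at most 5000 rows to seaborn. Note that `limit` takes the first rows rather than a random sample.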
See also: moving Spark DataFrame from Python to Scala within Zeppelin.