How to load a table from a SQLite db file in PySpark?


Question

I am trying to load a table from a SQLite .db file stored on local disk. Is there a clean way to do this in PySpark?

Currently, I am using a solution that works but is not very elegant. First I read the table with pandas through sqlite3. One concern is that schema information is not passed along during this process (which may or may not be a problem). I am wondering whether there is a direct way to load the table without using pandas.

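import sqlite3
import pandas as pd

db_path = 'alocalfile.db'
query = 'SELECT * from ATableToLoad'

# Read the table into a pandas DataFrame through sqlite3
conn = sqlite3.connect(db_path)
a_pandas_df = pd.read_sql_query(query, conn)
conn.close()

# sqlContext is the SQLContext instance provided by the pyspark shell
a_spark_df = sqlContext.createDataFrame(a_pandas_df)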

There seems to be a way to do this using JDBC, but I have not figured out how to use it in PySpark.

Recommended answer

First, you need to start pyspark with the JDBC driver jar on the classpath. Download the SQLite JDBC driver and provide the jar path in the command below. https://bitbucket.org/xerial/sqlite-jdbc/downloads/sqlite-jdbc-3.8.6.jar

pyspark --conf spark.executor.extraClassPath=<jdbc.jar> --driver-class-path <jdbc.jar> --jars <jdbc.jar> --master <master-URL>

For an explanation of the pyspark command above, see the following post:

Apache Spark: JDBC connection not working

Now, to read the SQLite database file, simply load it into a Spark DataFrame through the JDBC data source:

df = sqlContext.read.format('jdbc').options(
    url='jdbc:sqlite:Chinook_Sqlite.sqlite',
    dbtable='employee',
    driver='org.sqlite.JDBC').load()
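
The read call above hard-codes a relative database path in the JDBC URL. As a minimal sketch, the options can be built from an absolute path instead, on the assumption that the driver would otherwise resolve a relative path against the JVM's working directory. `sqlite_jdbc_options` is a hypothetical helper name, not part of PySpark; the file and table names are the ones from the answer.

```python
import os


def sqlite_jdbc_options(db_path, table):
    """Build the option dict for a Spark JDBC read of one SQLite table.

    Hypothetical helper for illustration; not part of PySpark.
    """
    # Use an absolute path so the URL does not depend on the working directory.
    abs_path = os.path.abspath(db_path)
    return {
        "url": "jdbc:sqlite:" + abs_path,
        "dbtable": table,
        "driver": "org.sqlite.JDBC",
    }


opts = sqlite_jdbc_options("Chinook_Sqlite.sqlite", "employee")
print(opts["url"])

# In a pyspark shell started with the driver jar on the classpath:
# df = sqlContext.read.format("jdbc").options(**opts).load()
```

The helper only builds a dictionary, so it can be checked without a running Spark session; the commented-out last line shows where it would plug into the read.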

Call df.printSchema() to see your schema.
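
Since the question raises the concern that schema information may get lost, it can help to compare what Spark prints with what the SQLite file itself declares. The sketch below uses only the standard `sqlite3` module and SQLite's `PRAGMA table_info`. The helper name `sqlite_columns` and the demo table are made up for illustration, and the table name is assumed to come from a trusted source because it is interpolated into the statement.

```python
import os
import sqlite3
import tempfile


def sqlite_columns(db_path, table):
    """Return (column name, declared type) pairs for a table in a SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        rows = conn.execute('PRAGMA table_info("{}")'.format(table)).fetchall()
    finally:
        conn.close()
    return [(row[1], row[2]) for row in rows]


# Demo on a throwaway database; the table and its columns are illustrative only.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(demo_path)
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")
conn.commit()
conn.close()

columns = sqlite_columns(demo_path, "employee")
print(columns)
```

The declared types returned here are SQLite's own type names, so they will not match Spark's type names literally; the comparison is mainly useful for spotting missing columns or unexpected types.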

Full code: https://github.com/charles2588/bluemixsparknotebooks/blob/master/Python/sqllite_jdbc_bluemix.ipynb

Thanks, Charles.
