Create Spark Dataframe from SQL Query


Problem Description

I'm sure this is a simple SQLContext question, but I can't find any answer in the Spark docs or Stackoverflow

I want to create a Spark Dataframe from a SQL Query on MySQL

For example, I have a complicated MySQL query like

SELECT a.X,b.Y,c.Z FROM FOO as a JOIN BAR as b ON ... JOIN ZOT as c ON ... WHERE ...

and I want a Dataframe with Columns X, Y and Z

I figured out how to load entire tables into Spark, and I could load them all and then do the joining and selection there. However, that is very inefficient. I just want to load the table generated by my SQL query.

Here is my current approximation of the code, which doesn't work. Mysql-connector has an option "dbtable" that can be used to load a whole table; I am hoping there is some way to specify a query instead.

  val df = sqlContext.format("jdbc").
    option("url", "jdbc:mysql://localhost:3306/local_content").
    option("driver", "com.mysql.jdbc.Driver").
    option("useUnicode", "true").
    option("continueBatchOnError", "true").
    option("useSSL", "false").
    option("user", "root").
    option("password", "").
    sql("""
      select dl.DialogLineID, dlwim.Sequence, wi.WordRootID from Dialog as d
      join DialogLine as dl on dl.DialogID=d.DialogID
      join DialogLineWordInstanceMatch as dlwim on dlwim.DialogLineID=dl.DialogLineID
      join WordInstance as wi on wi.WordInstanceID=dlwim.WordInstanceID
      join WordRoot as wr on wr.WordRootID=wi.WordRootID
      where d.InSite=1 and dl.Active=1
      limit 100
    """).load()

Answer

I'm doing bulk data migration through Spark SQL here.

The dbtable parameter can be any query wrapped in parentheses with an alias. So in my case, I need to do this:

val query = """
  (select dl.DialogLineID, dlwim.Sequence, wi.WordRootID from Dialog as d
    join DialogLine as dl on dl.DialogID=d.DialogID
    join DialogLineWordInstanceMatch as dlwim on dlwim.DialogLineID=dl.DialogLineID
    join WordInstance as wi on wi.WordInstanceID=dlwim.WordInstanceID
    join WordRoot as wr on wr.WordRootID=wi.WordRootID
    where d.InSite=1 and dl.Active=1
    limit 100) foo
"""

val df = sqlContext.read.format("jdbc").
  option("url", "jdbc:mysql://localhost:3306/local_content").
  option("driver", "com.mysql.jdbc.Driver").
  option("useUnicode", "true").
  option("continueBatchOnError", "true").
  option("useSSL", "false").
  option("user", "root").
  option("password", "").
  option("dbtable", query).
  load()

As expected, loading each table as its own Dataframe and joining them in Spark was very inefficient.
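The pattern generalizes: any SELECT can go through the "dbtable" option as long as it is wrapped in parentheses and given an alias, because the JDBC source substitutes the value into a `SELECT * FROM <dbtable>` statement, so a parenthesized, aliased subquery is treated as a derived table. As a minimal sketch, a hypothetical helper like the following (the name `asDbTable` is ours, not part of Spark) does the wrapping:

```scala
// Hypothetical helper (not part of Spark's API): wrap an arbitrary SQL query
// so it can be passed as the "dbtable" option. A trailing semicolon would
// break the generated "SELECT * FROM (...) alias" statement, so strip it.
def asDbTable(query: String, alias: String = "q"): String =
  s"(${query.trim.stripSuffix(";")}) $alias"

// For example:
val dbtable = asDbTable("select X, Y from FOO limit 10")
// dbtable == "(select X, Y from FOO limit 10) q"
```

The result can then be passed as `option("dbtable", dbtable)` exactly as in the answer above.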
