Integrating Spark SQL and Apache Drill through JDBC


Question

I would like to create a Spark SQL DataFrame from the results of a query performed over CSV data (on HDFS) with Apache Drill. I successfully configured Spark SQL to make it connect to Drill via JDBC:

// Spark 1.x API (SQLContext / DataFrame), matching the original snippet
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.sql.DataFrame;

Map<String, String> connectionOptions = new HashMap<String, String>();
connectionOptions.put("url", args[0]);      // the jdbc:drill:... connection URL
connectionOptions.put("dbtable", args[1]);  // the Drill view/table to query
connectionOptions.put("driver", "org.apache.drill.jdbc.Driver");

DataFrame logs = sqlc.read().format("jdbc").options(connectionOptions).load();
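Those three options (url, dbtable, driver) are the minimum the JDBC source needs. The map itself can be built and inspected without a Spark cluster; a minimal standalone sketch (the URL and table name below are hypothetical placeholders, not values from the original post):

```java
import java.util.HashMap;
import java.util.Map;

public class DrillOptions {
    // Build the option map Spark's jdbc source expects for a Drill connection
    public static Map<String, String> build(String url, String table) {
        Map<String, String> opts = new HashMap<>();
        opts.put("url", url);          // e.g. jdbc:drill:drillbit=localhost:31010 (hypothetical)
        opts.put("dbtable", table);    // the Drill view/table to query
        opts.put("driver", "org.apache.drill.jdbc.Driver");
        return opts;
    }

    public static void main(String[] args) {
        Map<String, String> opts = build("jdbc:drill:drillbit=localhost:31010", "dfs.output.`my_view`");
        System.out.println(opts.get("driver")); // prints org.apache.drill.jdbc.Driver
    }
}
```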

Spark SQL performs two queries: the first one to get the schema, and the second one to retrieve the actual data:

SELECT * FROM (SELECT * FROM dfs.output.`my_view`) WHERE 1=0

SELECT "field1","field2","field3" FROM (SELECT * FROM dfs.output.`my_view`)

The first one is successful, but in the second one Spark encloses fields within double quotes, which is something that Drill doesn't support, so the query fails.
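The failure can be reproduced without Spark: Spark's default JDBC dialect wraps identifiers in double quotes, while Drill expects bare or backtick-quoted identifiers. A minimal illustration of the two quoting styles (the helper names below are mine for illustration, not Spark API):

```java
public class IdentifierQuoting {
    // What Spark's default JDBC dialect emits: "field1" — rejected by Drill
    static String defaultQuote(String col) {
        return "\"" + col + "\"";
    }

    // What Drill accepts: a backtick-quoted identifier: `field1`
    static String drillQuote(String col) {
        return "`" + col + "`";
    }

    public static void main(String[] args) {
        System.out.println(defaultQuote("field1")); // prints "field1"
        System.out.println(drillQuote("field1"));   // prints `field1`
    }
}
```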

Did anyone manage to get this integration working?

Thank you!

Answer

You can add a JDBC dialect for this and register it before using the JDBC connector:

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

case object DrillDialect extends JdbcDialect {

  // Apply this dialect to any Drill JDBC URL
  def canHandle(url: String): Boolean = url.startsWith("jdbc:drill:")

  // Drill rejects double-quoted identifiers, so pass column names through unquoted
  override def quoteIdentifier(colName: String): String = colName
}

// Register the dialect before loading the DataFrame through the jdbc source
JdbcDialects.registerDialect(DrillDialect)

