Pseudocolumn in Spark JDBC


Problem Description

I am using a query to fetch data from MySQL as follows:

val df = spark.read.format("jdbc")
         .option("url", "jdbc:mysql://10.0.0.192:3306/retail_db")
         .option("driver", "com.mysql.jdbc.Driver")
         .option("user", "retail_dba")
         .option("password", "cloudera")
         .option("dbtable", "orders")
         .option("partitionColumn", "order_id")
         .option("lowerBound", "1")
         .option("upperBound", "68883")
         .option("numPartitions", "4")
         .load()

The question is, can I use a pseudocolumn (like ROWNUM in Oracle or RRN(employeeno) in DB2) as the column I specify in the partitionColumn option?

If not, can we specify a partition column which is not a primary key?

Recommended Answer

can I use a pseudo column (like ROWNUM in Oracle or RRN(employeeno) in DB2)

TL;DR Probably not.

While Spark doesn't consider constraints like PRIMARY KEY or UNIQUE, there is a very important requirement for partitionColumn which is not explicitly stated in the documentation: it has to be deterministic.

Each executor fetches its own piece of data using a separate transaction. If the numeric column is not deterministic (stable, preserved between transactions), the state of the data seen by Spark might be inconsistent, and records might be duplicated or skipped.
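To make the range splitting concrete, here is a rough Python sketch of how Spark's JDBC reader turns partitionColumn, lowerBound, upperBound and numPartitions into one WHERE clause per partition. This approximates the logic of Spark's internal JDBCRelation.columnPartition; it is an illustration, not the actual Spark source:

```python
def partition_where_clauses(column, lower, upper, num_partitions):
    """Approximate the per-partition WHERE clauses Spark generates
    for a range-partitioned JDBC read (a sketch, not Spark's code)."""
    if num_partitions == 1:
        return ["1 = 1"]  # a single partition reads the whole table
    stride = upper // num_partitions - lower // num_partitions
    bounds = [lower + i * stride for i in range(1, num_partitions)]
    # First partition also picks up NULLs in the partition column.
    clauses = [f"{column} < {bounds[0]} OR {column} IS NULL"]
    for i in range(1, num_partitions - 1):
        clauses.append(f"{column} >= {bounds[i - 1]} AND {column} < {bounds[i]}")
    clauses.append(f"{column} >= {bounds[-1]}")
    return clauses

# The values from the question: order_id in [1, 68883], 4 partitions.
clauses = partition_where_clauses("order_id", 1, 68883, 4)
for c in clauses:
    print(c)
```

Each clause is sent to the database as a separate query by a separate task, which is exactly why the column must number the same rows the same way every time it is read.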

Because ROWNUM implementations are usually volatile (they depend on a non-stable ordering and can be affected by features like indexing), they are not a safe choice for partitionColumn. For the same reason, you cannot use random numbers.
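The failure mode can be reproduced without a database. In this hypothetical Python simulation, two "transactions" number the same 100 records under different scan orders (as a volatile ROWNUM-style pseudocolumn might), and each partition reads its half of the row-number range:

```python
# Hypothetical simulation: the same 100 records, numbered differently
# by a volatile ROWNUM-style pseudocolumn in each transaction.
rows = [f"record-{i}" for i in range(100)]

def fetch_partition(scan_order, lo, hi):
    # Return records whose 1-based "row number" falls in [lo, hi).
    return [r for n, r in enumerate(scan_order, start=1) if lo <= n < hi]

# Transaction 1 happens to scan in insertion order; transaction 2 runs
# separately and happens to scan in reverse order.
part1 = fetch_partition(rows, 1, 51)                    # "rownum" 1..50
part2 = fetch_partition(list(reversed(rows)), 51, 101)  # "rownum" 51..100

seen = part1 + part2
missing = set(rows) - set(seen)
duplicated = {r for r in seen if seen.count(r) > 1}
# 100 records were fetched in total, yet half are missing and half
# were read twice -- exactly the silent corruption described above.
```

The read "succeeds" and returns the right number of rows, which is what makes this class of bug so hard to notice.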

Also, some vendors might further limit the allowed operations on pseudocolumns, making them unsuitable for use as a partitioning column. For example, Oracle ROWNUM:

Conditions testing for ROWNUM values greater than a positive integer are always false.

might fail silently, leading to incorrect results.

can we specify a partition column which is not a primary key

Yes, as long as it satisfies the criteria described above.
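Any deterministic column with roughly known bounds will do. Alternatively, Spark can take an explicit list of non-overlapping predicates (the predicates argument of spark.read.jdbc in PySpark), one per partition, which works well for columns like a date. The order_date column and the date range below are hypothetical; only the predicate-building is shown and runnable here:

```python
from datetime import date, timedelta

def month_predicates(column, start, end):
    """Build one non-overlapping predicate per month. Each string would
    become one partition if passed as spark.read.jdbc(..., predicates=...)."""
    preds = []
    cur = start
    while cur < end:
        # First day of the next month, capped at the end of the range.
        nxt = min((cur.replace(day=1) + timedelta(days=32)).replace(day=1), end)
        preds.append(f"{column} >= '{cur}' AND {column} < '{nxt}'")
        cur = nxt
    return preds

# Hypothetical range for the orders table's order_date column.
preds = month_predicates("order_date", date(2013, 7, 1), date(2014, 1, 1))
for p in preds:
    print(p)
```

Because the predicates partition the data explicitly and deterministically, each executor's transaction always sees the same slice, regardless of keys or pseudocolumns.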
