spark, scala & jdbc - how to limit number of records
Question
Is there a way to limit the number of records fetched from a JDBC source using Spark SQL 2.2.0?
I am dealing with the task of moving (and transforming) a large number of records, more than 200M, from one MS SQL Server table to another:
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("co.smith.copydata")
  .getOrCreate()

val sourceData = spark
  .read
  .format("jdbc")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  .option("url", jdbcSqlConnStr)
  .option("dbtable", sourceTableName)
  .load()
  .take(limit)
While this works, it clearly loads all 200M records from the database first, taking its sweet 18 minutes, and only then returns the limited number of records I want for testing and development purposes.
Switching take(...) and load() around produces a compilation error: take is a Dataset method that only exists after load() has returned a DataFrame, and it collects rows to the driver rather than pushing a limit down to the source.
I appreciate there are ways to copy sample data into a smaller table, or to use SSIS or other ETL tools.
I am really curious whether there is a way to achieve my goal using Spark, SQL and JDBC.
Answer
To limit the number of downloaded rows, a SQL query can be used in the "dbtable" option instead of a table name, as described in the documentation: https://spark.apache.org/docs/latest/sql-programming-guide.html
In that query you can specify a "where" condition, for example using server-specific features to limit the number of rows (like "rownum" in Oracle, or "TOP" in SQL Server). Because the query runs on the database server, only the limited rows ever cross the wire.
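A minimal sketch of this approach against SQL Server: wrap a TOP-limited query as a derived table and pass it to "dbtable". The helper name limitedDbtable and the alias limited_src are my own; jdbcSqlConnStr and sourceTableName come from the question.

```scala
// Build a derived-table expression for the "dbtable" option.
// SQL Server syntax (TOP); on Oracle you would filter on ROWNUM instead.
// The alias ("limited_src") is required because Spark wraps the value of
// "dbtable" in its own SELECT ... FROM (...) query.
def limitedDbtable(table: String, limit: Int): String =
  s"(SELECT TOP $limit * FROM $table) AS limited_src"

// Usage inside a Spark application (assumes jdbcSqlConnStr and
// sourceTableName are defined as in the question):
// val sourceData = spark.read
//   .format("jdbc")
//   .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
//   .option("url", jdbcSqlConnStr)
//   .option("dbtable", limitedDbtable(sourceTableName, 100000))
//   .load()
```

With this, the TOP clause executes on the SQL Server side, so load() fetches only the limited rows instead of scanning the full 200M-row table.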