How to read many tables from the same database and save them to their own CSV file?

Problem Description

Below is working code that connects to a SQL Server and saves one table to a CSV file.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("test").setMaster("local").set("spark.driver.allowMultipleContexts", "true")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.format("jdbc").option("url","jdbc:sqlserver://DBServer:PORT").option("databaseName","xxx").option("driver","com.microsoft.sqlserver.jdbc.SQLServerDriver").option("dbtable","xxx").option("user","xxx").option("password","xxxx").load()

df.registerTempTable("test")
df.write.format("com.databricks.spark.csv").save("poc/amitesh/csv")
sys.exit()

I have a scenario wherein I have to save 4 tables from the same database, in CSV format, into 4 different files at a time through pyspark code. Is there any way we can achieve the objective? Or are these splits done at the HDFS block-size level, so that if you have a 300 MB file and the HDFS block size is set to 128 MB, you get 3 blocks of 128 MB, 128 MB and 44 MB respectively?

Recommended Answer

I have to save 4 tables from the same database in CSV format in 4 different files at a time through pyspark code.

You have to code a transformation (reading and writing) for every table in the database (using sqlContext.read.format).

The only difference between the table-specific ETL pipelines is a different dbtable option per table. Once you have a DataFrame, save it to its own CSV file.

The code could look as follows (in Scala, so I leave converting it to Python as a home exercise):

val datasetFromTABLE_ONE: DataFrame = sqlContext.
  read.
  format("jdbc").
  option("url","jdbc:sqlserver://DBServer:PORT").
  option("databaseName","xxx").
  option("driver","com.microsoft.sqlserver.jdbc.SQLServerDriver").
  option("dbtable","TABLE_ONE").
  option("user","xxx").
  option("password","xxxx").
  load()

// save the dataset from TABLE_ONE into its own CSV file
datasetFromTABLE_ONE.write.csv("table_one.csv")

Repeat the same code for every table you want to save to CSV.
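
Rather than copy-pasting that block once per table, you could factor the dbtable option out into a loop over the table names. A minimal sketch in the same Scala style (the table names and output paths below are placeholders, not from the original question):

// hypothetical list of the tables to export
val tables = Seq("TABLE_ONE", "TABLE_TWO", "TABLE_THREE", "TABLE_FOUR")

tables.foreach { table =>
  val df = sqlContext.
    read.
    format("jdbc").
    option("url","jdbc:sqlserver://DBServer:PORT").
    option("databaseName","xxx").
    option("driver","com.microsoft.sqlserver.jdbc.SQLServerDriver").
    option("dbtable", table).
    option("user","xxx").
    option("password","xxxx").
    load()

  // each table lands in its own CSV output directory
  df.write.csv(s"${table.toLowerCase}.csv")
}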

Done!

Another solution:

What about when I have 100 or more tables? How do I optimize the code for that? How do I do it effectively in Spark? Any parallelization?

The SparkContext that sits behind the SparkSession we use for the ETL pipeline is thread-safe, which means that you can use it from multiple threads. If you think of a thread per table, that's the right approach.

You could spawn as many threads as you have tables, say 100, and start them. Spark could then decide what and when to execute.

That's something Spark handles using Fair Scheduler Pools. It's a not-very-widely-known feature of Spark that'd be worth considering for this case:

Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By "job", in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark's scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).

Use it and your loading and saving pipelines may get faster.
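
As a rough illustration of the thread-per-table idea combined with scheduler pools, here is one possible sketch. It assumes spark.scheduler.mode=FAIR is set in the Spark configuration and uses a pool name "etl"; both are assumptions for illustration, not part of the original answer:

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// hypothetical list of tables; one thread per table
val tables = Seq("TABLE_ONE", "TABLE_TWO", "TABLE_THREE", "TABLE_FOUR")
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(tables.size))

val jobs = tables.map { table =>
  Future {
    // setLocalProperty is thread-local, so each thread picks its own pool
    sc.setLocalProperty("spark.scheduler.pool", "etl")
    val df = sqlContext.
      read.
      format("jdbc").
      option("url","jdbc:sqlserver://DBServer:PORT").
      option("databaseName","xxx").
      option("driver","com.microsoft.sqlserver.jdbc.SQLServerDriver").
      option("dbtable", table).
      option("user","xxx").
      option("password","xxxx").
      load()
    df.write.csv(s"${table.toLowerCase}.csv")
  }
}

// block until every export has finished
Await.result(Future.sequence(jobs), Duration.Inf)

Each Future submits an independent Spark job from its own thread, and with the fair scheduler those jobs share cluster resources instead of queuing up one after another.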
