How to read many tables from the same database and save them to their own CSV file?


Question

Below is working code that connects to SQL Server and saves one table to a CSV file.

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("test").setMaster("local").set("spark.driver.allowMultipleContexts", "true")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# read one table over JDBC and save it as CSV (spark-csv package)
df = sqlContext.read.format("jdbc").option("url", "jdbc:sqlserver://DBServer:PORT").option("databaseName", "xxx").option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver").option("dbtable", "xxx").option("user", "xxx").option("password", "xxxx").load()

df.registerTempTable("test")
df.write.format("com.databricks.spark.csv").save("poc/amitesh/csv")
exit()

I have a scenario where I have to save 4 tables from the same database, in CSV format, to 4 different files at a time through PySpark code. Is there any way we can achieve the objective? Or are these splits done at the HDFS block-size level, so that if you have a 300 MB file and the HDFS block size is set to 128 MB, you get 3 blocks of 128 MB, 128 MB, and 44 MB respectively?

Answer

where I have to save 4 tables from the same database, in CSV format, to 4 different files at a time through PySpark code.

You have to code a transformation (reading and writing) for every table in the database (using sqlContext.read.format).

The only difference between the table-specific ETL pipelines is the dbtable option, which changes per table. Once you have a DataFrame, save it to its own CSV file.

The code could look as follows (it is in Scala, so I leave converting it to Python as a home exercise):

val datasetFromTABLE_ONE: DataFrame = sqlContext.
  read.
  format("jdbc").
  option("url","jdbc:sqlserver://DBServer:PORT").
  option("databaseName","xxx").
  option("driver","com.microsoft.sqlserver.jdbc.SQLServerDriver").
  option("dbtable","TABLE_ONE").
  option("user","xxx").
  option("password","xxxx").
  load()

// save the dataset from TABLE_ONE into its own CSV file
datasetFromTABLE_ONE.write.csv("table_one.csv")

Repeat the same code for every table you want to save to CSV (a loop sketch follows).
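
If writing the same block four times feels repetitive, you could loop over the table names instead. A minimal sketch, assuming the same JDBC options as above (the table names are placeholders):

val tables = Seq("TABLE_ONE", "TABLE_TWO", "TABLE_THREE", "TABLE_FOUR")

tables.foreach { table =>
  // same JDBC options as above; only dbtable changes per table
  val df = sqlContext.
    read.
    format("jdbc").
    option("url","jdbc:sqlserver://DBServer:PORT").
    option("databaseName","xxx").
    option("driver","com.microsoft.sqlserver.jdbc.SQLServerDriver").
    option("dbtable",table).
    option("user","xxx").
    option("password","xxxx").
    load()

  // each table is written to its own CSV output
  df.write.csv(s"${table.toLowerCase}.csv")
}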

Done!

The solution begs another question:

What if I have 100 or more tables? How do I optimize the code for that? How can it be done efficiently in Spark? Any parallelization?

The SparkContext that sits behind the SparkSession we use for the ETL pipeline is thread-safe, which means you can use it from multiple threads. If you think of one thread per table, that is the right approach.

You could spawn as many threads as you have tables, say 100, and start them; Spark then decides what to execute and when (see the sketch below).
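
Here is a minimal sketch of that thread-per-table idea using Scala Futures (my choice; plain threads would do as well), reusing the tables sequence and the JDBC options from the loop sketch above:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// one Future per table; each triggers its own Spark job
val exports = tables.map { table =>
  Future {
    val df = sqlContext.
      read.
      format("jdbc").
      option("url","jdbc:sqlserver://DBServer:PORT").
      option("databaseName","xxx").
      option("driver","com.microsoft.sqlserver.jdbc.SQLServerDriver").
      option("dbtable",table).
      option("user","xxx").
      option("password","xxxx").
      load()
    df.write.csv(s"${table.toLowerCase}.csv")
  }
}

// block until every export has finished
exports.foreach(Await.result(_, Duration.Inf))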

That is something Spark does using Fair Scheduler Pools, a not very widely known feature of Spark that is worth considering for this case:

Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By "job", in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark's scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
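
Wiring that up takes two steps: enable the FAIR scheduler when building the SparkContext, and have every thread assign its jobs to a pool before triggering an action. A sketch, with an illustrative pool name and no fairscheduler.xml assumed:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().
  setAppName("multi-table-export").
  setMaster("local[*]").
  set("spark.scheduler.mode", "FAIR")   // default is FIFO
val sc = new SparkContext(conf)

// inside each worker thread, before the read/write is triggered:
sc.setLocalProperty("spark.scheduler.pool", "exports")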

Use it and your loading and saving pipelines may get faster.
