Exporting SQL Server Table to multiple part files
Problem description
I need to export a fairly large SQL Server table (~100 GB) to a CSV file. But rather than the output being a single CSV file, it should ideally be multiple files, say 10 files of 10 GB each.
I see BCP has a batch_size argument, but this still writes all data to a single file. Are there other free utilities for doing what I require, where the size of each file can be specified in bytes or in number of rows?
For a bit of context, this is so the data can be combined with other sources in a Hive/Hadoop platform, so if there are better ways of exporting the data I'm open to suggestions.
Solution
BCP's batch_size argument does not control the output, unfortunately.
Ways I've done this type of splitting:
1 - Simple but non-repeatable: Create a command file (.cmd) that runs multiple BCP commands over the table for specific row ranges. This probably requires an IDENTITY(1,1)-based primary key on the table.

bcp "SELECT * FROM MyTable WHERE Id BETWEEN 0 AND 9999999" queryout ...
bcp "SELECT * FROM MyTable WHERE Id BETWEEN 10000000 AND 19999999" queryout ...

(Note that BETWEEN is inclusive on both ends, so the ranges must not share a boundary value or the boundary row will be exported twice.)
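Writing those range queries by hand gets tedious for a large table, so a small script can generate the command file instead. This is only a sketch: the table name, output folder, ID range, and chunk size below are all placeholder assumptions, not values from the original answer.

```python
# Sketch: emit one bcp command per ID range. Table name, paths, and
# the 10M-row chunk size are illustrative assumptions.
def make_bcp_commands(max_id, chunk_size, table="MyTable"):
    commands = []
    start = 0
    while start <= max_id:
        end = min(start + chunk_size - 1, max_id)  # BETWEEN is inclusive on both ends
        part = len(commands)
        commands.append(
            f'bcp "SELECT * FROM {table} WHERE Id BETWEEN {start} AND {end}" '
            f"queryout C:\\MyFolder\\MyFile_{part}.csv -T -w"
        )
        start = end + 1  # next range starts just past the previous one
    return commands

if __name__ == "__main__":
    # Redirect this output to a .cmd file and run it.
    for cmd in make_bcp_commands(max_id=29_999_999, chunk_size=10_000_000):
        print(cmd)
```

Redirecting the output to a `.cmd` file reproduces option 1 without hand-editing each range.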
2 - Simple and repeatable, uses a lot of disk: BCP out the entire table to a single file and use split to create as many new files as needed with a given number of bytes in each (note: splitting by lines would be a better idea IMO). Use Cygwin (GnuWin32 is no longer maintained) to install split and any other utilities you want.

bcp MyDb.MySchema.MyTable out C:\MyFile.csv -T -w
split -b 10737418240 C:\MyFile.csv C:\MySplitFile_
This generates the following files:

C:\MySplitFile_aaa
C:\MySplitFile_aab
...
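If splitting by lines is preferred (so no CSV row is cut in half at a part-file boundary), `split -l <count>` does exactly that. The same idea can be sketched in Python; the numeric suffix format here differs from split's aaa/aab naming, and all paths are illustrative:

```python
# Sketch: split a text file into numbered part files of at most
# `lines_per_file` lines each, mirroring `split -l`.
def split_by_lines(src_path, lines_per_file, prefix):
    out_paths = []
    out = None
    with open(src_path, encoding="utf-8") as src:
        for i, line in enumerate(src):
            if i % lines_per_file == 0:  # time to start a new part file
                if out is not None:
                    out.close()
                out_paths.append(f"{prefix}{len(out_paths):03d}")
                out = open(out_paths[-1], "w", encoding="utf-8")
            out.write(line)
    if out is not None:
        out.close()
    return out_paths
```

Line-boundary splitting matters for data destined for Hive, because a byte-based split (`split -b`) can leave a partial row at the end of one file and the rest of it at the start of the next.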
3 - Complex but repeatable, requires possibly insecure T-SQL: Use the xp_cmdshell procedure to call BCP inside a stored procedure that iterates through the table.

DECLARE @loop AS INT;
--Use WHILE to loop as needed--
DECLARE @sql AS VARCHAR(MAX);
--Add code to generate dynamic SQL here--
DECLARE @bcp AS VARCHAR(MAX);
SELECT @bcp = 'BCP "' + @sql + '" queryout C:\MyFolder\MyFile_' + CAST(@loop AS VARCHAR(10)) + '.csv';

(The CAST is needed because concatenating an INT directly into a VARCHAR expression raises a conversion error.)
FINAL NOTE: If you are using any NVARCHAR fields in your data then you need to use the -w flag, and be aware that the output will be in UTF-16LE. I would strongly recommend converting that to UTF-8 using iconv (from Cygwin again) before trying to do anything with it in Hadoop.
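That iconv step is equivalent to re-reading the file as UTF-16LE and writing it back out as UTF-8, which can also be sketched directly in Python (file paths are placeholders):

```python
# Sketch: re-encode a bcp -w output file (UTF-16LE) to UTF-8,
# equivalent to `iconv -f UTF-16LE -t UTF-8 in.csv > out.csv`.
def utf16le_to_utf8(src_path, dst_path):
    with open(src_path, encoding="utf-16-le") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:  # stream line by line; never loads the whole file
            dst.write(line)
```

Streaming line by line keeps memory flat, which matters when the export is in the 100 GB range.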