使用Polybase并行将数据加载到现有表中 [英] Using Polybase to load data into an existing table in parallel

查看:118
本文介绍了使用Polybase并行将数据加载到现有表中的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

使用CTAS,我们可以利用Polybase提供的并行性,以高度可扩展和高性能的方式将数据加载到 new 表中。

Using CTAS we can leverage the parallelism that Polybase provides to load data into a new table in a highly scalable and performant way.

有没有一种方法可以使用类似的方法将数据加载到现有表中?该表甚至可能是空的。

Is there a way to use a similar approach to load data into an existing table? The table might even be empty.

创建一个外部表并使用 INSERT INTO ... SELECT * FROM ... -我认为这是通过头节点进行的,因此不是并行的?

Creating an external table and using INSERT INTO ... SELECT * FROM ... - I would assume that this goes through the head node and is therefore not in parallel?

我知道我也可以删除表并使用CTAS重新创建它但是然后我必须再次处理所有元数据(列名,数据类型,分布等)。

I know that I could also drop the table and use CTAS to recreate it but then I have to deal with all the metadata again (column names, data types, distributions, ...).

推荐答案

您可以使用分区切换来执行此操作,尽管请记住不要在Azure SQL数据仓库中使用太多分区。请参阅分区大小调整指南

You could use partition switching to do this, although remember not to use too many partitions with Azure SQL Data Warehouse. See 'Partition Sizing Guidance' here.

由于不支持检查约束,因此源表必须使用与目标表相同的分区方案。

Bear in mind check constraints are not supported so the source table has to use the same partition scheme as the target table.

具有分区和切换语法的完整示例:

Full example with partitioning and switch syntax:

-- Assume we have a file with the values 1 to 100 in it.

-- Create an external table over it; will have all records in
IF NOT EXISTS ( SELECT * FROM sys.schemas WHERE name = 'ext' )
EXEC ( 'CREATE SCHEMA ext' )
GO


-- DROP EXTERNAL TABLE ext.numbers
IF NOT EXISTS ( SELECT * FROM sys.external_tables WHERE object_id = OBJECT_ID('ext.numbers') )
CREATE EXTERNAL TABLE ext.numbers (
    number          INT             NOT NULL
    )
WITH (
    LOCATION = 'numbers.csv',
    DATA_SOURCE = eds_yourDataSource, 
    FILE_FORMAT = ff_csv
);
GO

-- Create a partitioned, internal table with the records 1 to 50
IF OBJECT_ID('dbo.numbers') IS NOT NULL DROP TABLE dbo.numbers

CREATE TABLE dbo.numbers
WITH (
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED INDEX ( number ), 
    PARTITION ( number RANGE LEFT FOR VALUES ( 50, 100, 150, 200 ) )
    )
AS 
SELECT * 
FROM ext.numbers
WHERE number Between 1 And 50;
GO

-- DBCC PDW_SHOWPARTITIONSTATS ('dbo.numbers')


-- CTAS the second half of the external table, records 51-100 into an internal one.
-- As check contraints are not available in SQL Data Warehouse, ensure the switch table
-- uses the same scheme as the original table.
IF OBJECT_ID('dbo.numbers_part2') IS NOT NULL DROP TABLE dbo.numbers_part2

CREATE TABLE dbo.numbers_part2
WITH (
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED INDEX ( number ),
    PARTITION ( number RANGE LEFT FOR VALUES ( 50, 100, 150, 200 ) )
    )
AS 
SELECT *
FROM ext.numbers
WHERE number > 50
GO


-- Partition switch it into the original table
ALTER TABLE dbo.numbers_part2 SWITCH PARTITION 2 TO dbo.numbers PARTITION 2;


SELECT *
FROM dbo.numbers
ORDER BY 1;

这篇关于使用Polybase并行将数据加载到现有表中的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆