Does Azure SQL Data Warehouse have a way to split strings?


Problem description

Doing some research, I see that there are no good options to split strings in Azure SQL Data Warehouse. It doesn't have the new STRING_SPLIT() function or the OPENJSON() function. It also doesn't allow SELECT statements in user-defined functions, so you can't create your own, like the many custom splitter functions the community has made.

Thus, I figured I would pose the question: does SQL Data Warehouse have a way to split strings, and what are the best options here?

Use case

You have a field in a SQL table with the value "My_Value_Is_Good". The objective is to split each segment into a separate field on the underscore delimiter, either in a SELECT statement or, at most, written to a new table.
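For illustration, here is what that use case looks like with the STRING_SPLIT function discussed in the answer below (a minimal sketch, assuming you are on a version of the service where STRING_SPLIT is available):

-- Split 'My_Value_Is_Good' on the underscore delimiter;
-- STRING_SPLIT returns one row per segment in a column named value
SELECT value
FROM STRING_SPLIT('My_Value_Is_Good', '_');
-- Expected rows: My, Value, Is, Good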

Solutions I have used

The main one for me is just transforming the data before it lands in the data warehouse, using Python to parse out the data. However, this slows down on bigger datasets, and once the data is in the system it limits the fix to reprocessing specific records.

Recommended answer

Update July 2019: STRING_SPLIT is now available in Azure SQL Data Warehouse, as per here. So in my example below, the code would be more like this:

DECLARE @delimiter CHAR(1) = '-';

CREATE TABLE dbo.guids_split
WITH
(
    DISTRIBUTION = HASH(xguid),
    HEAP
)
AS
SELECT *
FROM dbo.guids g
    CROSS APPLY STRING_SPLIT ( xguid, @delimiter );
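
Note that STRING_SPLIT emits a single column named value, so dbo.guids_split ends up with the columns of dbo.guids plus value. A quick sanity check of the result:

-- Peek at the first few split rows (value is the output column of STRING_SPLIT)
SELECT TOP 10 rn, xguid, value
FROM dbo.guids_split
ORDER BY rn;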

Azure SQL Data Warehouse has a reduced T-SQL surface area compared with normal SQL Server or Azure SQL Database. It does not have any of the fancy tricks such as STRING_SPLIT, table-valued functions, CLR, or XML; even cursors are not allowed. In fact, of all the techniques in one of the go-to articles on this topic (pre-SQL 2016), 'Split strings the right way - or the next best way', you can't use any of them, with the exception of the numbers table.

Therefore we need something a bit more procedural, avoiding loops of any kind. I used the above article for inspiration, with an adapted version of its test data script, and this approach:

-- Create one million guids
IF OBJECT_ID('dbo.numbers') IS NOT NULL DROP TABLE dbo.numbers
IF OBJECT_ID('dbo.guids_split') IS NOT NULL DROP TABLE dbo.guids_split
IF OBJECT_ID('dbo.guids') IS NOT NULL DROP TABLE dbo.guids
IF OBJECT_ID('tempdb..#tmp') IS NOT NULL DROP TABLE #tmp
GO


-- Numbers (tally) table used to drive the set-based split below
CREATE TABLE dbo.Numbers (
    Number  INT NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,     --!!TODO try distributing?
    CLUSTERED INDEX ( Number )
)
GO


-- Generate 1..@UpperLimit row numbers via a triple CROSS JOIN of sys.all_objects
DECLARE @UpperLimit INT = 1000000;

;WITH n AS
(
    SELECT
        x = ROW_NUMBER() OVER (ORDER BY s1.[object_id])
    FROM       sys.all_objects AS s1
    CROSS JOIN sys.all_objects AS s2
    CROSS JOIN sys.all_objects AS s3
)
SELECT x
INTO #tmp
FROM n
WHERE x BETWEEN 1 AND @UpperLimit
GO

INSERT INTO dbo.Numbers ( Number )
SELECT x
FROM #tmp
GO


-- Test data: GUIDs whose '-' separators will be split out
CREATE TABLE dbo.guids (
    rn  INT IDENTITY,
    xguid   CHAR(36) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(xguid),
    CLUSTERED COLUMNSTORE INDEX
)
GO

INSERT INTO dbo.guids ( xguid )
SELECT NEWID() xguid
FROM dbo.Numbers
GO -- 10    -- scale up 10 to 100, 1,000 etc

ALTER INDEX ALL ON dbo.guids REBUILD 
GO


-- Create the stats
CREATE STATISTICS _st_numbers_number ON dbo.numbers (number);
CREATE STATISTICS _st_guids_rn ON dbo.guids (rn);
CREATE STATISTICS _st_guids_xguid ON dbo.guids (xguid);
GO
-- multi-col stat?


-- NB The length of the guid; so we don't have to use VARCHAR(MAX)
-- Wrap each guid in leading and trailing delimiters, then join to the
-- numbers table: every position n holding a delimiter yields the token
-- between n + 1 and the next delimiter
DECLARE @delimiter VARCHAR(1) = '-';

CREATE TABLE dbo.guids_split
WITH
(
    DISTRIBUTION = HASH(xguid),
    HEAP
)
AS
SELECT
    s.rn,
    n.Number n,
    originalid AS xguid,
    LTRIM( RTRIM( SUBSTRING( s.xguid, n.Number + 1, CHARINDEX( @delimiter, s.xguid, n.Number + 1 ) - n.Number - 1 ) ) ) AS split_value
FROM (
    SELECT
        rn,
        xguid AS originalid,
        CAST( CAST( @delimiter AS VARCHAR(38) ) + CAST( xguid AS VARCHAR(38) ) + CAST( @delimiter AS VARCHAR(38) ) AS VARCHAR(38) ) AS xguid
        FROM dbo.guids
        ) s
    CROSS JOIN dbo.Numbers n
WHERE n.Number < LEN( s.xguid )
  AND SUBSTRING( s.xguid, n.Number, 1 ) = @delimiter;
GO
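
To make the mechanics concrete, here is the same numbers-table technique applied to the 'My_Value_Is_Good' string from the use case above (a minimal standalone sketch, reusing the dbo.Numbers table created earlier):

-- Wrap the string in delimiters, then find every position n holding a
-- delimiter and extract the token between n + 1 and the next delimiter
DECLARE @s VARCHAR(60) = '_' + 'My_Value_Is_Good' + '_';
DECLARE @d CHAR(1) = '_';

SELECT
    n.Number AS pos,
    SUBSTRING( @s, n.Number + 1, CHARINDEX( @d, @s, n.Number + 1 ) - n.Number - 1 ) AS split_value
FROM dbo.Numbers n
WHERE n.Number < LEN( @s )
  AND SUBSTRING( @s, n.Number, 1 ) = @d;
-- Returns: My, Value, Is, Good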


/*
SELECT TOP 10 * FROM dbo.guids ORDER BY rn;

SELECT *
FROM dbo.guids_split
WHERE rn IN ( SELECT TOP 10 rn FROM dbo.guids ORDER BY rn )
ORDER BY 1, 2;
GO

*/

The script is now tested on ADW and worked satisfactorily over 100 million records. It ran in under 4 minutes at only DWU 400 (at least once I had added the stats and removed the VARCHAR(MAX)). The guids are, however, a slightly artificial example, as the data is uniform in size and there are always exactly 5 parts to split.

Getting good performance out of Azure SQL Data Warehouse is really about minimising data movement via a good hash distribution key. Therefore please post some realistic sample data.
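
As a quick way to see whether a query will incur data movement, you can prefix it with EXPLAIN, which returns the distributed plan as XML instead of executing the query; shuffle or broadcast move steps in that plan are the things to minimise. For example, against the table created above:

-- Inspect the distributed plan; look for SHUFFLE_MOVE / BROADCAST_MOVE steps
EXPLAIN
SELECT rn, split_value
FROM dbo.guids_split
WHERE rn <= 10;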

The other alternative is Azure Data Lake Analytics (ADLA). ADLA supports federated queries to "query data where it lives", so you could query the original table using U-SQL, split it using native .NET methods, and output a file which could easily be imported using PolyBase. Let me know if you need more help with this approach and I'll do up an example.

The SQLCat team have since published an article on anti-patterns with SQL Data Warehouse, of which this type of string processing might be considered an example. Please read that article.
