如何解析大字符串U-SQL正则表达式 [英] How to parse big string U-SQL Regex

查看:113
本文介绍了如何解析大字符串U-SQL正则表达式的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个包含大字符串的大CSV.我想用U-SQL解析它们.

@t1 = 
SELECT
    Regex.Match("ID=881cf2f5f474579a:T=1489536183:S=ALNI_MZsMMpA4voGE4kQMYxooceW2AOr0Q", "ID=(?<ID>\\w+):T=(?<T>\\w+):S=(?<S>[\\w\\d_]*)") AS p
FROM
    (VALUES(1)) AS fe(n);

@t2 = 
SELECT
    p.Groups["ID"].Value AS gads_id,
    p.Groups["T"].Value AS gads_t,
    p.Groups["S"].Value AS gads_s
FROM
    @t1;

OUTPUT @t
TO "/inhabit/test.csv"
USING Outputters.Csv();

严重性代码描述项目文件行抑制状态 错误E_CSC_USER_INVALIDCOLUMNTYPE: "System.Text.RegularExpressions.Match"不能用作列类型.

我知道如何通过EXPLODE/CROSS APPLY/GROUP BY以SQL方式进行操作.但是也许没有这些舞蹈就可以做到吗?

又一次更新

@t1 = 
SELECT
    Regex.Match("ID=881cf2f5f474579a:T=1489536183:S=ALNI_MZsMMpA4voGE4kQMYxooceW2AOr0Q", "ID=(?<ID>\\w+):T=(?<T>\\w+):S=(?<S>[\\w\\d_]*)").Groups["ID"].Value AS id,
    Regex.Match("ID=881cf2f5f474579a:T=1489536183:S=ALNI_MZsMMpA4voGE4kQMYxooceW2AOr0Q", "ID=(?<ID>\\w+):T=(?<T>\\w+):S=(?<S>[\\w\\d_]*)").Groups["T"].Value AS t,
    Regex.Match("ID=881cf2f5f474579a:T=1489536183:S=ALNI_MZsMMpA4voGE4kQMYxooceW2AOr0Q", "ID=(?<ID>\\w+):T=(?<T>\\w+):S=(?<S>[\\w\\d_]*)").Groups["S"].Value AS s
FROM
    (VALUES(1)) AS fe(n);

OUTPUT @t1
TO "/inhabit/test.csv"
USING Outputters.Csv();

此警告功能正常.但是有一个问题.正则表达式每行会评估3次吗?是否有任何机会暗示U-SQL引擎-Regex.Match函数是确定性的.

解决方案

您可能应该使用比Regex.Match更有效的工具.但要回答您的原始问题:

System.Text.RegularExpressions.Match不属于内置U- SQL类型.

因此,您需要将其转换为内置类型,例如stringSqlArray<string>或将其包装为 udt 提供了IFormatter使其成为用户定义的类型./p>

I have got a big CSVs that contain big strings. I wanna parse them in U-SQL.

@t1 = 
SELECT
    Regex.Match("ID=881cf2f5f474579a:T=1489536183:S=ALNI_MZsMMpA4voGE4kQMYxooceW2AOr0Q", "ID=(?<ID>\\w+):T=(?<T>\\w+):S=(?<S>[\\w\\d_]*)") AS p
FROM
    (VALUES(1)) AS fe(n);

@t2 = 
SELECT
    p.Groups["ID"].Value AS gads_id,
    p.Groups["T"].Value AS gads_t,
    p.Groups["S"].Value AS gads_s
FROM
    @t1;

OUTPUT @t
TO "/inhabit/test.csv"
USING Outputters.Csv();

Severity Code Description Project File Line Suppression State Error E_CSC_USER_INVALIDCOLUMNTYPE: 'System.Text.RegularExpressions.Match' cannot be used as column type.

I know how to do it in a SQL way with EXPLODE/CROSS APPLY/GROUP BY. But may be it is possible to do without these dances?

One more update

@t1 = 
SELECT
    Regex.Match("ID=881cf2f5f474579a:T=1489536183:S=ALNI_MZsMMpA4voGE4kQMYxooceW2AOr0Q", "ID=(?<ID>\\w+):T=(?<T>\\w+):S=(?<S>[\\w\\d_]*)").Groups["ID"].Value AS id,
    Regex.Match("ID=881cf2f5f474579a:T=1489536183:S=ALNI_MZsMMpA4voGE4kQMYxooceW2AOr0Q", "ID=(?<ID>\\w+):T=(?<T>\\w+):S=(?<S>[\\w\\d_]*)").Groups["T"].Value AS t,
    Regex.Match("ID=881cf2f5f474579a:T=1489536183:S=ALNI_MZsMMpA4voGE4kQMYxooceW2AOr0Q", "ID=(?<ID>\\w+):T=(?<T>\\w+):S=(?<S>[\\w\\d_]*)").Groups["S"].Value AS s
FROM
    (VALUES(1)) AS fe(n);

OUTPUT @t1
TO "/inhabit/test.csv"
USING Outputters.Csv();

This wariant works fine. But there is a question. Will the regex evauated 3 times per row? Does exists any chance to hint U-SQL engine - the function Regex.Match is deterministic.

解决方案

You should probably be using something more efficient than Regex.Match. But to answer your original question:

System.Text.RegularExpressions.Match is not part of the built-in U-SQL types.

Thus you would need to convert it into a built-in type, such as string or SqlArray<string> or wrap it into a udt that provides an IFormatter to make it a user-defined type.

这篇关于如何解析大字符串U-SQL正则表达式的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆