USQL - How To Select All Rows Between Two String Rows in USQL

Problem description

Here is my complete task description:

I have to extract data from multiple files using U-SQL and output it to a CSV file. Every input file contains multiple reports, delimited by marker rows ("START OF ..." and "END OF ..." act as the report separators). Here is an example of the data format of a single source (input) file:

START OF DAILY ACCOUNT
some data 1
some data 2
some data 3
some data n
END OF DAILY ACCOUNT
START OF LEDGER BALANCE
some data 1
some data 2
some data 3
some data 4
some data 5
some data n
END OF LEDGER BALANCE
START OF DAILY SUMMARY REPORT
some data 1
some data 2
some data 3
some data n
END OF DAILY SUMMARY REPORT

So my question is: how can I fetch the records between the "START OF ..." and "END OF ..." rows for all files?

I want something like this at the end:

@dailyAccountResult = [select all rows between "START OF DAILY ACCOUNT" and "END OF DAILY ACCOUNT" rows]

@ledgerBalanceResult = [select all rows between "START OF LEDGER BALANCE" and "END OF LEDGER BALANCE" rows]

@dailySummaryReportResult = [select all rows between "START OF DAILY SUMMARY REPORT" and "END OF DAILY SUMMARY REPORT" rows]

Do I need to write a custom extractor for this? If so, please suggest how.

Recommended answer

I think this is possible using normal U-SQL without a custom extractor. I have created a simple example based on your sample data:

// Get raw input
@input =
    EXTRACT rawData string
    FROM "/input/input36.txt"
    USING Extractors.Tsv();


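// Add a row number and extract the section name from each marker row,
// so that every [START OF ...] row can be paired with its [END OF ...] row.
// !!WARNING: the code assumes there are no duplicate sections, i.e. there cannot be more than one DAILY ACCOUNT section, for example.
// Note: ROW_NUMBER() OVER() has no ORDER BY, so this also assumes the rows are numbered in file order.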
@working =
    SELECT ROW_NUMBER() OVER() AS rn,
           System.Text.RegularExpressions.Regex.Match(rawData, "(START OF|END OF) (?<sectionName>.+)").Groups["sectionName"].ToString() AS sectionName,
           *
    FROM @input;


// Work out the section boundaries
@sections =
    SELECT sectionName,
           MIN(rn) AS startRn,
           MAX(rn) AS endRn,
           COUNT( * ) AS records
    FROM @working
    WHERE sectionName != ""
    GROUP BY sectionName;


// Create the output
@output =
    SELECT s.sectionName,
           i.rn == s.startRn ? 1 : 0 AS isStartSection,
           i.rn == s.endRn ? 1 : 0 AS isEndSection,
           i.rawData
    FROM @sections AS s
         CROSS JOIN
             @working AS i
    WHERE i.rn BETWEEN s.startRn AND s.endRn;


// Output the file
OUTPUT @output
TO "/output/output.txt"
USING Outputters.Tsv(quoting : false);

Now that each row is tagged with its section name, you can easily assign the data to different variables and optionally include the header/footer rows, e.g.:

@dailyAccount =
    SELECT rawData
    FROM @output
    WHERE sectionName == "DAILY ACCOUNT"
          AND isStartSection == 0
          AND isEndSection == 0;
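
The other two rowsets asked for in the question can be built in the same way. The following is a minimal sketch that reuses the @output rowset from the script above; the output paths are hypothetical, and CSV is used because the question asks for CSV output.

```usql
// Rows between START OF LEDGER BALANCE and END OF LEDGER BALANCE,
// excluding the two marker rows themselves.
@ledgerBalance =
    SELECT rawData
    FROM @output
    WHERE sectionName == "LEDGER BALANCE"
          AND isStartSection == 0
          AND isEndSection == 0;

// Rows between START OF DAILY SUMMARY REPORT and END OF DAILY SUMMARY REPORT.
@dailySummaryReport =
    SELECT rawData
    FROM @output
    WHERE sectionName == "DAILY SUMMARY REPORT"
          AND isStartSection == 0
          AND isEndSection == 0;

// Hypothetical output locations; adjust to your own folder layout.
OUTPUT @ledgerBalance
TO "/output/ledgerBalance.csv"
USING Outputters.Csv(quoting : false);

OUTPUT @dailySummaryReport
TO "/output/dailySummaryReport.csv"
USING Outputters.Csv(quoting : false);
```

To keep the START/END marker rows in a result, drop the two isStartSection / isEndSection conditions from its WHERE clause.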

Give it a try and let me know how you get on.
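
Since the question mentions multiple input files, here is a sketch of how the same approach might be extended with a U-SQL file set. It is untested and rests on several assumptions: the files all sit under /input/ with a .txt extension, the virtual column name fileName and the output path are made up for illustration, and rows are still assumed to be numbered in file order. Section names only need to be unique within each file, because everything is partitioned by fileName.

```usql
// {fileName} is a file-set pattern; the matched part of each path
// is exposed as the virtual column fileName (a name chosen for this sketch).
@input =
    EXTRACT rawData string,
            fileName string
    FROM "/input/{fileName}.txt"
    USING Extractors.Tsv();

// Number the rows per file and extract the section name from marker rows.
@working =
    SELECT ROW_NUMBER() OVER(PARTITION BY fileName) AS rn,
           System.Text.RegularExpressions.Regex.Match(rawData, "(START OF|END OF) (?<sectionName>.+)").Groups["sectionName"].ToString() AS sectionName,
           fileName,
           rawData
    FROM @input;

// Work out the section boundaries, now per file.
@sections =
    SELECT fileName,
           sectionName,
           MIN(rn) AS startRn,
           MAX(rn) AS endRn
    FROM @working
    WHERE sectionName != ""
    GROUP BY fileName, sectionName;

// Tag every row with its file and section.
@output =
    SELECT s.fileName,
           s.sectionName,
           i.rn == s.startRn ? 1 : 0 AS isStartSection,
           i.rn == s.endRn ? 1 : 0 AS isEndSection,
           i.rawData
    FROM @sections AS s
         INNER JOIN
             @working AS i
         ON s.fileName == i.fileName
    WHERE i.rn BETWEEN s.startRn AND s.endRn;

// Hypothetical output path.
OUTPUT @output
TO "/output/output_all_files.txt"
USING Outputters.Tsv(quoting : false);
```

The per-section rowsets can then be filtered from @output exactly as before, with fileName available as an extra column if the results need to be kept apart per file.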
