How can I set an expression on the FileSpec property of the Foreach File enumerator?


Question

I'm trying to create an SSIS package to process files from a directory that contains many years' worth of files. The files are all named numerically, so to avoid processing everything, I want to pass SSIS a minimum number and only enumerate files whose name (converted to a number) is higher than that minimum.

I've tried letting the ForEach File loop enumerate everything and then excluding files in a Script Task, but with hundreds of thousands of files this is far too slow to be usable.

The FileSpec property lets you specify a file mask that dictates which files end up in the collection, but I can't see how to write an expression that makes this work, because the mask is essentially a string match rather than a numeric comparison.

If there were an expression somewhere in the component that basically says "Should I enumerate? Yes / No", that would be perfect. I've been experimenting with the expression below, but can't find a property to apply it to.

(DT_I4)REPLACE(SUBSTRING(@[User::ActiveFilePath], FINDSTRING(@[User::ActiveFilePath], "\\", 7) + 1, 100), ".txt", "") > @[User::MinIndexId] ? "True" : "False"

Recommended answer

From investigating how the ForEach loop works in SSIS (with a view to creating my own to solve the issue), it seems, as far as I could see anyway, that it enumerates the whole file collection first, before any mask is applied. It's hard to tell exactly what's going on without seeing the underlying code for the ForEach loop, but this appears to be what it does, and it results in slow performance when dealing with over 100k files.

While @Siva's solution is fantastically detailed and definitely an improvement over my initial approach, it is essentially the same process, except that it uses an Expression Task rather than a Script Task to test the filename (which does seem to offer some improvement).

So I decided to take a totally different approach: rather than use a file-based ForEach loop, I enumerate the collection myself in a Script Task, apply my filtering logic, and then iterate over the remaining results. This is what I did:

In my Script Task, I use the DirectoryInfo.EnumerateFiles method, which I run on a background task. EnumerateFiles is not asynchronous in itself, but it returns results lazily, which makes it the better choice for large file collections: it streams entries as they are found, rather than making you wait for the entire collection to be built before applying any logic.

The code is as follows:

// Requires: using System.Collections.Generic; using System.IO; using System.Threading.Tasks;
public void Main()
{
    string sourceDir = Dts.Variables["SourceDirectory"].Value.ToString();
    int minJobId = (int)Dts.Variables["MinIndexId"].Value;

    // Paths of the files whose numeric name is greater than minJobId
    List<string> activeFiles = new List<string>();

    // EnumerateFiles yields entries lazily, so filtering starts immediately
    // instead of waiting for the whole directory listing to be built
    Task listTask = Task.Factory.StartNew(() =>
    {
        DirectoryInfo dir = new DirectoryInfo(sourceDir);
        foreach (FileInfo file in dir.EnumerateFiles("*.txt"))
        {
            // Names look like "12345.txt"; non-numeric names are skipped
            // (the earlier Convert.ToInt32 version would throw on them)
            string baseName = Path.GetFileNameWithoutExtension(file.Name);
            int jobId;
            if (int.TryParse(baseName, out jobId) && jobId > minJobId)
                activeFiles.Add(file.FullName);
        }
    });

    // Wait here for completion
    listTask.Wait();
    Dts.Variables["ActiveFilenames"].Value = activeFiles;
    Dts.TaskResult = (int)ScriptResults.Success;
}

So, I enumerate the collection, applying my logic as files are discovered and immediately adding each matching file path to my output list. Once that is complete, I assign the list to an SSIS Object variable named ActiveFilenames, which I then use as the collection for my ForEach loop.
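For anyone who wants to try the filtering logic outside SSIS first, here is a minimal sketch as a plain console program. The class name FileFilter, the method FilterByMinId, and the temp-directory demo are illustrative names of my own, not part of the package above, and the sketch assumes file names of the form <number>.txt.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class FileFilter
{
    // Hypothetical helper: returns the full paths of "<number>.txt" files
    // whose numeric name is greater than minId.
    public static List<string> FilterByMinId(string sourceDir, int minId)
    {
        var result = new List<string>();
        // Directory.EnumerateFiles yields paths lazily as the directory is read
        foreach (string path in Directory.EnumerateFiles(sourceDir, "*.txt"))
        {
            int id;
            string baseName = Path.GetFileNameWithoutExtension(path);
            // Skip names that are not numeric instead of throwing
            if (int.TryParse(baseName, out id) && id > minId)
                result.Add(path);
        }
        return result;
    }

    public static void Main()
    {
        // Build a throwaway directory with a few sample files
        string dir = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        Directory.CreateDirectory(dir);
        foreach (string name in new[] { "100.txt", "200.txt", "300.txt", "notes.txt" })
            File.WriteAllText(Path.Combine(dir, name), "");

        List<string> active = FilterByMinId(dir, 150);
        active.Sort(StringComparer.Ordinal); // enumeration order is not guaranteed
        foreach (string path in active)
            Console.WriteLine(Path.GetFileName(path));

        Directory.Delete(dir, true);
    }
}
```

With the sample files above and a minimum of 150, this should print 200.txt and 300.txt; notes.txt is skipped because its name does not parse as a number.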

I configured the ForEach loop as a ForEach From Variable Enumerator, which now iterates over a much smaller collection (a post-filtered List<string>, compared to what I can only assume was an unfiltered List<FileInfo> or something similar in SSIS's built-in ForEach File Enumerator).

So the tasks inside my loop can be dedicated to processing the data, since the collection has already been filtered before it reaches the loop. Although this doesn't look very different from either my initial package or Siva's example, in production (for this particular case, anyway) filtering the collection first and enumerating it with EnumerateFiles provides a massive boost over using the built-in ForEach File Enumerator.
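If you want to check how much of the difference comes from lazy enumeration on your own directory, a rough timing sketch like the one below may help. It only compares Directory.GetFiles with Directory.EnumerateFiles at the .NET level and does not reproduce whatever the SSIS ForEach File Enumerator does internally. The class name EnumerationTiming, the helper CountAbove, and the command-line arguments are assumptions made up for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;

public static class EnumerationTiming
{
    // Hypothetical helper: counts "<number>.txt" paths whose numeric name exceeds minId
    public static int CountAbove(IEnumerable<string> paths, int minId)
    {
        int count = 0;
        foreach (string path in paths)
        {
            int id;
            if (int.TryParse(Path.GetFileNameWithoutExtension(path), out id) && id > minId)
                count++;
        }
        return count;
    }

    public static void Main(string[] args)
    {
        // Assumed usage: EnumerationTiming.exe <directory> <minId>
        string dir = args.Length > 0 ? args[0] : Directory.GetCurrentDirectory();
        int minId = args.Length > 1 ? int.Parse(args[1]) : 0;

        // GetFiles builds the complete string[] before any filtering can start
        Stopwatch sw = Stopwatch.StartNew();
        int eager = CountAbove(Directory.GetFiles(dir, "*.txt"), minId);
        sw.Stop();
        Console.WriteLine("GetFiles:       {0} matches in {1} ms", eager, sw.ElapsedMilliseconds);

        // EnumerateFiles streams entries, so filtering overlaps with the directory read
        sw.Restart();
        int lazy = CountAbove(Directory.EnumerateFiles(dir, "*.txt"), minId);
        sw.Stop();
        Console.WriteLine("EnumerateFiles: {0} matches in {1} ms", lazy, sw.ElapsedMilliseconds);
    }
}
```

Treat the numbers as indicative only: the second pass usually benefits from file system caching, so it is worth running the program more than once and swapping the order of the two measurements before drawing conclusions.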

I'm going to continue investigating the ForEach loop container to see if I can replicate this logic in a custom component. If I get this working, I'll post a link in the comments.
