Find strings in one file in another and output certain columns

Question

I have a file that contains campaign names and IDs. The two fields are separated by a pipe (|), and the IDs within a field are separated by spaces. I want to find all rows in a second, thorn-delimited (þ) data file that contain any of those IDs, and write those rows to a separate file per campaign name. The data file is usually 4-7 GB, sometimes larger.

campaigns.txt:

CampaignName|CampaignID
FirstName|123 212 445 39
SecondName|313 939
ThirdName|219

Data ID file:

DateþIDþCode
10-22-14þ123þAbc
10-24-16þ212þPow
09-18-15þ219

So I would want 3 files created. FirstName.txt contains 2 rows. SecondName.txt contains 0 rows. ThirdName.txt contains 1 row.

I cobbled together some code from various sources and came up with this. However, I'm wondering if there's a better way than having to read through the data file multiple times. Any thoughts out there?

$campaigns = Import-Csv "campaigns.txt" -Delimiter "|"
$datafile = "5282_10-19-2016"
$encoding = [Text.Encoding]::GetEncoding('iso-8859-1')

echo "Starting.."
Get-Date -Format g

foreach ($campaign in $campaigns) {
    $campaignname = $campaign.CampaignName
    $campaignids = $campaign.CampaignID.split(" ")
    echo "Looking for $campaignname - $campaignids"
    $writer = New-Object System.IO.StreamWriter($campaignname + "_filtered.txt")
    foreach ($campaignid in $campaignids) {
        $datareader = New-Object System.IO.StreamReader($datafile, $encoding)
        while ($dataline = $datareader.ReadLine()) {
            if ($dataline -match $campaignid) {
                $data = $dataline.Split("þ")
                $writer.WriteLine('{0}|{1}|{2}|{3}|{4}|{5}|{6}|{7}', $data[0], $data[3], $data[5], $data[8], $data[12], $data[14], $data[19], $data[20])
            }
        }
    }
    $writer.Close()
}

echo "Done!"
Get-Date -Format g

Recommended answer

Process the huge data file just once.
Look up each row's campaign name in a hashtable built from campaigns.txt.
Assuming there are not many campaigns (say, fewer than 1000), keep one StreamWriter open per campaign and write each matching row to its campaign's writer.

# Same input settings as in the question.
$datafile = "5282_10-19-2016"
$encoding = [Text.Encoding]::GetEncoding('iso-8859-1')

# Map every ID to the name of the campaign it belongs to.
$campaignByID = @{}
foreach ($c in (Import-Csv 'campaigns.txt' -Delimiter '|')) {
    foreach ($id in ($c.CampaignID -split ' ')) {
        $campaignByID[$id] = $c.CampaignName
    }
}

# One StreamWriter per campaign, created on the first matching row.
$campaignWriters = @{}
$datareader = New-Object IO.StreamReader($datafile, $encoding)
while (!$datareader.EndOfStream) {
    $data = $datareader.ReadLine().Split('þ')
    # The ID is the second column of the data file.
    $campaignName = $campaignByID[$data[1]]
    if ($campaignName) {
        $writer = $campaignWriters[$campaignName]
        if (!$writer) {
            $writer = $campaignWriters[$campaignName] =
                New-Object IO.StreamWriter($campaignName + '_filtered.txt')
        }
        # Keep only the wanted columns, pipe-delimited.
        $writer.WriteLine(($data[0,3,5,8,12,14,19,20] -join '|'))
    }
}

$datareader.Close()
foreach ($writer in $campaignWriters.Values) {
    $writer.Close()
}
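
The question expects a file for every campaign, including one with 0 rows, whereas a writer that is created on the first match never produces a file for a campaign without matches. One possible variation, shown here only as a minimal sketch, opens a writer for each campaign up front. It runs against the small sample from the question, written to a temporary folder. The folder, the variable names $sep, $outDir and $campaignsFile, and the CampaignName|CampaignID header are assumptions made for this demo, and because the sample rows have only three columns the sketch writes the whole row rather than the selected column indexes.

```powershell
# Sketch: single pass over the data, with a writer opened up front for every campaign.
# Sample rows come from the question; $sep, $outDir and the temp files are demo assumptions.
$sep = [char]0xFE   # thorn, the data file's delimiter
$encoding = [Text.Encoding]::GetEncoding('iso-8859-1')
$outDir = Join-Path ([IO.Path]::GetTempPath()) ([guid]::NewGuid().ToString())
$null = New-Item -ItemType Directory -Path $outDir

$campaignsFile = Join-Path $outDir 'campaigns.txt'
$datafile = Join-Path $outDir 'data.txt'
Set-Content -Path $campaignsFile -Value @(
    'CampaignName|CampaignID'
    'FirstName|123 212 445 39'
    'SecondName|313 939'
    'ThirdName|219'
)
$lines = @(
    ('Date', 'ID', 'Code') -join $sep
    ('10-22-14', '123', 'Abc') -join $sep
    ('10-24-16', '212', 'Pow') -join $sep
    ('09-18-15', '219') -join $sep
)
[IO.File]::WriteAllLines($datafile, [string[]]$lines, $encoding)

# Build the ID lookup and open one writer per campaign, so empty campaigns still get a file.
$campaignByID = @{}
$campaignWriters = @{}
foreach ($c in (Import-Csv $campaignsFile -Delimiter '|')) {
    foreach ($id in ($c.CampaignID -split ' ')) {
        $campaignByID[$id] = $c.CampaignName
    }
    $path = Join-Path $outDir ($c.CampaignName + '_filtered.txt')
    $campaignWriters[$c.CampaignName] = New-Object IO.StreamWriter($path)
}

$datareader = New-Object IO.StreamReader($datafile, $encoding)
try {
    while (!$datareader.EndOfStream) {
        $data = $datareader.ReadLine().Split($sep)
        if ($data.Length -lt 2) { continue }   # no ID column on this line
        $campaignName = $campaignByID[$data[1]]
        if ($campaignName) {
            # The demo writes the whole row; the real script would pick column indexes.
            $campaignWriters[$campaignName].WriteLine(($data -join '|'))
        }
    }
}
finally {
    # Close everything even if a row fails to process.
    $datareader.Close()
    foreach ($writer in $campaignWriters.Values) { $writer.Close() }
}
```

Opening all writers in advance keeps one file handle per campaign open for the whole run, so it relies on the same assumption as above that the number of campaigns is modest.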

To display progress, use Write-Progress with a percentage computed as $datareader.BaseStream.Position / $datareader.BaseStream.Length * 100. Don't call it for every line of the data file, because that slows processing down. Update about once per second instead, for example by keeping a datetime variable: when a second has elapsed, display the progress and reset the variable.
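
The throttled progress update described above might look roughly like the following sketch. It reads a generated temporary file so that it can run on its own. The file contents and the names $nextUpdate, $lineCount and $progressUpdates are made up for illustration. Because StreamReader reads ahead into a buffer, BaseStream.Position moves in chunks, so the percentage is approximate.

```powershell
# Sketch: call Write-Progress at most about once per second while streaming a file.
# The temp file and its 1000 generated rows are demo assumptions.
$sep = [char]0xFE   # thorn, the data file's delimiter
$encoding = [Text.Encoding]::GetEncoding('iso-8859-1')
$datafile = [IO.Path]::GetTempFileName()
$rows = 1..1000 | ForEach-Object { ('10-22-14', $_, 'Abc') -join $sep }
[IO.File]::WriteAllLines($datafile, [string[]]$rows, $encoding)

$datareader = New-Object IO.StreamReader($datafile, $encoding)
$lineCount = 0
$progressUpdates = 0
$nextUpdate = [datetime]::Now   # the first update is due immediately
try {
    while (!$datareader.EndOfStream) {
        $null = $datareader.ReadLine()   # the real per-line processing would go here
        $lineCount++
        $now = [datetime]::Now
        if ($now -ge $nextUpdate) {
            $stream = $datareader.BaseStream
            $percent = [int]($stream.Position / $stream.Length * 100)
            Write-Progress -Activity 'Filtering data file' -Status "$percent% read" -PercentComplete $percent
            $progressUpdates++
            $nextUpdate = $now.AddSeconds(1)   # skip updates until a second has passed
        }
    }
}
finally {
    $datareader.Close()
    Write-Progress -Activity 'Filtering data file' -Completed
    Remove-Item $datafile
}
```

The only per-line cost left is a clock read and a comparison, which is small next to redrawing the progress display on every line.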
