grep string between two other strings as delimiters

Problem description

I have to do a report on how many times a certain CSS class appears in the content of our pages (over 10k pages). The trouble is, the header and footer contain that class, so a grep returns every single page.

So, how do I grep for content?

EDIT: I am looking for whether a page has list-unstyled between <main> and </main>.

So do I use a regular expression for that grep? Or do I need to use PowerShell to have more functionality?

I have grep and PowerShell at my disposal, but I could use portable software if that is my only option.

Ideally, I would get a report (.txt or .csv) with pages and line numbers where the class shows up, but just a list of the pages themselves would suffice.

EDIT: Progress

I now have this in PowerShell

$files = Get-ChildItem -Recurse -Path w:\test\york\ -Filter *.html
foreach ($file in $files)
{
    $htmlfile = [System.IO.File]::ReadAllText($file.FullName)
    $regex = "(?m)<main([\w\W]*)</main>"
    if ($htmlfile -match $regex) {
        $middle = $matches[1]
        [regex]::Matches($middle, "list-unstyled")
        Write-Host $file.fullName has matches in the middle:
    }
}

Which I run with this command .\FindStr.ps1 | Export-csv C:\Tools\text.csv

It outputs the filename and path with the string in the console, but does not add anything to the CSV. How can I get that added in?

Solution

What Ansgar Wiechers' answer says is good advice: don't string-search html files. I don't have a problem with it myself, but it is worth noting that not all html files are the same and regex searches can produce flawed results. If tools exist that are aware of the file content structure, you should use them.

I would take a simple approach: report every html file in a given directory that has enough occurrences of the text list-unstyled. You expect there to be 2? Then if more than that show up, the class is in the content as well. I would have done a more complicated regex solution, but since you want the line numbers as well I came up with this compromise.

$pattern = "list-unstyled"
Get-ChildItem C:\temp -Recurse -Filter *.html | 
    Select-String $pattern | 
    Group-Object Path | 
    Where-Object{$_.Count -gt 2} | 
    ForEach-Object{
        $props = @{
            File = $_.Group | Select-Object -First 1 -ExpandProperty Path
            PatternFound = ($_.Group | Select-Object -ExpandProperty LineNumber) -join ";"
        }

        New-Object -TypeName PSCustomObject -Property $props
    }

Select-String is a grep-like tool that can search files for a string. It reports the line number of each match in the file, which is why we are using it here.
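To illustrate what Select-String hands back, here is a minimal sketch. The sample file name and its contents are made up for the demonstration; only the cmdlet and its Path, LineNumber and Line properties come from PowerShell itself.

```powershell
# Throwaway sample file (hypothetical name and content, for illustration only).
$sample = Join-Path ([System.IO.Path]::GetTempPath()) 'select-string-sample.html'
@'
<ul class="list-unstyled">
<li>one</li>
</ul>
<ol class="list-unstyled">
'@ | Set-Content -LiteralPath $sample

# -SimpleMatch treats the pattern as literal text rather than as a regex.
$hits = Select-String -LiteralPath $sample -Pattern 'list-unstyled' -SimpleMatch

# Each hit is a MatchInfo object carrying the file path, line number and line text.
$hits | Select-Object Path, LineNumber, Line
```

Because each hit is an object rather than console text, it can be grouped, filtered or exported further down the pipeline.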

You should get output that looks like this on your PowerShell console.

File                 PatternFound
----                 ------------
C:\temp\content.html 4;11;54

Here 4, 11 and 54 are the lines where the text was found. The code filters out results where the count of matching lines is less than 3, so if you expect the class once in the header and once in the footer, pages with only those two occurrences are excluded.
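If counting occurrences turns out to be too coarse, a line-based variant can restrict the search to what lies between <main> and </main> and still keep line numbers. The following is only a sketch under stated assumptions: Find-InMain is a hypothetical helper name, it needs PowerShell 3.0 or later, and it works at line granularity, so it assumes the opening and closing main tags do not share a line with header or footer markup. It is still a text search, not an HTML parser.

```powershell
# Hypothetical helper: emits one object per line containing $Pattern that lies
# between the line with the opening <main ...> tag and the line with </main>.
function Find-InMain {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [string] $Pattern = 'list-unstyled'
    )

    $inMain = $false
    $lineNumber = 0
    foreach ($line in Get-Content -LiteralPath $Path) {
        $lineNumber++
        if ($line -match '<main\b') { $inMain = $true }
        if ($inMain -and $line.Contains($Pattern)) {
            # Emitting an object (instead of Write-Host) puts it on the pipeline.
            [pscustomobject]@{
                File       = $Path
                LineNumber = $lineNumber
                Line       = $line.Trim()
            }
        }
        if ($line -match '</main>') { $inMain = $false }
    }
}

# Demo on a throwaway file (made-up content): only the list inside <main> is reported.
$sample = Join-Path ([System.IO.Path]::GetTempPath()) 'find-in-main-sample.html'
@'
<header><ul class="list-unstyled"></ul></header>
<main>
<p>intro</p>
<ul class="list-unstyled"><li>item</li></ul>
</main>
<footer><ul class="list-unstyled"></ul></footer>
'@ | Set-Content -LiteralPath $sample

$result = Find-InMain -Path $sample
$result | Format-Table -AutoSize
```

The function returns objects rather than calling Write-Host, and Write-Host output never reaches the pipeline, which is why the CSV in the question stayed empty. With objects, something like Get-ChildItem C:\temp -Recurse -Filter *.html | ForEach-Object { Find-InMain -Path $_.FullName } | Export-Csv C:\Tools\text.csv -NoTypeInformation should produce one row per match.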
