How can I validate large numbers of files with search and replace?


Problem description

I am currently validating a client's HTML source and I am getting a lot of validation errors for img and input tags that do not have the Omittag (the closing slash). I would fix them manually, but this client literally has thousands of files, with a lot of instances where the tag is not closed.

This client has validated some img tags (for whatever reason).

Just wondering if there is a unix command I could run to check whether a tag lacks the Omittag and, if so, add it.

I have done simple search and replaces with the following command:

find . \! -path '*.svn*' -type f -exec sed -i -n '1h;1!H;${;g;s/<b>/<strong>/g;p}' {} \; 

But never something this large. Any help would be appreciated.

Solution

See questions I asked in comment at top.

Assuming you're using GNU sed, and that you're trying to add the trailing / to your tags to make XML-compliant <img /> and <input />, then replace the sed expression in your command with this one, and it should do the trick: '1h;1!H;${;g;s/\(img\|input\)\( [^>]*[^/]\)>/\1\2\/>/g;p;}'

Here it is on a simple test file (SO's colorizer doing wacky things):

$ cat test.html
This is an <img tag> without closing slash.
Here is an <img tag /> with closing slash.
This is an <input tag > without closing slash.
And here one <input attrib="1" 
    > that spans multiple lines.
Finally one <input
  attrib="1" /> with closing slash.

$ sed -n '1h;1!H;${;g;s/\(img\|input\)\( [^>]*[^/]\)>/\1\2\/>/g;p;}' test.html
This is an <img tag/> without closing slash.
Here is an <img tag /> with closing slash.
This is an <input tag /> without closing slash.
And here one <input attrib="1" 
    /> that spans multiple lines.
Finally one <input
  attrib="1" /> with closing slash.
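
Since the question also asks how to check for missing slashes before adding them, here is a minimal sketch of a dry run. The function name `list_unclosed` is hypothetical and not part of the original answer; it assumes GNU sed and reuses the same `find` predicates as the question. It only prints the files whose contents the expression would change and modifies nothing.

```shell
#!/bin/sh
# Hypothetical helper (not from the original answer): list the files that
# the sed expression would change, without editing anything.
# Assumes GNU sed (needed for \| in a basic regular expression).
SED_EXPR='1h;1!H;${;g;s/\(img\|input\)\( [^>]*[^/]\)>/\1\2\/>/g;p;}'

list_unclosed() {
    # $1 = directory to scan (defaults to the current directory)
    find "${1:-.}" \! -path '*.svn*' -type f |
    while IFS= read -r f; do
        # cmp -s exits non-zero when the sed output differs from the file,
        # which means at least one tag would be rewritten.
        if ! sed -n "$SED_EXPR" "$f" | cmp -s - "$f"; then
            printf '%s\n' "$f"
        fi
    done
}
```

Running `list_unclosed /path/to/site` first gives a list to review before switching to `sed -i`.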

Here's GNU sed regex syntax and how the buffering works to do multiline search/replace.
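
As a rough illustration of that buffering, the sketch below annotates each command in the expression and runs it on a two-line input. The function name `slurp_demo` and the sample input are made up for this example, and GNU sed is assumed.

```shell
# A minimal demo of the "slurp the whole file" idiom, assuming GNU sed.
# 1h   : on line 1, copy the pattern space into the hold space
# 1!H  : on every later line, append a newline plus the line to the hold space
# ${   : on the last line...
#   g  : ...copy the hold space (now the whole file) into the pattern space
#   s  : ...run a substitution that may match across newlines (\n)
#   p  : ...and print the result (required because of -n)
slurp_demo() {
    printf '<input a="1"\n>\n' | sed -n '1h;1!H;${;g;s/"\n>/" \/>/;p;}'
}
```

Calling `slurp_demo` joins the tag that was split across two lines, which a line-by-line `s` command could not do.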

Alternately you could use something like Tidy that's designed for sanitizing bad HTML -- that's what I'd do if I were doing anything more complicated than a couple of simple search/replaces. Tidy's options get complicated fast, so it's usually better to write a script in your scripting language of choice (Python, Perl) that calls libtidy and sets whatever options you need.
