Match any character (including newlines) in sed
Question
I have a sed command that I want to run on a huge, terrible, ugly HTML file that was created from a Microsoft Word document. All it should do is remove any instance of the string:
style='text-align:center; color:blue;
exampleStyle:exampleValue'
The sed command that I am trying to modify is:
sed "s/ style='[^']*'//" fileA > fileB
It works great, except that it fails to match whenever there is a newline inside the text to be matched. Is there a modifier for sed, or something else I can do, to force matching of any character, including newlines?
I understand that regexps are terrible at XML and HTML, blah blah blah, but in this case the string patterns are well-formed, in that the style attributes always start with a single quote and end with a single quote. So if I could just solve the newline problem, I could cut down the size of the HTML by over 50% with that one command.
In the end, it turned out that Sinan Ünür's Perl script worked best. It was almost instantaneous, and it reduced the file size from 2.3 MB to 850k. Good ol' Perl...
Recommended answer
sed goes over the input file line by line, which means that, as I understand it, what you want is not possible in sed.
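The line-by-line behavior can be seen in a small sketch. The sample input below is made up for illustration: the command from the question strips the attribute that fits on one line, but leaves the one that spans two lines untouched.

```shell
# Hypothetical sample input: one style attribute on a single line,
# one split across two lines.
input=$(printf '%s\n' \
  "<p style='color:blue'>one</p>" \
  "<p style='text-align:center;" \
  "exampleStyle:exampleValue'>two</p>")

# sed applies the substitution to each line separately, so the pattern
# never sees both quotes of the second attribute at the same time.
result=$(printf '%s\n' "$input" | sed "s/ style='[^']*'//")
printf '%s\n' "$result"
```

Only the first `<p>` comes out without its style attribute; the second is printed unchanged.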
You could use the following Perl script (untested), though:
#!/usr/bin/perl
use strict;
use warnings;
{
    local $/;                       # undefine the record separator: slurp mode
    my $html = <>;                  # read the whole input as one string
    $html =~ s/ style='[^']*'//g;   # [^']* also matches newlines
    print $html;
}
__END__
The one-liner would be:
$ perl -e 'local $/; $_ = <>; s/ style=\047[^\047]*\047//g; print' fileA > fileB