Sed remove tags from html file

Views: 32 | Published: 2021/12/3 13:42:27 | Tags: html regex linux bash


Problem description

I need to remove all tags from an HTML file in a bash script, using the sed command.
I tried this:
sed -r 's/[<][/]?[a-zA-Z0-9="-#.& ]+[/]?[>]//g' $1
and this:
sed -r 's/[<][/]?[.]*[/]?[\]?[>]//g' $1
but I am still missing something. Any suggestions?
Solution
You can use one of the many HTML-to-text converters, use a Perl regex with a non-greedy match such as <.+?> if that is an option, or, if it must be sed, use <[^>]*>:
sed -e 's/<[^>]*>//g' file.html
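
As a small illustration of the command above, the sketch below wraps it in a shell function and runs it on one sample line. The function name strip_tags and the sample input are assumptions made for this example; they are not part of the original answer.

```shell
#!/bin/sh
# strip_tags: hypothetical helper name. Reads HTML from stdin, or from the
# files given as arguments, and deletes every <...> run that starts and
# ends on the same line.
strip_tags() {
    sed -e 's/<[^>]*>//g' "$@"
}

# Sample input, made up for this example.
printf '%s\n' '<p>Hello <b>world</b>!</p>' | strip_tags
# prints: Hello world!
```

Inside a script, the function could be called as strip_tags "$1"; quoting the argument keeps file names that contain spaces intact.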
If there's no room for errors, use an HTML parser instead.
E.g. when an element is spread over two lines
<div
>Lorem ipsum</div>
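this regular expression will not work, because sed applies the expression to one line at a time and the opening tag is never complete on a single line.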
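
The two-line case can be reproduced in a shell. The sketch below first shows the line-by-line result, then one possible workaround that joins all input lines in sed's pattern space before substituting. The function name strip_tags_multiline is an assumption for this example. The workaround holds the whole input in memory and is still a regex, so it does not cover cases such as a > inside an attribute value; an HTML parser remains the safer choice.

```shell
#!/bin/sh
# strip_tags_multiline: hypothetical helper name. Appends every following
# line to the pattern space (N) until the last line is reached, then runs
# the substitution once, so a tag may contain a line break.
strip_tags_multiline() {
    sed -e ':a' -e '$!{' -e 'N' -e 'ba' -e '}' -e 's/<[^>]*>//g' "$@"
}

# Line by line: "<div" has no closing >, so only </div> is removed.
printf '<div\n>Lorem ipsum</div>\n' | sed -e 's/<[^>]*>//g'
# prints two lines: "<div" and ">Lorem ipsum"

# Whole input at once: [^>]* also matches the embedded newline.
printf '<div\n>Lorem ipsum</div>\n' | strip_tags_multiline
# prints: Lorem ipsum
```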



This regular expression consists of three parts: <, [^>]* and >.

- < matches the opening angle bracket.
- [^>]* matches zero or more characters (the *) that are not the closing >. [...] is a character class; when it starts with ^, it matches any character that is not in the class.
- > finally matches the closing angle bracket.

The simpler regular expression <.*> will not work, because * is greedy: it searches for the longest possible match, which extends to the last closing > in an input line. For example, when there is more than one tag in an input line,
<name>Olaf</name> answers questions.
will result in 

  answers questions.
instead of 

  Olaf answers questions.
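
The difference can be checked directly in a shell. This sketch runs both expressions on the sample line from the text; the variable name line is only used for this example.

```shell
#!/bin/sh
line='<name>Olaf</name> answers questions.'

# Greedy: .* runs to the last > on the line, so "Olaf" is deleted as well.
printf '%s\n' "$line" | sed -e 's/<.*>//g'
# prints: " answers questions." (with a leading space)

# Negated class: each match stops at the first >, so only the tags go.
printf '%s\n' "$line" | sed -e 's/<[^>]*>//g'
# prints: Olaf answers questions.
```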
See also "Repetition with Star and Plus", especially the section "Watch Out for The Greediness!" and the sections that follow it, for a detailed explanation.
