Easiest method for removing html/xml <tags> from single-line output
I have output from grep I'm trying to clean up that looks like:
<words>Http://www.path.com/words</words>
I've tried using...
sed 's/<.*>//'
...to remove the tags, but that just destroys the entire line. I'm not sure why that's happening, since every '<' is closed with a '>' before it gets to the content.
What is the easiest way to do this?
Thanks!
Try this for your sed expression:
sed 's/<.*>\(.*\)<\/.*>/\1/'
Quick breakdown of the expression:
<.*> - Match the first tag
\(.*\) - Match and save the text between the tags
<\/.*> - Match the end tag making sure to escape the / character
\1 - Output the result of the first saved match (the text that is matched between \( and \))
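To make the difference concrete, here is a small sketch that runs both expressions on the sample line from the question. The variable names (`line`, `greedy`, `stripped`, `alt`) are made up for illustration, and the last command is a common alternative that is not part of the answer above: it assumes tags never contain a literal `>` inside them.

```shell
# Sample line from the question.
line='<words>Http://www.path.com/words</words>'

# Original attempt: .* is greedy, so <.*> stretches from the first '<'
# to the LAST '>' on the line, and the whole line is deleted.
greedy=$(printf '%s\n' "$line" | sed 's/<.*>//')
printf 'greedy:   [%s]\n' "$greedy"

# Suggested expression: capture the text between the tags and keep only that.
stripped=$(printf '%s\n' "$line" | sed 's/<.*>\(.*\)<\/.*>/\1/')
printf 'stripped: [%s]\n' "$stripped"

# Alternative (assumption, not from the answer): [^>]* cannot run past the
# first '>', so each tag is matched on its own; g removes every tag on the line.
alt=$(printf '%s\n' "$line" | sed 's/<[^>]*>//g')
printf 'alt:      [%s]\n' "$alt"
```

The capture-group version expects exactly one opening and one closing tag around the text, which fits the single-line output described in the question.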
More about back-references
A question came up in the comments that should probably be addressed for completeness.
The \( and \) are sed's grouping markers. They save a portion of the matched expression so it can be used later as a back-reference (\1, \2, and so on).
For example, if we have an input string:
This has (parens) in it. In addition we can use parenslike thisparens using back-references.
We develop an expression:
sed 's/.*(\(.*\)).*\1\(.*\)\1.*/\1 \2/'
Which gives us:
parens like this
How the heck did that work? Let's break down the expression to find out.
Expression breakdown:
sed s/ - This opens a sed substitute expression.
.* - Match any run of characters at the start (including none at all).
( - Match a literal left parenthesis character.
\(.*\) - Match any run of characters and save it for use as a back-reference. In this case it will match anything between the first open and last close parenthesis in the input.
) - Match a literal right parenthesis character.
.* - Same as above.
\1 - Match the text saved by the first back-reference. In the case of our sample this is filled in with `parens`.
\(.*\) - Same as above.
\1 - Same as above.
/ - End of the match expression. Signals transition to the output expression.
\1 \2 - Print our two back-references.
/ - End of output expression.
As we can see, the back-reference taken from between the parentheses (( and )) was substituted back into the matching expression so that it could match the string parens.
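The walkthrough can be tried directly with a short sketch like the one below. The variable names `input` and `result` are illustrative, and the expression is written inside single quotes, with one backslash before each grouping parenthesis, so that the shell passes it to sed unchanged.

```shell
# Input string from the example.
input='This has (parens) in it. In addition we can use parenslike thisparens using back-references.'

# \1 is captured between the literal ( and ), then reused inside the pattern
# to locate the two later copies of that word; \2 is the text between them.
result=$(printf '%s\n' "$input" | sed 's/.*(\(.*\)).*\1\(.*\)\1.*/\1 \2/')
printf '%s\n' "$result"
```

Because the pattern begins and ends with .*, it consumes the whole line, so only the two saved pieces appear in the output.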