Easiest method for removing html/xml <tags> from single-line output


Question


I have output from grep I'm trying to clean up that looks like:

<words>Http://www.path.com/words</words>

I've tried using...

sed 's/<.*>//' 

...to remove the tags, but that just destroys the entire line. I'm not sure why that's happening, since every '<' is closed with a '>' before it gets to the content.

What is the easiest way to do this?

Thanks!

Solution

Try this for your sed expression:

sed 's/<.*>\(.*\)<\/.*>/\1/'

Quick breakdown of the expression:

<.*>   - Match the opening tag
\(.*\) - Match and save the text between the tags
<\/.*> - Match the closing tag, escaping the / because / is also the delimiter of the s command
\1     - Output the first saved match (the text matched between \( and \))
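
As a rough sketch of why the attempt in the question wipes out the whole line: `.*` is greedy, so `<.*>` stretches from the first `<` to the last `>` on the line. The snippet below runs the question's expression and the expression from this answer on the sample line. The third command, using `[^>]*` with the `g` flag, is an alternative added here for comparison and is not part of the original answer; the variable names are only illustrative.

```shell
#!/bin/sh
# Sample line taken from the question.
line='<words>Http://www.path.com/words</words>'

# Question's attempt: greedy <.*> spans from the first '<' to the last '>',
# so the whole line is deleted.
greedy=$(printf '%s\n' "$line" | sed 's/<.*>//')

# This answer's expression: keep only the text captured between the two tags.
captured=$(printf '%s\n' "$line" | sed 's/<.*>\(.*\)<\/.*>/\1/')

# Alternative (an assumption, not from the answer): [^>]* cannot cross a '>',
# so each tag is matched on its own and removed; g repeats this along the line.
stripped=$(printf '%s\n' "$line" | sed 's/<[^>]*>//g')

printf 'greedy:   [%s]\n' "$greedy"
printf 'captured: [%s]\n' "$captured"
printf 'stripped: [%s]\n' "$stripped"
```

On the sample line, the first command leaves an empty string, while the other two both leave `Http://www.path.com/words`. The `[^>]*` form may be more convenient when a line holds more than one pair of tags, since it does not depend on there being exactly one opening and one closing tag.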


More about back-references

A question came up in the comments that should probably be addressed for completeness.

The \( and \) are sed's grouping markers. They save a portion of the matched text so that it can be referred to later as a back-reference (\1 for the first group, \2 for the second, and so on).

For example, if we have an input string:

This has (parens) in it. In addition we can use parenslike thisparens using back-references.

We develop an expression:

sed 's/.*(\(.*\)).*\1\(.*\)\1.*/\1 \2/'

Which gives us:

parens like this

How the heck did that work? Let's break down the expression to find out.

Expression breakdown:

sed s/ - Start of the sed substitute command.
.*     - Match any run of characters at the start (possibly none).
(      - Match a literal left parenthesis character.
\(.*\) - Match any run of characters and save it as the first group. In this case it matches everything between the opening and closing parenthesis in the input.
)      - Match a literal right parenthesis character.
.*     - Same as the first .* above.
\1     - Match the first saved back-reference. In the case of our sample this is filled in with `parens`
\(.*\) - Match any run of characters and save it as the second group.
\1     - Match the first saved back-reference again.
/      - End of the match expression. Signals the transition to the replacement expression.
\1 \2  - Print our two back-references, separated by a space.
/      - End of the replacement expression.

As we can see, the text captured between the literal parentheses ( and ) was substituted back into the match expression through \1, which is what allows it to match the later occurrences of the string parens.
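
For anyone who wants to try this, here is a minimal sketch that feeds the sample sentence through sed. The expression is written in single quotes so the shell passes the parentheses and backslashes to sed untouched, and with a single backslash before the second group, which is the form the breakdown describes. The variable names `input` and `result` are only illustrative.

```shell
#!/bin/sh
# Sample sentence from the answer.
input='This has (parens) in it. In addition we can use parenslike thisparens using back-references.'

# \1 captures the text inside the literal parentheses ("parens");
# \2 captures whatever sits between the next two occurrences of that same text.
result=$(printf '%s\n' "$input" | sed 's/.*(\(.*\)).*\1\(.*\)\1.*/\1 \2/')

printf '%s\n' "$result"
```

With a sed that supports back-references inside the pattern (GNU sed does), this prints `parens like this`.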
