How To Extract Text Between HTML Tags With Or Condition Multiple Times

Views: 128 | Published: 2018/5/28 19:41:14 | Tags: linux, bash, awk, sed, grep

Problem description

I have been researching how to extract title tags from HTML. I've pretty much figured out that regex and HTML don't mix and that grep can be used. However, the code I found here looks like this:

awk -vRS="</title>" '/<title>/{gsub(/.*<title>|\n+/,"");print;exit}'

Now, this finds the text between title tags only once. I would like to know how I can make it run on every line. I could do a cat file; while read line; do ...; done. However, I know that is probably not very efficient and there's a better way.

Secondly, in the file I need to keep any lines that start with the string '--'. I believe this requires adding an 'or' statement in awk so that it will match the title tags and any line starting with '--'.

The input file would look like this:

text text text <title>random text of the title 1</title> random html stuff
--time--
xyz more random text <title>random text of the title 2</title> hmtl text
--time--
some text <title>random text of the title 3</title> more text tags
--time--
text here <title>random text of the title 4</title> random text html
--time--

The desired output:

<title>random text of the title 1</title>
--time--
<title>random text of the title 2</title>
--time--
<title>random text of the title 3</title>
--time--
<title>random text of the title 4</title>
--time--

I'm not that great with awk, but I'm learning. I know there should be an option to print all, but it's the OR statement that I'm really stuck on. I am open to sed or grep if you think that's more efficient. Any help or direction is greatly appreciated.

Solution

For your given input, grep is enough:

$ grep -o '<.*>\|^--.*' ip.html
<title>random text of the title 1</title>
--time--
<title>random text of the title 2</title>
--time--
<title>random text of the title 3</title>
--time--
<title>random text of the title 4</title>
--time--

-o extracts only the matching parts of each line

<.*> matches from the first < up to the last > in the line

\|^--.* is the alternate pattern: if the line starts with --, take everything on that line
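
The "up to the last >" behaviour is worth seeing on a line that holds more than one tag. The sketch below uses a made-up one-line sample (not taken from the question's input) and compares the greedy pattern with a tighter one that cannot run past the title element. It assumes a grep that supports -o, such as GNU or BSD grep.

```shell
# Hypothetical sample line with two different tags on the same line.
line='a <b>bold</b> and <title>T</title> z'

# Greedy: <.*> spans from the first < to the last > on the line.
greedy=$(printf '%s\n' "$line" | grep -o '<.*>')

# Tighter: [^<]* cannot cross into another tag, so only the title element matches.
tight=$(printf '%s\n' "$line" | grep -o '<title>[^<]*</title>')

printf '%s\n' "$greedy"
printf '%s\n' "$tight"
```

For the question's input each line holds a single tag pair, so the greedy pattern is sufficient there.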

To restrict the match to title tags only:

grep -o '<title.*title>\|^--.*' ip.html
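
Since the question asked specifically about an "or" in awk, here is one possible awk equivalent, offered as a sketch rather than as part of the original answer. It assumes at most one title element per line and recreates a shortened copy of the sample input in a temporary file; the temporary directory and the shortened sample are illustrative choices.

```shell
# Recreate part of the sample input in a temporary file named like the answer's ip.html.
tmpdir=$(mktemp -d)
cat > "$tmpdir/ip.html" <<'EOF'
text text text <title>random text of the title 1</title> random html stuff
--time--
xyz more random text <title>random text of the title 2</title> hmtl text
--time--
EOF

# Two pattern-action rules give the "or": a line produces output if either rule matches.
result=$(awk '
  /^--/ { print; next }                   # keep lines starting with -- unchanged
  match($0, /<title>[^<]*<\/title>/) {    # otherwise look for a title element
    print substr($0, RSTART, RLENGTH)     # print only the matched part
  }
' "$tmpdir/ip.html")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Because awk reads the file line by line on its own, no cat file; while read loop is needed. If a line could contain several title elements, this sketch would print only the first one, whereas grep -o prints every match.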
