How To Extract Text Between HTML Tags With Or Condition Multiple Times


Problem Description


      I have been researching how to extract title tags from HTML. I've pretty much figured out that regex and HTML don't mix and that grep can be used. However, the code I found here looks like this:

      awk -vRS="</title>" '/<title>/{gsub(/.*<title>|\n+/,"");print;exit}'
      

      Now, this works to find the text between title tags only once. I would like to know how I can make it run on every line. I could do a cat file; while read line; do ...; done. However, I know that is probably not very efficient and there's a better way.

      Secondly, in the file I need to keep any lines that start with string '--'. I believe this requires adding an 'or' statement in awk so that it will match the title tags and any line starting with '--'

      The input file would look like this:

      text text text <title>random text of the title 1</title> random html stuff
      --time--
      xyz more random text <title>random text of the title 2</title> hmtl text
      --time--
      some text <title>random text of the title 3</title> more text tags
      --time--
      text here <title>random text of the title 4</title> random text html
      --time--
      

      The desired output:

      <title>random text of the title 1</title>
      --time--
      <title>random text of the title 2</title>
      --time--
      <title>random text of the title 3</title>
      --time--
      <title>random text of the title 4</title>
      --time--
      

      I'm not that great with awk, but I'm learning. I know there should be an option to print all, but it's the OR statement that I'm really stuck on. I am open to sed or grep if you think that's more efficient. Any help or direction is greatly appreciated.

      Solution

      For your given input, grep is enough

      $ grep -o '<.*>\|^--.*' ip.html 
      <title>random text of the title 1</title>
      --time--
      <title>random text of the title 2</title>
      --time--
      <title>random text of the title 3</title>
      --time--
      <title>random text of the title 4</title>
      --time--
      

      • -o prints only the matching part of each line
      • <.*> matches from the first < up to the last > in the line
      • \|^--.* is the alternate pattern: if the line starts with --, match the whole line

      To restrict the match to title tags only:

      grep -o '<title.*title>\|^--.*' ip.html
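Since the question asked how to express the "or" in awk, here is a minimal sketch of one way it could be done. It is not part of the original answer: it uses two pattern-action rules in place of an explicit OR, relies on the standard awk `match` function with `RSTART` and `RLENGTH`, and recreates a shortened copy of the question's input in a temporary file (the variable names `tmp` and `result` are assumptions for illustration).

```shell
# Recreate part of the sample input from the question in a temporary file
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
text text text <title>random text of the title 1</title> random html stuff
--time--
xyz more random text <title>random text of the title 2</title> hmtl text
--time--
EOF

# Rule 1: lines starting with "--" are printed whole.
# Rule 2: on other lines, print only the span matched by the title regex.
result=$(awk '
  /^--/ { print; next }
  match($0, /<title>.*<\/title>/) { print substr($0, RSTART, RLENGTH) }
' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

Each rule is tried on every input line, so the file is processed in a single pass with no shell loop. Like the grep version, this assumes at most one title element per line.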
      
