How To Extract Text Between HTML Tags With Or Condition Multiple Times
Problem description
I have been researching how to extract title tags from html. I've pretty much figured out that regex and html don't mix and that grep can be used. However, the code I found here, looks like this:
awk -vRS="</title>" '/<title>/{gsub(/.*<title>|\n+/,"");print;exit}'
Now, this works to find the text between title tags only once. I would like to know how I can make it run on every line. I could do a cat file; while read line; do ...; done. However, I know that is probably not very efficient and there's a better way.

Secondly, in the file I need to keep any lines that start with the string '--'. I believe this requires adding an 'or' statement in awk so that it will match the title tags and any line starting with '--'.

The input file would look like this:
text text text <title>random text of the title 1</title> random html stuff
--time--
xyz more random text <title>random text of the title 2</title> hmtl text
--time--
some text <title>random text of the title 3</title> more text tags
--time--
text here <title>random text of the title 4</title> random text html
--time--
The desired output:
<title>random text of the title 1</title>
--time--
<title>random text of the title 2</title>
--time--
<title>random text of the title 3</title>
--time--
<title>random text of the title 4</title>
--time--
I'm not that great with awk, but I'm learning. I know there should be an option to print all, but it's the OR statement that I'm really stuck on. I am open to sed or grep if you think that's more efficient. Any help or direction is greatly appreciated.
Solution
For your given input, grep is enough:

$ grep -o '<.*>\|^--.*' ip.html
<title>random text of the title 1</title>
--time--
<title>random text of the title 2</title>
--time--
<title>random text of the title 3</title>
--time--
<title>random text of the title 4</title>
--time--

-o          extract only the matching parts
<.*>        extract from < up to the last > in the line
\|^--.*     alternate pattern: if the line starts with --, get everything from that line
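
Because `<.*>` runs from the first < to the last > on a line, it matters what else is on that line. The following is a small sketch of that behaviour, using a made-up sample line (not from the question) that carries a second tag; the variable names are only illustrative.

```shell
# Hypothetical sample line with two different tags on it (not from the question)
line='a <b>bold</b> <title>random text of the title 1</title> z'

# Greedy pattern: spans from the first "<" to the last ">" on the line
greedy=$(printf '%s\n' "$line" | grep -o '<.*>')

# Pattern anchored on the tag name: only the title element is returned
title=$(printf '%s\n' "$line" | grep -o '<title>.*</title>')

printf '%s\n%s\n' "$greedy" "$title"
```

With the question's input, where the title is the only tag on each line, both patterns give the same result. Note also that `\|` is a GNU extension to basic regular expressions; with extended syntax the same alternation can be written as grep -oE '<title>.*</title>|^--.*'.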
To restrict only to title tags:

grep -o '<title.*title>\|^--.*' ip.html
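
Since the question asked specifically about an 'or' condition in awk, here is a minimal sketch of one way to do it, not the answer's method. It assumes at most one title element per line and uses only POSIX awk features (`match`, `RSTART`, `RLENGTH`). The temporary file and the `out` variable exist only to make the example self-contained.

```shell
# Recreate a short sample in the shape of the question's input (temporary file)
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
text text text <title>random text of the title 1</title> random html stuff
--time--
xyz more random text <title>random text of the title 2</title> hmtl text
--time--
EOF

out=$(awk '
  /^--/ { print; next }                   # rule 1: keep lines starting with "--" as they are
  match($0, /<title>.*<\/title>/) {       # rule 2: otherwise look for a title element
    print substr($0, RSTART, RLENGTH)     # print only the matched part of the line
  }
' "$tmp")

printf '%s\n' "$out"
rm -f "$tmp"
```

Two separate pattern-action rules play the role of the 'or': each line is printed if either rule matches, and awk already reads the file line by line, so no shell while-read loop is needed.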