Linux Text File Manipulation
Question
I have a file in the following format:
<a href="http://www.wowhead.com/?search=Superior Mana Oil">
<a href="http://www.wowhead.com/?search=Tabard of Brute Force">
<a href="http://www.wowhead.com/?search=Tabard of the Wyrmrest Accord">
<a href="http://www.wowhead.com/?search=Tattered Hexcloth Sack">
I need to select the text after the = but before the " and print it at the end of the line, adding </a>, so that it becomes, for example:
<a href="http://www.wowhead.com/?search=Superior Mana Oil">Superior Mana Oil</a>
<a href="http://www.wowhead.com/?search=Tabard of Brute Force">Tabard of Brute Force</a>
<a href="http://www.wowhead.com/?search=Tabard of the Wyrmrest Accord">Tabard of the Wyrmrest Accord</a>
<a href="http://www.wowhead.com/?search=Tattered Hexcloth Sack">Tattered Hexcloth Sack</a>
I'm not sure of the best way to do this from the Linux command line (I guess probably sed/awk, but I'm not good with them). Ideally I'd like a script I can just feed the filename to, e.g. ./fixlink.sh brokenlinks.txt
Recommended answer
Assuming you can have one or more spaces after <a, and zero or more spaces around the = signs, the following should work:
$ cat in.txt
<a href="http://www.wowhead.com/?search=Superior Mana Oil">
<a href="http://www.wowhead.com/?search=Tabard of Brute Force">
<a href="http://www.wowhead.com/?search=Tabard of the Wyrmrest Accord">
<a href="http://www.wowhead.com/?search=Tattered Hexcloth Sack">
#
# The command to do the substitution
#
$ sed -e 's#<a[ \t][ \t]*href[ \t]*=[ \t]*".*search[ \t]*=[ \t]*\([^"]*\)">#&\1</a>#' in.txt
<a href="http://www.wowhead.com/?search=Superior Mana Oil">Superior Mana Oil</a>
<a href="http://www.wowhead.com/?search=Tabard of Brute Force">Tabard of Brute Force</a>
<a href="http://www.wowhead.com/?search=Tabard of the Wyrmrest Accord">Tabard of the Wyrmrest Accord</a>
<a href="http://www.wowhead.com/?search=Tattered Hexcloth Sack">Tattered Hexcloth Sack</a>
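The question asked for a script that takes a filename, as in ./fixlink.sh brokenlinks.txt. A minimal sketch of such a wrapper around the command above might look like this. The function name fix_links is made up for illustration, and the sketch swaps [ \t] for the POSIX class [[:space:]], on the assumption that not every sed treats \t inside a bracket expression as a tab.

```shell
#!/bin/sh
# fixlink.sh: sketch of a wrapper around the sed substitution shown above.
# fix_links is a hypothetical helper name, not part of the original answer.

fix_links() {
    # With a file argument sed reads that file; with none it reads stdin.
    # [[:space:]] stands in for [ \t] (assumption: portability across seds).
    sed -e 's#<a[[:space:]][[:space:]]*href[[:space:]]*=[[:space:]]*".*search[[:space:]]*=[[:space:]]*\([^"]*\)">#&\1</a>#' "$@"
}

# Only run when at least one file name was given on the command line.
if [ "$#" -gt 0 ]; then
    fix_links "$@"
fi
```

After chmod +x fixlink.sh, running ./fixlink.sh brokenlinks.txt prints the fixed lines to standard output; redirect them to a new file of your choosing to keep the result.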
If you're sure you don't have the extra spaces, the pattern simplifies to:
s#<a href=".*search=\([^"]*\)">#&\1</a>#
In sed, s followed by any character (# in this case) starts a substitution. The pattern to be substituted runs until the second appearance of that same character. So, in our second example, the pattern to be substituted is: <a href=".*search=\([^"]*\)">. I used \([^"]*\) to mean any sequence of non-" characters, saved in backreference \1 (the \(\) pair marks the group that the backreference refers to). Finally, the next token, delimited by #, is the replacement. & in sed stands for "whatever matched", which in this case is the whole line, and \1 holds just the link text.
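To see & and \1 side by side, here is a small sketch on a made-up input line (name=Tabard is illustrative only, not taken from the file in the question). The replacement wraps each piece in square brackets so the two are easy to tell apart.

```shell
# & expands to the whole matched text, \1 to what the first \(...\) group captured.
result=$(printf '%s\n' 'name=Tabard' | sed -e 's#name=\(.*\)#[&] [\1]#')
printf '%s\n' "$result"   # prints: [name=Tabard] [Tabard]
```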
Here's the pattern again:
's#<a[ \t][ \t]*href[ \t]*=[ \t]*".*search[ \t]*=[ \t]*\([^"]*\)">#&\1</a>#'
and its explanation:
'                         quote, so that the shell does not interpret the characters
s                         substitute
#                         delimiter
<a[ \t][ \t]*             <a followed by one or more whitespace characters
href[ \t]*=[ \t]*         href followed by optional space, = followed by optional space
".*search[ \t]*=[ \t]*    " followed by as many characters as needed, followed by
                          search, optional space, =, followed by optional space
\([^"]*\)                 a sequence of non-" characters, saved in \1
">                        followed by ">
#                         delimiter, replacement pattern starts
&\1                       the matched pattern, followed by backreference \1
</a>                      end the </a> tag
#                         end delimiter
'                         end quote
If you're really sure that there will always be search= followed by the text you want, you can do:
$ sed -e 's#.*search=\(.*\)">#&\1</a>#'
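The "really sure" caveat matters because \(.*\) is greedy. The sketch below uses a made-up line with extra markup after the link (not from the original file) to compare the short pattern with the [^"]* version, which stops at the first closing quote.

```shell
# Hypothetical input: a second tag follows the link on the same line.
line='<a href="http://www.wowhead.com/?search=Superior Mana Oil"><b class="x">'

# Greedy group: .* runs on to the last "> on the line, so \1 captures too much.
greedy=$(printf '%s\n' "$line" | sed -e 's#.*search=\(.*\)">#&\1</a>#')

# Bounded group: [^"]* cannot cross a quote, so \1 is only the search text.
strict=$(printf '%s\n' "$line" | sed -e 's#.*search=\([^"]*\)">#&\1</a>#')

printf 'greedy: %s\n' "$greedy"
printf 'strict: %s\n' "$strict"
```

On lines that end right after the first ">, as in the question's file, both patterns give the same result.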
Hope that helps.