Regex Notepad++ html search replace


Problem description

I am trying to batch process (search and replace) a couple hundred thousand HTML pages with regex in Notepad++. All the HTML pages have the exact same layout, and I am basically trying to copy an element (a title) into the page's title tag, which isn't currently empty.

<html>
<head>
<title>some title</title>
<lots of junk and newlines>
</head>
<body>
<lots of stuff, tags, content><span>stuff</span><div>more stuff</div>
<div id="uniqueID">
<span>The Title that should be copied into head's title tag</span>
</div>
...other stuff...</body>

I can find:

The title tag: <title>(.*?)</title>
And the span containing the REAL title: 
(\s*<div id="uniqueID">\s*)<span>(.*)</span>(\s*</div>)

But I can't seem to fit them into one expression (ignoring the junk in between) so that I can search and replace in Notepad++.

The uniqueID div is the same in every page (spaces, newlines), and there is nothing else in it other than the span with its content. The title tag is obviously present only once in every page. I just started with regular expressions and the possibilities are endless. I know regex is not ideal for parsing HTML, but for this case it should be enough. Does anyone know how to patch these two expressions together so that the in-between content is ignored?

Thank you so much!

Solution

You can use the following in Notepad++'s Replace dialog to copy the title in the span to the title tag...

  • Find what: <title>(.*)</title>(.*<div id="uniqueID">\s*<span>([A-Za-z ']*)</span>\s*</div>)
  • Replace with: <title>$3</title>$2

...if you select Regular expression and check ". matches newline" in the dialog (the label is cut off as "newlin" rather than "newline" - at least in the version of Notepad++ on the machine I am using). In the replacement, $2 and $3 are backreferences to the values captured by the second and third groups; $1, the old title, is left out so that it gets overwritten.
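To try the expression outside Notepad++ first, here is a minimal Python sketch of the same search and replace. It is an illustration, not part of the Notepad++ workflow: re.DOTALL stands in for the ". matches newline" checkbox, Python writes backreferences as \g<N> where Notepad++ writes $N, and the function name copy_span_title and the sample page are made up for the example.

```python
import re

# Same expression as in the "Find what" field above. re.DOTALL makes "."
# match newlines, like the ". matches newline" checkbox in Notepad++.
PATTERN = re.compile(
    r"""<title>(.*)</title>(.*<div id="uniqueID">\s*<span>([A-Za-z ']*)</span>\s*</div>)""",
    re.DOTALL,
)


def copy_span_title(html: str) -> str:
    """Hypothetical helper: overwrite <title> with the text of the uniqueID span."""
    # \g<3> is the span text, \g<2> is everything from after </title>
    # through the closing </div> of the uniqueID block.
    return PATTERN.sub(r"<title>\g<3></title>\g<2>", html, count=1)


# Made-up sample page following the layout from the question.
sample = """<html>
<head>
<title>some title</title>
</head>
<body>
<div id="uniqueID">
<span>The Real Title</span>
</div>
</body>
</html>"""

result = copy_span_title(sample)
print(result)
```

The span itself stays in place because group 2 is written back unchanged; only the text between the title tags is swapped.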

A less constrained pattern for the span contents (such as (.*) with ". matches newline" checked) runs the risk of matching past the intended span and grabbing spans later in the file - for example:

<html>
<head>
<title>some title</title>
<lots of junk and newlines>
</head>
<body>
<lots of stuff, tags, content><span>stuff</span><div>more stuff</div>
<div id="uniqueID">
<span>The Title that should be copied into head's title tag</span>
</div>
<div>
<span>The text that should not be copied into the head's title tag but will be</span>
</div>
...other stuff...</body>
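The difference can be checked with a small Python sketch, assuming re.DOTALL as the equivalent of ". matches newline". The shortened page and the names loose and strict are invented for the illustration; only the span part of the expression differs between the two patterns.

```python
import re

# Shortened, made-up page: the uniqueID block is followed by another
# div/span pair, as in the example above.
page = """<title>some title</title>
<div id="uniqueID">
<span>The Real Title</span>
</div>
<div>
<span>Unrelated text</span>
</div>"""

# %s is filled in with the sub-pattern used for the span contents.
TEMPLATE = r"""<title>(.*)</title>(.*<div id="uniqueID">\s*<span>(%s)</span>\s*</div>)"""

loose = re.compile(TEMPLATE % r".*", re.DOTALL)            # unconstrained span contents
strict = re.compile(TEMPLATE % r"[A-Za-z ']*", re.DOTALL)  # character class from the answer

loose_title = loose.search(page).group(3)
strict_title = strict.search(page).group(3)

print(repr(loose_title))   # runs on through the later span
print(repr(strict_title))  # stops at the first </span>
```

Because the character class cannot match "<", the constrained version has to stop at the first closing span tag, while the greedy (.*) keeps going to the last "</span> ... </div>" it can find.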

If the titles to copy from the spans contain characters other than uppercase and lowercase letters, spaces, and apostrophes (note that the class as written does not match digits), then you can add to the character class [A-Za-z '] as needed (e.g. [A-Za-z0-9 '_] to include digits and underscores). Just watch out for the HTML markup characters themselves - e.g. < and >.
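As a rough illustration of extending the class, the Python sketch below matches only the uniqueID block and compares the original class with an extended one. The helper span_title_pattern and the sample title are hypothetical; the point is only that a title containing a digit or underscore is missed until those characters are added.

```python
import re


def span_title_pattern(allowed: str) -> re.Pattern:
    """Hypothetical helper: build the uniqueID-span pattern for a given character class body."""
    return re.compile(
        r"""<div id="uniqueID">\s*<span>([%s]*)</span>\s*</div>""" % allowed
    )


# Made-up block whose title contains digits and an underscore.
block = '<div id="uniqueID">\n<span>Top 10 tips_v2</span>\n</div>'

original = span_title_pattern(r"A-Za-z '")      # class from the answer
extended = span_title_pattern(r"A-Za-z0-9 '_")  # digits and underscore added

original_match = original.search(block)
extended_match = extended.search(block)

print(original_match)            # no match: "1" is not in the class
print(extended_match.group(1))   # the full title is captured
```

Leaving "<" and ">" out of the class is what keeps the match from spilling into neighbouring tags, so those two characters are the ones not to add.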
