Build regex to find and replace invalid HTML attributes


Question

The sad truth about this post is that I have poor regex skills. I recently came across some code in an old project that I seriously want to do something about. Here it is:

strDocument = strDocument.Replace("font size=""1""", "font size=0.2")
strDocument = strDocument.Replace("font size='1'", "font size=0.2")
strDocument = strDocument.Replace("font size=1", "font size=0.2")
strDocument = strDocument.Replace("font size=""2""", "font size=1.5")
strDocument = strDocument.Replace("font size='2'", "font size=1.5")
strDocument = strDocument.Replace("font size=2", "font size=1.5")
strDocument = strDocument.Replace("font size=3", "font size=2")
strDocument = strDocument.Replace("font size=""3""", "font size=2")
strDocument = strDocument.Replace("font size='3'", "font size=2")
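The nine Replace calls above differ only in how the old value is quoted, so they could probably be collapsed into a single pattern with an optional quote and a backreference. The following is a minimal sketch under that assumption, not a drop-in replacement: the names FontSizeRemap, RemapFontSizes and NewSize are hypothetical, and the size mapping is simply copied from the calls above.

```vb
Imports System
Imports System.Text.RegularExpressions

Module FontSizeRemap

    ' Old size -> new size, copied from the Replace calls above.
    Private Function NewSize(ByVal oldSize As String) As String
        Select Case oldSize
            Case "1" : Return "0.2"
            Case "2" : Return "1.5"
            Case Else : Return "2" ' the pattern only lets "3" reach this branch
        End Select
    End Function

    Public Function RemapFontSizes(ByVal html As String) As String
        ' (["']?) captures an optional opening quote; \1 requires the same
        ' character (or nothing) after the digit.
        Dim pattern As String = "font size=([""']?)([123])\1"
        Return Regex.Replace(html, pattern, New MatchEvaluator(Function(m) "font size=" & NewSize(m.Groups(2).Value)))
    End Function

    Sub Main()
        Dim html As String = "<font size=""1"">a</font><font size='2'>b</font><font size=3>c</font>"
        Console.WriteLine(RemapFontSizes(html))
        ' Prints: <font size=0.2>a</font><font size=1.5>b</font><font size=2>c</font>
    End Sub

End Module
```

Like the original calls, this also matches the first digit of a two-digit size such as font size=10; appending (?!\d) to the pattern would be one way to exclude that case.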

I'm guessing there is some easy regex pattern out there that I could use to find different ways of quoting attribute values and replace them with valid syntax. For example if somebody wrote some HTML that looks like:

<tag attribute1=value attribute2='value' />

I'd like to be able to easily clean that tag so that it ends up looking like

<tag attribute1="value" attribute2="value" />

The web application I'm working with is 10 years old and there are several thousand validation errors because of missing quotes and tons of other garbage, so if anybody could help me out that would be great!

I gave it a whirl (found some examples), and have something that will work, but would like it to be a little smarter:

Dim input As String = "<tag attribute=value attribute='value' attribute=""value"" />"
Dim test As String = "attribute=(?:(['""])(?<attribute>(?:(?!\1).)*)\1|(?<attribute>\S+))"
Dim result As String = Regex.Replace(input, test, "attribute=""$2""")

This correctly sets result to:

<tag attribute="value" attribute="value" attribute="value" />

Is there a way I could change (and simplify!) this up a bit so that I could make it look for any attribute name?

Update:

Here's what I have so far based on the comments. Perhaps it could be improved even more:

Dim input As String = "<tag border=2 style='display: none' width=""100%"" />"
Dim test As String = "\s*=\s*(?:(['""])(?<g1>(?:(?!\1).)*)\1|(?<g1>\S+))"
Dim result As String = Regex.Replace(input, test, "=""$2""")

Which produces:

<tag border="2" style="display: none" width="100%" />

Any further suggestions? Otherwise I think I answered my own question, with your help of course.
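One thing worth checking in the update pattern is the unquoted branch: \S+ runs to the next whitespace, so a value followed directly by > takes the rest of the markup with it, and any = in ordinary text is treated as an attribute. The sketch below illustrates both effects. QuoteLoose is the update pattern as written; QuoteTight is an assumed variant that stops the unquoted value at >. The module and function names are hypothetical, and ${g1} is used in place of $2 only for readability.

```vb
Imports System
Imports System.Text.RegularExpressions

Module UnquotedValueDemo

    ' The pattern from the update: an unquoted value is matched with \S+.
    Public Function QuoteLoose(ByVal html As String) As String
        Dim pattern As String = "\s*=\s*(?:(['""])(?<g1>(?:(?!\1).)*)\1|(?<g1>\S+))"
        Return Regex.Replace(html, pattern, "=""${g1}""")
    End Function

    ' Assumed variant: the unquoted value may not contain > or whitespace.
    Public Function QuoteTight(ByVal html As String) As String
        Dim pattern As String = "\s*=\s*(?:(['""])(?<g1>(?:(?!\1).)*)\1|(?<g1>[^>\s]+))"
        Return Regex.Replace(html, pattern, "=""${g1}""")
    End Function

    Sub Main()
        ' \S+ swallows everything up to the next whitespace.
        Console.WriteLine(QuoteLoose("<td width=50>cell</td>")) ' <td width="50>cell</td>"
        ' Stopping at > keeps the tag intact.
        Console.WriteLine(QuoteTight("<td width=50>cell</td>")) ' <td width="50">cell</td>
        ' Neither variant knows where tags end, so text content is rewritten too.
        Console.WriteLine(QuoteTight("<p>x = 1</p>"))           ' <p>x="1</p">
    End Sub

End Module
```

The last case suggests that the attribute pattern should only be applied to text already known to be a tag.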

Recommended answer

Here is the final product. I hope this helps somebody!

Imports System.Text.RegularExpressions

Module Module1

    Sub Main()
        Dim input As String = "<tag border=2 style='display: none' width=""100%"">Some stuff""""""in between tags==="""" that could be there</tag>" & _
            "<sometag border=2 width=""100%"" /><another that=""is"" completely=""normal"">with some content, of course</another>"

        Console.WriteLine(ConvertMarkupAttributeQuoteType(input, "'"))
        Console.ReadKey()
    End Sub

    Public Function ConvertMarkupAttributeQuoteType(ByVal html As String, ByVal quoteChar As String) As String
        Dim findTags As String = "</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>"
        Return Regex.Replace(html, findTags, New MatchEvaluator(Function(m) EvaluateTag(m, quoteChar)))
    End Function

    Private Function EvaluateTag(ByVal match As Match, ByVal quoteChar As String) As String
        Dim attributes As String = "\s*=\s*(?:(['""])(?<g1>(?:(?!\1).)*)\1|(?<g1>[^>\s]+))"
        Return Regex.Replace(match.Value, attributes, String.Format("={0}$2{0}", quoteChar))
    End Function

End Module

I kept the tag finder and the attribute-fixing regex separate from each other in case I want to change how either of them works in the future. Thanks for all your input.
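One case the code above does not appear to handle is a value that already contains the target quote character: converting title="it's" to single quotes would yield title='it's'. Keeping the two regexes separate makes this fairly easy to address in the attribute step alone. The sketch below is one possible approach, assuming an HTML entity is an acceptable substitute for the embedded quote; the two patterns are copied from the answer, while QuoteEscapingDemo, ConvertQuotesEscaping, FixTag and FixAttribute are hypothetical names.

```vb
Imports System
Imports System.Text.RegularExpressions

Module QuoteEscapingDemo

    Public Function ConvertQuotesEscaping(ByVal html As String, ByVal quoteChar As String) As String
        ' Tag finder copied unchanged from the answer above.
        Dim findTags As String = "</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>"
        Return Regex.Replace(html, findTags, New MatchEvaluator(Function(m) FixTag(m, quoteChar)))
    End Function

    Private Function FixTag(ByVal tag As Match, ByVal quoteChar As String) As String
        ' Attribute pattern copied unchanged from the answer above.
        Dim attributes As String = "\s*=\s*(?:(['""])(?<g1>(?:(?!\1).)*)\1|(?<g1>[^>\s]+))"
        Return Regex.Replace(tag.Value, attributes, New MatchEvaluator(Function(m) FixAttribute(m, quoteChar)))
    End Function

    Private Function FixAttribute(ByVal attr As Match, ByVal quoteChar As String) As String
        ' Assumption: replace an embedded target quote with its HTML entity.
        Dim entity As String = If(quoteChar = "'", "&#39;", "&quot;")
        Dim value As String = attr.Groups("g1").Value.Replace(quoteChar, entity)
        Return "=" & quoteChar & value & quoteChar
    End Function

    Sub Main()
        Dim html As String = "<a title=""it's"" href=x>it's = fine</a>"
        Console.WriteLine(ConvertQuotesEscaping(html, "'"))
        ' Prints: <a title='it&#39;s' href='x'>it's = fine</a>
    End Sub

End Module
```

The text between the tags is left alone because only spans matched by the tag finder are passed to the attribute step.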
