Get "Title" attribute from html link using Regex
Problem description
I have the following regex to match all link tags on a page generated by our custom CMS:
<a\s+((?:(?:\w+\s*=\s*)(?:\w+|"[^"]*"|'[^']*'))*?\s*href\s*=\s*(?<url>\w+|"[^"]*"|'[^']*')(?:(?:\s+\w+\s*=\s*)(?:\w+|"[^"]*"|'[^']*'))*?)>.+?</a>
We are using C# to loop through all matches of this and add an onclick event to each link (for tracking software) before rendering the page content. I need to parse the link and add a parameter to the onclick function, which is the "link name".
I was going to modify the regex to get the following subgroups:
- The title attribute of the link
- If the link contains an image tag get the alt text of the image
- The text of the link
I can then check the match of each subgroup to acquire the relevant name of the link.
How would I modify the above regex to do this, or could I achieve the same thing using C# code?
Regexes are fundamentally bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). What you need is an HTML parser. See "Can you provide an example of parsing HTML with your favorite parser?" for examples using a variety of parsers.
In particular you may be interested in the HTMLAgilityPack answer.
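As a rough illustration of the parser-based approach, here is a minimal sketch using HtmlAgilityPack (installed from NuGet). It picks the link name in the order the question lists: the `title` attribute, then the `alt` text of a nested image, then the link text. The class and method names (`LinkTracker`, `GetLinkName`, `AddTracking`) and the JavaScript function `trackLink` are made up for this example and stand in for whatever your tracking software expects; the escaping of the name is deliberately minimal and may need hardening for real content.

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class LinkTracker
{
    // Hypothetical helper: returns the title attribute if present,
    // otherwise the alt text of the first nested <img>, otherwise the link text.
    public static string GetLinkName(HtmlNode link)
    {
        string title = link.GetAttributeValue("title", "").Trim();
        if (title.Length > 0) return title;

        HtmlNode img = link.SelectSingleNode(".//img[@alt]");
        if (img != null)
        {
            string alt = img.GetAttributeValue("alt", "").Trim();
            if (alt.Length > 0) return alt;
        }

        // InnerText may contain entities such as &amp;, so decode them.
        return HtmlEntity.DeEntitize(link.InnerText).Trim();
    }

    // Hypothetical helper: adds an onclick handler to every <a href=...>
    // and returns the modified markup.
    public static string AddTracking(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null) return html;

        foreach (HtmlNode link in links)
        {
            // Minimal escaping so the name can sit inside a single-quoted JS string.
            string name = GetLinkName(link).Replace("\\", "\\\\").Replace("'", "\\'");
            // trackLink is an assumed function name, not part of any library.
            link.SetAttributeValue("onclick", "trackLink('" + name + "');");
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Calling `LinkTracker.AddTracking(pageHtml)` before rendering would replace the regex loop described in the question. Because the parser handles attribute order, quoting styles, and nested tags, the name lookup stays a few lines of ordinary C# instead of extra capture groups.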