Regex in C#: extracting the URL from an <a> tag
Question
I am trying to extract the URL from an <a> tag. However, instead of getting https://website.com/-id1, I am getting the link text. Here is my code:
string text="<a style=\"font - weight: bold; \" href=\"https://website.com/-id1\">MyLink</a>";
string parsed = Regex.Replace(text, " <[^>] + href =\"([^\"]+)\"[^>]*>", "$1 " );
parsed = Regex.Replace(parsed, "<[^>]+>", "");
Console.WriteLine(parsed);
The result I get is MyLink, which is not what I want. I want something like:
https://website.com/-id1
Any help or a link would be highly appreciated.
Recommended answer
Regular expressions can be used with HTML in very specific, simple cases. For example, if the text contains only a single tag, you can use "href\\s*=\\s*\"(?<url>.*?)\"" to extract the URL, e.g.:
var url=Regex.Match(text,"href\\s*=\\s*\"(?<url>.*?)\"").Groups["url"].Value;
This pattern returns:
https://website.com/-id1
This regex doesn't do anything fancy. It looks for href= with optional whitespace around the equals sign, then captures everything between the first double quote and the next one in a non-greedy manner (.*?). The captured text is stored in the named group url.
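As a minimal sketch of the pattern above in use, the following wraps the match in a helper that also checks Match.Success, since Groups["url"].Value is simply an empty string when nothing matches. The names HrefExtractor and TryExtractFirstHref are made up for this illustration, and RegexOptions.IgnoreCase is an added assumption so that HREF= is accepted too.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // Hypothetical helper: finds the first double-quoted href value in the input.
    // Returns false (and an empty url) when no such attribute is present.
    public static bool TryExtractFirstHref(string html, out string url)
    {
        // Same pattern as in the answer; IgnoreCase is an assumption added here.
        Match match = Regex.Match(html, "href\\s*=\\s*\"(?<url>.*?)\"", RegexOptions.IgnoreCase);
        url = match.Success ? match.Groups["url"].Value : string.Empty;
        return match.Success;
    }

    public static void Main()
    {
        string text = "<a style=\"font-weight: bold;\" href=\"https://website.com/-id1\">MyLink</a>";

        if (TryExtractFirstHref(text, out string url))
        {
            Console.WriteLine(url); // prints https://website.com/-id1
        }
        else
        {
            Console.WriteLine("No href attribute found.");
        }
    }
}
```

Returning a bool keeps "no href present" distinguishable from an href that is present but empty.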
Anything fancier gets very complex. For example, supporting both single and double quotes requires special handling to avoid starting on a single quote and ending on a double quote. The string could also contain multiple <a> tags that use both types of quotes.
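One possible way to handle the quoting problem, sketched here under assumptions rather than taken from the original answer, is to capture the opening quote in its own named group and require the same character to close the value with a \k<quote> backreference. Regex.Matches then walks over every tag in the string. The names HrefScanner and ExtractAllHrefs and the second URL ending in -id2 are hypothetical. This is still not an HTML parser: it would, for instance, also match attributes whose names merely end in href and href-like text inside comments or scripts.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefScanner
{
    // (?<quote>["']) remembers which quote opened the value;
    // \k<quote> demands that the very same character closes it.
    private static readonly Regex HrefPattern = new Regex(
        @"href\s*=\s*(?<quote>[""'])(?<url>.*?)\k<quote>",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: collects every quoted href value, in document order.
    public static List<string> ExtractAllHrefs(string html)
    {
        var urls = new List<string>();
        foreach (Match match in HrefPattern.Matches(html))
        {
            urls.Add(match.Groups["url"].Value);
        }
        return urls;
    }

    public static void Main()
    {
        // The second link (-id2) is invented for this example.
        string text = "<a href=\"https://website.com/-id1\">First</a> "
                    + "<a href='https://website.com/-id2'>Second</a>";

        foreach (string url in ExtractAllHrefs(text))
        {
            Console.WriteLine(url);
        }
    }
}
```

Because the closing quote must equal the opening one, a value such as href="it's" is captured whole instead of being cut off at the apostrophe.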
For complex parsing it is better to use a library such as AngleSharp or HtmlAgilityPack.