C# Convert Relative to Absolute Links in HTML String


Problem description


I'm mirroring some internal websites for backup purposes. Right now I basically use this C# code:

System.Net.WebClient client = new System.Net.WebClient();
byte[] dl = client.DownloadData(url);

This just downloads the HTML into a byte array, which is what I want. The problem, however, is that the links within the HTML are relative most of the time, not absolute.

I basically want to prepend the full http://domain.is to each relative link, converting it to an absolute link that points back to the original content. I'm only concerned with href= and src=. Is there a regular expression that will cover some of the basic cases?

Edit [My Attempt]:

public static string RelativeToAbsoluteURLS(string text, string absoluteUrl)
{
    if (String.IsNullOrEmpty(text))
    {
        return text;
    }

    String value = Regex.Replace(
        text, 
        "<(.*?)(src|href)=\"(?!http)(.*?)\"(.*?)>", 
        "<$1$2=\"" + absoluteUrl + "$3\"$4>", 
        RegexOptions.IgnoreCase | RegexOptions.Multiline);

    return value.Replace(absoluteUrl + "/", absoluteUrl);
}

Solution

The most robust solution would be to use the HtmlAgilityPack, as others have suggested. However, a reasonable solution using regular expressions is possible with the Regex.Replace overload that takes a MatchEvaluator delegate, as follows:

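// Requires: using System; using System.Text.RegularExpressions;
var baseUri = new Uri("http://test.com");

// Matches src="..." or href="..." where the double-quoted value starts with '/'.
var pattern = @"(?<name>src|href)=""(?<value>/[^""]*)""";
var matchEvaluator = new MatchEvaluator(
    match =>
    {
        var value = match.Groups["value"].Value;
        Uri uri;

        // Combine the relative value with the base URI.
        if (Uri.TryCreate(baseUri, value, out uri))
        {
            var name = match.Groups["name"].Value;
            return string.Format("{0}=\"{1}\"", name, uri.AbsoluteUri);
        }

        // Keep the original attribute; returning null would delete the match.
        return match.Value;
    });

// originalHtml is assumed to hold the downloaded page as a string.
var adjustedHtml = Regex.Replace(originalHtml, pattern, matchEvaluator);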

The above sample searches for attributes named src and href that contain double-quoted values starting with a forward slash. For each match, the static Uri.TryCreate method is used to determine whether the value is a valid relative URI and, if so, to combine it with the base URI into an absolute one.
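To make the resolution step concrete, here is a minimal sketch of how Uri.TryCreate combines a base address with a link value. The LinkResolver class and its Resolve method are hypothetical names used only for illustration; they are not part of the answer above.

```csharp
using System;

static class LinkResolver
{
    // Hypothetical helper: resolves a link against a base address.
    // Returns the link unchanged when the two cannot be combined.
    public static string Resolve(string baseAddress, string link)
    {
        var baseUri = new Uri(baseAddress);
        Uri resolved;

        return Uri.TryCreate(baseUri, link, out resolved)
            ? resolved.AbsoluteUri
            : link;
    }
}
```

With a base of http://test.com/docs/, a root-relative value such as /images/logo.png should resolve against the host, while a document-relative value such as page.html should resolve against the /docs/ folder. A value that is already absolute passes through with its own host.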

Note that this solution doesn't handle single-quoted attribute values, skips relative values that don't start with a forward slash, and certainly doesn't work on poorly formed HTML with unquoted values.
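Some of these limitations can be relaxed while staying with regular expressions. The following is a sketch under stated assumptions, not a definitive implementation: the LinkRewriter class, its MakeLinksAbsolute method, and both patterns are hypothetical. It accepts either quote style, allows whitespace around the equals sign, and leaves fragment-only links and values that already carry a scheme untouched. It still does not handle unquoted values or decode HTML entities, so an HTML parser remains the safer choice.

```csharp
using System;
using System.Text.RegularExpressions;

static class LinkRewriter
{
    // Assumed pattern: src/href (not data-src etc.), then a value wrapped
    // in either quote style; the closing quote must match the opening one.
    private static readonly Regex AttributePattern = new Regex(
        @"(?<![\w-])(?<name>src|href)\s*=\s*(?<quote>[""'])(?<value>.*?)\k<quote>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Matches values that already start with a scheme (http:, mailto:, data:).
    private static readonly Regex HasScheme = new Regex(
        @"^[a-z][a-z0-9+.\-]*:", RegexOptions.IgnoreCase);

    public static string MakeLinksAbsolute(string html, Uri baseUri)
    {
        if (string.IsNullOrEmpty(html))
        {
            return html;
        }

        return AttributePattern.Replace(html, match =>
        {
            var value = match.Groups["value"].Value;

            // Leave empty values, fragment-only links, and absolute URLs as-is.
            if (value.Length == 0 || value.StartsWith("#") || HasScheme.IsMatch(value))
            {
                return match.Value;
            }

            Uri resolved;
            if (!Uri.TryCreate(baseUri, value, out resolved))
            {
                return match.Value;
            }

            // Rebuild the attribute with the quote style it originally used.
            var quote = match.Groups["quote"].Value;
            return match.Groups["name"].Value + "=" + quote + resolved.AbsoluteUri + quote;
        });
    }
}
```

A call such as LinkRewriter.MakeLinksAbsolute(html, new Uri("http://test.com/docs/")) would rewrite the matching attributes in one pass. Returning match.Value for every skipped case keeps the original markup intact instead of dropping the attribute.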
