How to get web page titles when the pages are encoded differently


Question

I have a method that downloads a web page and extracts its title tag, but depending on the website, the result can contain undecoded HTML entities or be decoded with the wrong character set. Is there a bulletproof way to get a website's title when sites are encoded differently?

Some URLs I tested, with different results:

  • https://fr.wikipedia.org/wiki/Québec returns "Québec — Wikipédia". The result is good.
  • http://www.remax-quebec.com/fr/index.rmx returns "Condo, chalet ou maison &agrave; vendre avec un courtier immobilier | RE/MAX Qu&eacute;bec". The HTML entities are not decoded.
  • http://www.restomontreal.ca/ returns "Restaurants Montr�al | RestoMontreal". The accented character is decoded with the wrong character set.

The method I use:

private string GetUrlTitle(Uri uri)
{
    string title = "";

    using (HttpClient client = new HttpClient())
    {
        HttpResponseMessage response = null;

        response = client.GetAsync(uri).Result;

        if (!response.IsSuccessStatusCode)
        {
            string errorMessage = "";

            try
            {
                XmlSerializer xml = new XmlSerializer(typeof(HttpError));
                HttpError error = xml.Deserialize(response.Content.ReadAsStreamAsync().Result) as HttpError;
                errorMessage = error.Message;
            }
            catch (Exception)
            {
                errorMessage = response.ReasonPhrase;
            }

            throw new Exception(errorMessage);
        }

        var html = response.Content.ReadAsStringAsync().Result;
        title = Regex.Match(html, @"\<title\b[^>]*\>\s*(?<Title>[\s\S]*?)\</title\>", RegexOptions.IgnoreCase).Groups["Title"].Value;
    }

    if (title == string.Empty)
    {
        title = uri.ToString();
    }

    return title;
}

Recommended answer

The charset is not always present in the Content-Type header, so we must also check the page's meta tags, and if it is not there either, fall back to UTF-8 (or another default). In addition, the title may contain HTML entities, so it needs to be HTML-decoded after extraction.
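To illustrate why both steps matter, here is a minimal sketch that reproduces the two failures from the question on hard-coded data rather than on a live download. The class name `TitleEncodingDemo` and the helper `Decode` are made up for this illustration; the sample strings are taken from the results listed in the question.

```csharp
using System;
using System.Net;
using System.Text;

public static class TitleEncodingDemo
{
    // Hypothetical helper: decode raw response bytes using the named charset.
    public static string Decode(byte[] raw, string charset)
    {
        return Encoding.GetEncoding(charset).GetString(raw);
    }

    public static void Main()
    {
        // "Montréal" as an ISO-8859-1 server would send it: é is the single byte 0xE9.
        byte[] raw = Encoding.GetEncoding("iso-8859-1").GetBytes("Montr\u00E9al");

        // Wrong charset: 0xE9 followed by 'a' is not valid UTF-8,
        // so the decoder substitutes the replacement character U+FFFD.
        Console.WriteLine(Decode(raw, "utf-8"));

        // Right charset: the accented character survives.
        Console.WriteLine(Decode(raw, "iso-8859-1"));

        // Entity-encoded titles are a separate problem and need an HTML decode step.
        Console.WriteLine(WebUtility.HtmlDecode("RE/MAX Qu&eacute;bec"));
    }
}
```

Picking the right charset fixes the first kind of damage, and `WebUtility.HtmlDecode` fixes the second; neither step replaces the other.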

Results

  • https://fr.wikipedia.org/wiki/Québec returns "Québec — Wikipédia".
  • http://www.remax-quebec.com/fr/index.rmx returns "Condo, chalet ou maison à vendre avec un courtier immobilier | RE/MAX Québec".
  • http://www.restomontreal.ca/ returns "Restaurants Montréal | RestoMontreal".

The code below comes from the GitHub project Abot. I have modified it slightly.

private string GetUrlTitle(Uri uri)
{
    string title = "";

    using (HttpClient client = new HttpClient())
    {
        HttpResponseMessage response = client.GetAsync(uri).Result;

        if (!response.IsSuccessStatusCode)
        {
            throw new Exception(response.ReasonPhrase);
        }

        var contentStream = response.Content.ReadAsStreamAsync().Result;
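        // ContentType is null when the server sends no Content-Type header,
        // so use ?. to avoid a NullReferenceException before falling back to the body.
        var charset = response.Content.Headers.ContentType?.CharSet ?? GetCharsetFromBody(contentStream);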

        Encoding encoding = GetEncodingOrDefaultToUTF8(charset);
        string content = GetContent(contentStream, encoding);

        Match titleMatch = Regex.Match(content, @"\<title\b[^>]*\>\s*(?<Title>[\s\S]*?)\</title\>", RegexOptions.IgnoreCase);

        if (titleMatch.Success)
        {
            title = titleMatch.Groups["Title"].Value;

            // Decode HTML entities (e.g. &eacute;) in case the title contains any
            title = WebUtility.HtmlDecode(title).Trim();
        }
    }

    if (string.IsNullOrWhiteSpace(title))
    {
        title = uri.ToString();
    }

    return title;
}

private string GetContent(Stream contentStream, Encoding encoding)
{
    contentStream.Seek(0, SeekOrigin.Begin);

    using (StreamReader sr = new StreamReader(contentStream, encoding))
    {
        return sr.ReadToEnd();
    }
}

/// <summary>
/// Try getting the charset from the body content.
/// </summary>
/// <param name="contentStream"></param>
/// <returns></returns>
private string GetCharsetFromBody(Stream contentStream)
{
    contentStream.Seek(0, SeekOrigin.Begin);

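    // ASCII is enough to locate the <meta> charset declaration. leaveOpen keeps
    // the stream usable for the real decoding pass in GetContent.
    string body;
    using (var reader = new StreamReader(contentStream, Encoding.ASCII, false, 1024, leaveOpen: true))
    {
        body = reader.ReadToEnd();
    }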
    string charset = null;

    if (body != null)
    {
        // Regex taken from: http://stackoverflow.com/questions/3458217/how-to-use-regular-expression-to-match-the-charset-string-in-html
        Match match = Regex.Match(body, @"<meta(?!\s*(?:name|value)\s*=)(?:[^>]*?content\s*=[\s""']*)?([^>]*?)[\s""';]*charset\s*=[\s""']*([^\s""'/>]*)", RegexOptions.IgnoreCase);

        if (match.Success)
        {
            charset = string.IsNullOrWhiteSpace(match.Groups[2].Value) ? null : match.Groups[2].Value;
        }
    }

    return charset;
}

/// <summary>
/// Try parsing the charset or fallback to UTF8
/// </summary>
/// <param name="charset"></param>
/// <returns></returns>
private Encoding GetEncodingOrDefaultToUTF8(string charset)
{
    Encoding e = Encoding.UTF8;

    if (charset != null)
    {
        try
        {
            e = Encoding.GetEncoding(charset);
        }
        catch
        {
            // Unknown or unsupported charset name: keep the UTF-8 default.
        }
    }

    return e;
}
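One case the fallback above handles only indirectly is a charset label that arrives wrapped in quotes or padded with whitespace, for example `charset="utf-8"`. `Encoding.GetEncoding` rejects such a name, so the code silently falls back to UTF-8 even when the label names another encoding. The following is a sketch of a slightly more tolerant variant; `CharsetHelper` and `Resolve` are hypothetical names, not part of Abot or of the answer's code.

```csharp
using System;
using System.Text;

public static class CharsetHelper
{
    // Hypothetical helper: clean up a charset label and resolve it,
    // falling back to UTF-8 when the label is missing or unknown.
    public static Encoding Resolve(string charset)
    {
        if (string.IsNullOrWhiteSpace(charset))
        {
            return Encoding.UTF8;
        }

        // Strip surrounding whitespace and quotes, e.g. "\"iso-8859-1\"" -> "iso-8859-1".
        string name = charset.Trim().Trim('"', '\'');

        try
        {
            return Encoding.GetEncoding(name);
        }
        catch (ArgumentException)
        {
            // Thrown for names that are invalid or unsupported on this platform.
            return Encoding.UTF8;
        }
    }
}
```

It could be called in place of `GetEncodingOrDefaultToUTF8(charset)`. Note that which encodings are available depends on the runtime; on .NET Core and later, legacy code pages such as windows-1252 are only available after registering the code pages encoding provider.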
