How to make my web crawler work with multiple encodings


Question



I have written a crawler, but it only works when the page is UTF-8 encoded. Could you help me make it work with UTF-8, GB2312, and other encodings?
Thanks. My code is as follows:


HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
req.Timeout = Settings.ConnectionTimeout * 1000;
HttpWebResponse response = (HttpWebResponse)req.GetResponse();
string contentType = crawler.MimeType = response.ContentType;

if (contentType != "text/html" &&
    !crawler.Downloader.AllowAllMimeTypes &&
    !crawler.Downloader.FileTypes.Contains(contentType))
    return;

byte[] buffer = ReadInstreamIntoMemory(response.GetResponseStream());

response.Close();



if (!Directory.Exists(Settings.DownloadFolder))
    Directory.CreateDirectory(Settings.DownloadFolder);

// Save the page (to the page store).
crawler.Status = CrawlerStatusType.Save;
if (crawler.Dirty)
    crawler.StatusChanged(crawler, null);



crawler.Downloader.CrawledUrlSet.Add(url);
crawler.Downloader.CrawleHistroy.Add(new CrawleHistroyEntry() { Timestamp = DateTime.UtcNow, Url = url, Size = response.ContentLength });
lock (crawler.Downloader.TotalSizelock)
{
    crawler.Downloader.TotalSize += response.ContentLength;
}

// Extract URLs and add them to the queue.
UrlFrontierQueueManager queue = crawler.Downloader.UrlsQueueFrontier;

if (contentType == "text/html")
{

    crawler.Status = CrawlerStatusType.Parse;
    if (crawler.Dirty)
        crawler.StatusChanged(crawler, null);

    string html = Encoding.Default.GetString(buffer);

    string str = html;
    string regstr = @"[a-zA-Z0-9]+@([a-zA-Z0-9]+\.)+[a-zA-Z0-9]{2,3}";
    string mg = "";
    System.Text.RegularExpressions.Regex rg = new System.Text.RegularExpressions.Regex(regstr);


    System.Text.RegularExpressions.MatchCollection mc = rg.Matches(str);
    for (int i = 0; i < mc.Count; i++)
    {

        string xstr = mc[i].ToString();
        SQLiteConnection sqlliteconn = new SQLiteConnection(@"Data Source=" + Settings.SavePath);
        SQLiteCommand sqlcmd = new SQLiteCommand(sqlliteconn);
        sqlliteconn.Open();

        sqlcmd.CommandText = "select count(*) from rec_email where email like '%"+xstr+"%'";
        object obj = sqlcmd.ExecuteScalar();
        if (obj != null&&int.Parse(obj.ToString())>0)
        {


        }
        else
        {
            if (!mg.Contains(mc[i].ToString()))
            {
                mg += "," + mc[i].ToString();
            }
        }
        sqlliteconn.Close();
    }

    if (mg != "")
    {
        SQLiteConnection sqlliteconn = new SQLiteConnection(@"Data Source=" + Settings.SavePath);
        SQLiteCommand sqlcmd = new SQLiteCommand(sqlliteconn);
        sqlliteconn.Open();

        sqlcmd.CommandText = "insert into rec_email(url,email) values('"+url+"','"+mg+"')";
        sqlcmd.ExecuteNonQuery();
        sqlliteconn.Close();

    }

    string baseUri = Utility.GetBaseUri(url);
    string[] links = Parser.ExtractLinks(baseUri, html);
    foreach (string link in links)
    {
        if (link.Length > 256) continue;
        if (crawler.Downloader.CrawledUrlSet.Contains(link)) continue;
        queue.Enqueue(link);
    }
}






Please help me. I am waiting online. Thanks.

Recommended answer

You should read the HTML as a byte array and detect which encoding it uses before decoding it to a string.
There are plenty of encoding detection methods available.

http://www.west-wind.com/Weblog/posts/197245.aspx
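
As a rough illustration of that advice (not the method from the linked post), here is a minimal sketch of one common detection order: check for a byte order mark, then a charset in the HTTP Content-Type header, then a charset declared in the first few kilobytes of the HTML, and fall back to UTF-8. The class name EncodingSniffer, the method Detect, the 4096-byte probe size, and the UTF-8 fallback are all assumptions made for this example. It also assumes a runtime where Encoding.GetEncoding("gb2312") is available, such as the .NET Framework; on newer .NET versions the code-page encodings may need to be registered first.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original crawler.
public static class EncodingSniffer
{
    // Matches charset=NAME in a Content-Type value or an HTML <meta> tag.
    private static readonly Regex CharsetPattern = new Regex(
        @"charset\s*=\s*[""']?\s*([A-Za-z0-9_\-]+)", RegexOptions.IgnoreCase);

    public static Encoding Detect(byte[] buffer, string contentTypeHeader)
    {
        // 1. A byte order mark is the most reliable signal.
        if (buffer.Length >= 3 && buffer[0] == 0xEF && buffer[1] == 0xBB && buffer[2] == 0xBF)
            return Encoding.UTF8;
        if (buffer.Length >= 2 && buffer[0] == 0xFF && buffer[1] == 0xFE)
            return Encoding.Unicode;          // UTF-16 little endian
        if (buffer.Length >= 2 && buffer[0] == 0xFE && buffer[1] == 0xFF)
            return Encoding.BigEndianUnicode; // UTF-16 big endian

        // 2. charset declared in the HTTP header, e.g. "text/html; charset=gb2312".
        Encoding fromHeader = ParseCharset(contentTypeHeader);
        if (fromHeader != null) return fromHeader;

        // 3. charset declared in the markup. Charset names are plain ASCII,
        //    so decoding the start of the page as ASCII is enough to find them.
        int probeLength = Math.Min(buffer.Length, 4096); // assumed probe size
        string head = Encoding.ASCII.GetString(buffer, 0, probeLength);
        Encoding fromMeta = ParseCharset(head);
        if (fromMeta != null) return fromMeta;

        // 4. Assumed fallback when nothing is declared.
        return Encoding.UTF8;
    }

    // Returns null when no charset is found or the name is not supported.
    private static Encoding ParseCharset(string text)
    {
        if (string.IsNullOrEmpty(text)) return null;
        Match m = CharsetPattern.Match(text);
        if (!m.Success) return null;
        try { return Encoding.GetEncoding(m.Groups[1].Value); }
        catch (ArgumentException) { return null; }
    }
}
```

In the question's code, the line that decodes the buffer could then become something like string html = EncodingSniffer.Detect(buffer, response.ContentType).GetString(buffer); in place of Encoding.Default.GetString(buffer). Pages that declare no charset at all would still need a statistical detector, which this sketch does not attempt.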

