How to make my web crawler work with multiple encodings

Views: 104 Posted: 2019/6/21 23:42:56 Tags: C# encoding Web

Problem description

I have written a crawler, but it only works when the page is UTF-8 encoded. Could you help me make it work with UTF-8, GB2312, and other encodings?
Thanks. My code is as follows:

HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
req.Timeout = Settings.ConnectionTimeout * 1000;
HttpWebResponse response = (HttpWebResponse)req.GetResponse();
string contentType = crawler.MimeType = response.ContentType;

if (contentType != "text/html" &&
    !crawler.Downloader.AllowAllMimeTypes &&
    !crawler.Downloader.FileTypes.Contains(contentType))
    return;

byte[] buffer = ReadInstreamIntoMemory(response.GetResponseStream());

response.Close();



if (!Directory.Exists(Settings.DownloadFolder))
    Directory.CreateDirectory(Settings.DownloadFolder);

// Save the page (to the page repository).
crawler.Status = CrawlerStatusType.Save;
if (crawler.Dirty)
    crawler.StatusChanged(crawler, null);



crawler.Downloader.CrawledUrlSet.Add(url);
crawler.Downloader.CrawleHistroy.Add(new CrawleHistroyEntry() { Timestamp = DateTime.UtcNow, Url = url, Size = response.ContentLength });
lock (crawler.Downloader.TotalSizelock)
{
    crawler.Downloader.TotalSize += response.ContentLength;
}

// Extract URLs and add them to the queue.
UrlFrontierQueueManager queue = crawler.Downloader.UrlsQueueFrontier;

if (contentType == "text/html")
{

    crawler.Status = CrawlerStatusType.Parse;
    if (crawler.Dirty)
        crawler.StatusChanged(crawler, null);

    string html = Encoding.Default.GetString(buffer);

    string str = html;
    string regstr = @"[a-zA-Z0-9]+@([a-zA-Z0-9]+\.)+[a-zA-Z0-9]{2,3}";
    string mg = "";
    System.Text.RegularExpressions.Regex rg = new System.Text.RegularExpressions.Regex(regstr);


    System.Text.RegularExpressions.MatchCollection mc = rg.Matches(str);
    for (int i = 0; i < mc.Count; i++)
    {

        string xstr = mc[i].ToString();
        SQLiteConnection sqlliteconn = new SQLiteConnection(@"Data Source=" + Settings.SavePath);
        SQLiteCommand sqlcmd = new SQLiteCommand(sqlliteconn);
        sqlliteconn.Open();

        sqlcmd.CommandText = "select count(*) from rec_email where email like '%" + xstr + "%'";
        object obj = sqlcmd.ExecuteScalar();
        if (obj != null && int.Parse(obj.ToString()) > 0)
        {


        }
        else
        {
            if (!mg.Contains(mc[i].ToString()))
            {
                mg += "," + mc[i].ToString();
            }
        }
        sqlliteconn.Close();
    }

    if (mg != "")
    {
        SQLiteConnection sqlliteconn = new SQLiteConnection(@"Data Source=" + Settings.SavePath);
        SQLiteCommand sqlcmd = new SQLiteCommand(sqlliteconn);
        sqlliteconn.Open();

        sqlcmd.CommandText = "insert into rec_email(url,email) values('" + url + "','" + mg + "')";
        sqlcmd.ExecuteNonQuery();
        sqlliteconn.Close();

    }

    string baseUri = Utility.GetBaseUri(url);
    string[] links = Parser.ExtractLinks(baseUri, html);
    foreach (string link in links)
    {
        if (link.Length > 256) continue;
        if (crawler.Downloader.CrawledUrlSet.Contains(link)) continue;
        queue.Enqueue(link);
    }
}

Please help me. I am waiting online. Thanks. :rose:
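
One possible direction, offered as a rough sketch rather than a definitive fix: the code above decodes every page with `Encoding.Default`, which is a single fixed encoding regardless of what the page declares. A crawler can instead pick the encoding per page, checking in order a byte order mark, the `charset` parameter of the Content-Type header, and a `<meta>` charset declaration near the top of the HTML, then falling back to UTF-8. The class `EncodingSniffer` and its methods `Detect`, `ExtractCharset`, and `Decode` are hypothetical names introduced for illustration; they are not part of the original crawler. UTF-32 byte order marks and statistical detection are deliberately left out.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper (not from the original crawler): guesses a page's encoding.
public static class EncodingSniffer
{
    // Matches charset=... in a Content-Type header or an HTML <meta> tag.
    private static readonly Regex CharsetPattern = new Regex(
        @"charset\s*=\s*[""']?\s*([A-Za-z0-9_\-]+)", RegexOptions.IgnoreCase);

    public static Encoding Detect(byte[] buffer, string contentType)
    {
        // 1. A byte order mark is the strongest signal (UTF-32 is not handled here).
        if (buffer.Length >= 3 && buffer[0] == 0xEF && buffer[1] == 0xBB && buffer[2] == 0xBF)
            return Encoding.UTF8;
        if (buffer.Length >= 2 && buffer[0] == 0xFF && buffer[1] == 0xFE)
            return Encoding.Unicode;          // UTF-16 little endian
        if (buffer.Length >= 2 && buffer[0] == 0xFE && buffer[1] == 0xFF)
            return Encoding.BigEndianUnicode; // UTF-16 big endian

        // 2. The charset parameter of the HTTP Content-Type header, if present.
        Encoding fromHeader = TryGetEncoding(ExtractCharset(contentType));
        if (fromHeader != null) return fromHeader;

        // 3. A <meta> charset declaration; markup is ASCII-compatible in most
        //    encodings, so the first few KB are probed as ASCII.
        int probeLength = Math.Min(buffer.Length, 4096);
        string head = Encoding.ASCII.GetString(buffer, 0, probeLength);
        Encoding fromMeta = TryGetEncoding(ExtractCharset(head));
        if (fromMeta != null) return fromMeta;

        // 4. Nothing declared: fall back to UTF-8 (an assumption of this sketch).
        return Encoding.UTF8;
    }

    public static string ExtractCharset(string text)
    {
        if (string.IsNullOrEmpty(text)) return null;
        Match m = CharsetPattern.Match(text);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static string Decode(byte[] buffer, string contentType)
    {
        Encoding encoding = Detect(buffer, contentType);
        // Skip the byte order mark so it does not end up in the decoded string.
        byte[] bom = encoding.GetPreamble();
        int offset = StartsWith(buffer, bom) ? bom.Length : 0;
        return encoding.GetString(buffer, offset, buffer.Length - offset);
    }

    private static Encoding TryGetEncoding(string name)
    {
        if (string.IsNullOrEmpty(name)) return null;
        try { return Encoding.GetEncoding(name); }
        catch (ArgumentException) { return null; } // unknown or unsupported charset name
    }

    private static bool StartsWith(byte[] buffer, byte[] prefix)
    {
        if (prefix.Length == 0 || buffer.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
            if (buffer[i] != prefix[i]) return false;
        return true;
    }
}
```

In the crawler, the line `string html = Encoding.Default.GetString(buffer);` could then become `string html = EncodingSniffer.Decode(buffer, contentType);`. Two caveats apply. On .NET Core and .NET 5 or later, code page encodings such as GB2312 are only available after calling `Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)` once at startup; without it this sketch silently falls through to the next step. Also, `response.ContentType` can carry a suffix such as `; charset=gb2312`, so the exact comparisons with `"text/html"` in the code above may reject or skip pages that declare their charset in the header.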
