Partial page source from HttpWebResponse

Question

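I am new to this, so please forgive any ignorance.

I have created my first multi-threaded application. Its purpose is to make a large number of web requests, parse each page source, and store the results in tables for further querying. In theory there could be as many as 30,000 to 40,000 requests, hence the need for multi-threading. Each request gets a thread. I think everything is working, except that I very often get only a small part of the page source. It is almost as if the StreamReader is being interrupted while reading the response. When I make the same request in a browser, I get the entire page. I think it may have something to do with threading, although I am still making the calls synchronously. (Ideally I would like to make the calls asynchronously, but I am not sure how to do that.) Is there a way to know whether the page source is complete, so that I can decide whether to request it again? I am sure there are complexities here that I am missing. Any help with any of the code would be greatly appreciated.

Sorry about the formatting. Below is part of the code for the class that makes the requests: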

using System;
using System.Collections.Generic;
using System.Text;
using System.Data.Sql;
using System.Data.SqlClient;
using System.Threading;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;  

namespace M4EverCrawler
{
    public class DomainRun
    {
        public void Start()
        {
            new Thread(new ThreadStart(this.Run1)).Start();  

            new Thread(new ThreadStart(this.Run2)).Start();

            new Thread(new ThreadStart(this.Run3)).Start();
        }


        public DomainRun(DNQueueManager dnq, ProxyQueueManager prxQ)
        {
            dnqManager = dnq;
            ProxyManager = prxQ;  
        }

        private DNQueueManager dnqManager;
        private ProxyQueueManager ProxyManager;
        public StagingQueue StagingQueue = new StagingQueue();
        public MetricsQueueManager MQmanager = new MetricsQueueManager();
        public CommitQueueManager CQmanager = new CommitQueueManager();


        protected void Run1()
        {
            dnqManager.LoadDNs();
            ProxyManager.LoadProxies();

            while (true)
            {
                if (dnqManager.IsDNDavailable)
                {
                    DomainData dnd = dnqManager.GetDND();
                    dnd.PageSource = CapturePage(dnd.DomainName);
                    StagingQueue.AddDN2Q(dnd);
                }
                Thread.Sleep(new Random().Next(20));
            }
        }


        protected void Run2()
        {
            while (true)
            {
                if (StagingQueue.IsDNDavailable)
                {
                    DomainData dnd = StagingQueue.GetDND();

                    MaxOutboundLinks = 3;
                    AvoidHttps = true;
                    InsideLinks = false;
                    VerifyBackLinks = true;

                    MQmanager.AddDN2Q(ParsePage(dnd));

                    foreach (string link in dnd.Hlinks)
                    {
                        DomainData dndLink = new DomainData(dnd.MainSeqno,link.ToString());
                        dndLink.ParentDomainName = dnd.DomainName;
                        dnd.PageSource = String.Empty;
                        MQmanager.AddDN2Q(dndLink);
                    }                    
                }
                Thread.Sleep(new Random().Next(20));
            }
        }


        protected void Run3()
        {
            while (true)
            {
                if (MQmanager.IsDNDavailable)
                {
                    DomainData dnd = MQmanager.GetDND();
                    RunAlexa(dnd);
                    RunCompete(dnd);
                    RunQuantcast(dnd);

                    CQmanager.AddDN2Q(dnd, MQmanager, 1000);
                }
                Thread.Sleep(new Random().Next(20));
            }
        }


        private string CapturePage(string URIstring)
        {
            Uri myUri;
            try
            {
                myUri = new Uri(URIstring);
            }
            catch (Exception URIex)
            {
                return String.Empty;
            }

            string proxyIP = ProxyManager.GetCurrentProxy() == "" ? ProxyManager.GetProxy() : ProxyManager.GetCurrentProxy();
            int proxCtr = 0;

            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(myUri);
            WebProxy Proxy = new WebProxy(proxyIP);
            request.Proxy = Proxy;
            request.Timeout = 20000;

            try
            {
                using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
                {
                    using (StreamReader strmRdr = new StreamReader(response.GetResponseStream(), Encoding.ASCII))
                    {
                        return strmRdr.ReadToEnd();
                    }
                }
            }
            catch (InvalidOperationException Wex)
            {
                . . .
            }
        }

Answer

You are using a StreamReader with an ASCII encoding. If the data sent by the server is not valid ASCII, the StreamReader will not decode it correctly into the string.

Note that the server may declare the page encoding explicitly, either in the response headers or in a META tag in the page content itself.

The following page shows you how to download data using the correct encodings: http://blogs.msdn.com/feroze_daud/archive/2004/03/30/104440.aspx
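
As a rough illustration of decoding with the declared encoding rather than ASCII, here is a minimal sketch. `PageDecoder`, `ResolveEncoding`, and `Decode` are hypothetical names, not part of the original code, and the UTF-8 fallback is an assumption rather than something the linked post prescribes. The charset string is meant to be the value the server declared, for example `HttpWebResponse.CharacterSet`.

```csharp
using System;
using System.Text;

public static class PageDecoder
{
    // Hypothetical helper: map a declared charset name to an Encoding.
    // Falls back to UTF-8 (an assumption) when the name is missing or unknown.
    public static Encoding ResolveEncoding(string charset)
    {
        if (string.IsNullOrWhiteSpace(charset))
            return Encoding.UTF8;

        try
        {
            // Some servers wrap the charset value in quotes; strip them.
            return Encoding.GetEncoding(charset.Trim().Trim('"'));
        }
        catch (ArgumentException)
        {
            // Unknown or unsupported charset name.
            return Encoding.UTF8;
        }
    }

    // Decode the raw response body using the declared charset.
    public static string Decode(byte[] body, string charset)
    {
        return ResolveEncoding(charset).GetString(body);
    }
}
```

In `CapturePage`, this would mean reading the response stream into a byte array first and then calling something like `PageDecoder.Decode(bytes, response.CharacterSet)`. A charset declared only in a META tag is not handled by this sketch.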

It is also possible that you are not getting the full entity body from the server. This could be due to a bad proxy or something else.

You may want to add more diagnostics to your app. Log the number of bytes downloaded and the proxy used. Then compute Encoding.ASCII.GetBytes(string).Length and check that it equals the number of bytes downloaded. If it does not, you have a problem with page encodings. If it does, then you have a bad proxy on the path.
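
One way to collect the byte count suggested above, and to get a hint about whether the body is complete, is to read the raw bytes and compare the count with the length the server reported. This is only a sketch: `DownloadDiagnostics`, `ReadAllBytes`, and `LooksComplete` are hypothetical names. It assumes the expected length comes from `HttpWebResponse.ContentLength`, which is -1 when the server sends no Content-Length header (for example with chunked transfer), and that automatic decompression is not enabled on the request.

```csharp
using System;
using System.IO;

public static class DownloadDiagnostics
{
    // Read the whole stream into memory so the byte count can be logged
    // before any text decoding happens.
    public static byte[] ReadAllBytes(Stream stream)
    {
        using (var buffer = new MemoryStream())
        {
            stream.CopyTo(buffer);
            return buffer.ToArray();
        }
    }

    // expectedLength: the length reported by the server, or a negative
    // value if none was reported. With no reported length the check
    // cannot detect truncation, so it returns true.
    public static bool LooksComplete(long bytesRead, long expectedLength)
    {
        return expectedLength < 0 || bytesRead == expectedLength;
    }
}
```

Inside `CapturePage`, the bytes from `ReadAllBytes(response.GetResponseStream())` could be logged together with `proxyIP`, and a false result from `LooksComplete(bytes.Length, response.ContentLength)` could be used as a signal to retry, possibly through a different proxy.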

Hope this helps.
