What is the best way to get the HTML for HTML Agility Pack to process?


Question

I can't seem to get the HTML from a few sites, although I can from many others. Here are two sites I am having issues with:

https://www.rei.com

https://www.homedepot.com

I am building an app that gets meta tag info from a URL that the user enters. Once I have the HTML, I process it using HTML Agility Pack, and that part works perfectly. The problem is getting the HTML from various websites.

I have tried various ways to get the HTML (HtmlWeb, HttpWebRequest and others), all of them setting the user agent (the same agent string as Chrome), headers, cookies, auto-redirect and gzip, in what seems like every combination. I verified all of it in Fiddler, and the headers I send look the same as the ones I see there. Still, I can't figure out why I can't get the HTML from some sites: the requests just time out, even though I can pull up the same URL in my browser just fine.

Does anyone know what is causing these URLs to not return the HTML/data? Or does anyone know of a NuGet package or framework that handles all the nuances of getting an HTML page/document, whether the website uses SSL, is gzipped, requires cookies, redirects, etc.?

Going into this project, I thought the hardest part would be processing the HTML, not getting it, so any help would be appreciated.

Update 1:

I tried, but I just can't seem to get it to work. I must be missing something easy. Here is an updated example with some of the suggested changes:

https://dotnetfiddle.net/tQyav7

I had to comment out the ServerCertificateValidationCallback on dotnetfiddle because it was throwing an error there, although it doesn't on my dev box. I also had to set the timeout to only 5 seconds; I have it at 20 on my dev box. Any help would be appreciated.

Recommended answer

This is your helper class, refactored to support most of the web responses that an HttpWebResponse can handle.

A note: never attempt this kind of setup unless you have Option Explicit and Option Strict set to On; you'll never get it right otherwise. Automatic inference is not your friend here (actually, it never is; you really need to know what objects you're dealing with).

What has been modified, and what is important to handle:

  • TLS handling: extended support for TLS 1.1, TLS 1.2 and the maximum protocol version that the current framework can handle:

System.Enum.GetValues(GetType(SecurityProtocolType)).OfType(Of SecurityProtocolType)().Max()

  • WebRequest.ServicePoint.Expect100Continue = False: you never want this kind of response unless your code is ready to comply with it. In any case, it's never necessary.

  • AutomaticDecompression is required, unless you want to handle the GZip or Deflate streams manually. Manual handling is almost never needed (only if you want to analyze the original stream before decompressing it).

  • The CookieContainer is rebuilt every time a new helper instance is created. This has not been modified, but you could store a static object and reuse the cookies with each request: some sites may set cookies when the TLS handshake is performed and then redirect to a login page. A WebRequest can be used to POST authentication parameters (except captchas), but you need to preserve the cookies, otherwise any further request won't be authenticated.

  • The response stream's ReadToEnd() call has also been left as is, but you should modify it to read into a buffer. That would let you show the download progress, for example, and cancel the operation if required.

  • Important: the UserAgent cannot be set to a recent version of any existing browser. Some websites, when they detect that a user agent supports the HSTS protocol, will activate it and wait for interaction. WebRequest knows nothing about HSTS and will time out. I set the UserAgent to Internet Explorer 11. It works fine with all sites.

  • A suggestion: this class would benefit from the async versions of the HttpWebRequest methods. You'd be able to issue a number of concurrent requests instead of waiting for each of them to complete synchronously. Only a few modifications are required to turn this class into an async version.

This class should now support most HTML pages that don't use scripts to build their content asynchronously.
As already described in the comments, a Lazy HttpClient can handle some (not all) of these pages, but it requires a completely different setup.
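
The buffered-read and async suggestions in the list above could be combined roughly as follows. This is only a sketch under stated assumptions: `StreamReadSketch` and `ReadAllTextAsync` are hypothetical names that are not part of the original class, the 8 KB buffer size is arbitrary, and the caller is assumed to know the page's text encoding. Cancellation is checked between chunks; whether an in-flight network read is interrupted depends on the underlying stream.

```vbnet
Imports System
Imports System.IO
Imports System.Text
Imports System.Threading
Imports System.Threading.Tasks

Public Module StreamReadSketch
    ' Hypothetical helper (not part of the original class): reads a stream in
    ' fixed-size chunks, reporting the running byte count after each chunk.
    Public Async Function ReadAllTextAsync(source As Stream,
                                           textEncoding As Encoding,
                                           progress As IProgress(Of Long),
                                           token As CancellationToken) As Task(Of String)
        Dim buffer(8191) As Byte
        Dim total As Long = 0

        Using collected As New MemoryStream()
            Do
                ' Stop between chunks if the caller asked to cancel.
                token.ThrowIfCancellationRequested()

                ' ReadAsync returns 0 once the end of the stream is reached.
                Dim bytesRead As Integer = Await source.ReadAsync(buffer, 0, buffer.Length, token)
                If bytesRead = 0 Then Exit Do

                collected.Write(buffer, 0, bytesRead)
                total += bytesRead
                If progress IsNot Nothing Then progress.Report(total)
            Loop

            ' Decode once at the end so that multi-byte characters split
            ' across two chunks are not corrupted.
            Return textEncoding.GetString(collected.ToArray())
        End Using
    End Function
End Module
```

Inside the helper, a call such as `Await StreamReadSketch.ReadAllTextAsync(responseStream, Encoding.UTF8, Nothing, CancellationToken.None)` could stand in for `reader.ReadToEnd()`, provided the enclosing method is itself made `Async` and the response is obtained with `GetResponseAsync()`. The refactored synchronous class follows.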

    Imports System
    Imports System.IO
    Imports System.Linq
    Imports System.Net
    Imports System.Net.Security
    Imports System.Security.Cryptography.X509Certificates
    Imports System.Text
    
    Public Class WebRequestHelper
        Private m_ResponseUri As Uri
        Private m_StatusCode As HttpStatusCode
        Private m_StatusDescription As String
        Private m_ContentSize As Long
        Private m_WebException As WebExceptionStatus
        Public Property SiteCookies As CookieContainer
        Public Property UserAgent As String = "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
        Public Property Timeout As Integer = 30000
        Public ReadOnly Property ContentSize As Long
            Get
                Return m_ContentSize
            End Get
        End Property
    
        Public ReadOnly Property ResponseUri As Uri
            Get
                Return m_ResponseUri
            End Get
        End Property
    
        Public ReadOnly Property StatusCode As Integer
            Get
                Return m_StatusCode
            End Get
        End Property
    
        Public ReadOnly Property StatusDescription As String
            Get
                Return m_StatusDescription
            End Get
        End Property
    
        Public ReadOnly Property WebException As Integer
            Get
                Return m_WebException
            End Get
        End Property
    
    
        Sub New()
            SiteCookies = New CookieContainer()
        End Sub
    
        Public Function GetSiteResponse(ByVal siteUri As Uri) As String
            Dim response As String = String.Empty
    
            ServicePointManager.DefaultConnectionLimit = 50
            Dim maxFWValue As SecurityProtocolType = System.Enum.GetValues(GetType(SecurityProtocolType)).OfType(Of SecurityProtocolType)().Max()
            ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls11 Or SecurityProtocolType.Tls12 Or maxFWValue
            ServicePointManager.ServerCertificateValidationCallback = AddressOf TlsValidationCallback
    
            Dim Http As HttpWebRequest = WebRequest.CreateHttp(siteUri.ToString)
            With Http
                .Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
                .AllowAutoRedirect = True
                .AutomaticDecompression = DecompressionMethods.GZip Or DecompressionMethods.Deflate
                .CookieContainer = Me.SiteCookies
                .Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip, deflate")
                .Headers.Add(HttpRequestHeader.AcceptLanguage, "en-US,en;q=0.7")
                .Headers.Add(HttpRequestHeader.CacheControl, "no-cache")
                .KeepAlive = True
                .MaximumAutomaticRedirections = 50
                .ServicePoint.Expect100Continue = False
                .ServicePoint.MaxIdleTime = Me.Timeout
                .Timeout = Me.Timeout
                .UserAgent = Me.UserAgent
            End With
    
            Try
                Using webResponse As HttpWebResponse = DirectCast(Http.GetResponse, HttpWebResponse)
                    Me.m_ResponseUri = webResponse.ResponseUri
                    Me.m_StatusCode = webResponse.StatusCode
                    Me.m_StatusDescription = webResponse.StatusDescription
                    Dim contentLength As String = webResponse.Headers.Get("Content-Length")
                    Me.m_ContentSize = If(String.IsNullOrEmpty(contentLength), 0, Convert.ToInt64(contentLength))
    
                    Using responseStream As Stream = webResponse.GetResponseStream()
                        If webResponse.StatusCode = HttpStatusCode.OK Then
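                            ' Encoding.Default is the system ANSI code page on .NET Framework.
                            ' For pages served in another charset, consider building the encoding
                            ' from webResponse.CharacterSet instead.
                            Dim reader As StreamReader = New StreamReader(responseStream, Encoding.Default)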
                            Me.m_ContentSize = webResponse.ContentLength
                            response = reader.ReadToEnd()
                            Me.m_ContentSize = If(Me.m_ContentSize = -1, response.Length, Me.m_ContentSize)
                        End If
                    End Using
                End Using
            Catch exW As WebException
                If exW.Response IsNot Nothing Then
                    Me.m_StatusCode = CType(exW.Response, HttpWebResponse).StatusCode
                End If
                Me.m_StatusDescription = "WebException: " & exW.Message
                Me.m_WebException = exW.Status
            End Try
            Return response
        End Function
    
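        ' Accepts certificates with no policy errors, and also chains whose only
        ' problem is an untrusted root. Any other chain error rejects the connection.
        Private Function TlsValidationCallback(sender As Object, CACert As X509Certificate, CAChain As X509Chain, SslPolicyErrors As SslPolicyErrors) As Boolean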
            If SslPolicyErrors = SslPolicyErrors.None Then Return True
            Dim Certificate As New X509Certificate2(CACert)
    
            CAChain.Build(Certificate)
            For Each CACStatus As X509ChainStatus In CAChain.ChainStatus
                If (CACStatus.Status <> X509ChainStatusFlags.NoError) And
                    (CACStatus.Status <> X509ChainStatusFlags.UntrustedRoot) Then
                    Return False
                End If
            Next
            Return True
        End Function
    
    End Class
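
To connect the class back to the goal in the question (reading meta tag info), here is a minimal sketch of how the returned HTML might be handed to HTML Agility Pack. It assumes the HtmlAgilityPack NuGet package is referenced; `MetaTagSketch` and `ExtractMetaTags` are hypothetical names introduced only for illustration, and keying the result by the `name` or `property` attribute is an assumption about which meta tags are of interest.

```vbnet
Imports System
Imports System.Collections.Generic
Imports HtmlAgilityPack

Public Module MetaTagSketch
    ' Hypothetical helper: returns the content of every <meta> element that has
    ' a name or property attribute, keyed by that attribute's value.
    Public Function ExtractMetaTags(html As String) As Dictionary(Of String, String)
        Dim result As New Dictionary(Of String, String)(StringComparer.OrdinalIgnoreCase)
        If String.IsNullOrEmpty(html) Then Return result

        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' SelectNodes returns Nothing when the XPath expression matches no nodes.
        Dim metaNodes As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//meta[@name or @property]")
        If metaNodes Is Nothing Then Return result

        For Each node As HtmlNode In metaNodes
            ' Prefer the name attribute and fall back to property (e.g. og:title).
            Dim metaName As String = node.GetAttributeValue("name", node.GetAttributeValue("property", String.Empty))
            If metaName.Length = 0 Then Continue For

            ' If the same key appears twice, the later element wins.
            result(metaName) = node.GetAttributeValue("content", String.Empty)
        Next

        Return result
    End Function
End Module
```

With the class above, a call could look like `MetaTagSketch.ExtractMetaTags(New WebRequestHelper().GetSiteResponse(New Uri("https://www.rei.com")))`. `GetSiteResponse` returns an empty string when the request fails, in which case the sketch returns an empty dictionary and the helper's `StatusCode` and `StatusDescription` properties can be inspected for the reason.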
    
