HTML parsing with VB.Net

Question

Does anybody know of a good way of parsing HTML in VB.Net?

I found a solution somewhere on the net using MSHTML.

I tried using it, and to get it to work I had to add some code to my project:

Public Enum HRESULT
    S_OK = 0
    S_FALSE = 1
    E_NOTIMPL = &H80004001
    E_INVALIDARG = &H80070057
    E_NOINTERFACE = &H80004002
    E_FAIL = &H80004005
    E_UNEXPECTED = &H8000FFFF
End Enum

<ComVisible(True), ComImport(), Guid("7FD52380-4E07-101B-AE2D-08002B2EC713"), _
    InterfaceTypeAttribute(ComInterfaceType.InterfaceIsIUnknown)> _
Public Interface IPersistStreamInit : Inherits IPersist
    Shadows Sub GetClassID(ByRef pClassID As Guid)
    <PreserveSig()> Function IsDirty() As Integer
    <PreserveSig()> Function Load(ByVal pstm As UCOMIStream) As HRESULT
    <PreserveSig()> Function Save(ByVal pstm As UCOMIStream, _
        <MarshalAs(UnmanagedType.Bool)> ByVal fClearDirty As Boolean) As HRESULT
    <PreserveSig()> Function GetSizeMax(<InAttribute(), Out(), _
    MarshalAs(UnmanagedType.U8)> ByRef pcbSize As Long) As HRESULT
    <PreserveSig()> Function InitNew() As HRESULT
End Interface

<ComVisible(True), ComImport(), Guid("0000010c-0000-0000-C000-000000000046"), _
    InterfaceTypeAttribute(ComInterfaceType.InterfaceIsIUnknown)> _
Public Interface IPersist
    Sub GetClassID(ByRef pClassID As Guid)
End Interface

Declare Function CreateStreamOnHGlobal Lib "ole32" (ByVal hGlobal As IntPtr, ByVal fDeleteOnRelease As Boolean, _
    ByRef ppstm As UCOMIStream) As Long
' Please note that i copied above IPersistStream definition from sp!ke. I owe him a drink ;). 

End Class

And now I'm getting obsolete warnings:

Warning 1   'System.Runtime.InteropServices.UCOMIStream' is obsolete: 'Use System.Runtime.InteropServices.ComTypes.IStream instead. http://go.microsoft.com/fwlink/?linkid=14202'.

I didn't like the fact that I had to use the MSHTML stuff (because I think IE uses it too, and we all know that IE sucks :) ) and that I had to add code to make it work.

I don't want to start a browser-war thread, so ignore that last remark. :)

Is there a different (or better) approach to parsing HTML pages in VB.Net?

Basically, what I'm trying to do is get all the links (<a> tags) and embeds (<object> tags) on a page.

Thanks in advance for all your help!

Solution

You can use HTML Agility Pack.
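
As a rough illustration of what that could look like for the links and <object> tags mentioned in the question, here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package is referenced by the project. The module name, the helper function GetAttributeValues and the sample markup are made up for this example, and reading the data attribute of <object> is only one guess at what "embeds" should return.

```vbnet
Imports System
Imports System.Collections.Generic
Imports HtmlAgilityPack

Module LinkExtractor

    ' Hypothetical helper: returns the value of attributeName for every
    ' element matched by the XPath query.
    Function GetAttributeValues(html As String, xpath As String,
                                attributeName As String) As List(Of String)
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        Dim results As New List(Of String)()

        ' SelectNodes returns Nothing (not an empty collection) when nothing matches.
        Dim nodes As HtmlNodeCollection = doc.DocumentNode.SelectNodes(xpath)
        If nodes Is Nothing Then Return results

        For Each node As HtmlNode In nodes
            results.Add(node.GetAttributeValue(attributeName, String.Empty))
        Next
        Return results
    End Function

    Sub Main()
        ' Made-up sample markup, only here to exercise the helper.
        Dim html As String =
            "<html><body><a href=""https://example.com/a"">A</a>" &
            "<object data=""movie.swf""></object></body></html>"

        ' All <a> tags that carry an href attribute.
        For Each href As String In GetAttributeValues(html, "//a[@href]", "href")
            Console.WriteLine("Link: " & href)
        Next

        ' All <object> tags that carry a data attribute.
        For Each data As String In GetAttributeValues(html, "//object[@data]", "data")
            Console.WriteLine("Object: " & data)
        Next
    End Sub

End Module
```

To parse a live page instead of a string, the library's HtmlWeb class can fetch and parse in one step (Dim doc = New HtmlWeb().Load(url)). No COM interop declarations are needed with this approach, so the UCOMIStream warning no longer applies.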
