Getting text off a webpage (not the HTML source)


Question

How would I put the contents of a webpage into a string?

It would be the same thing as hitting Ctrl+A and then copying and pasting the result.

Is there a way to do this programmatically, without using SendKeys?

I do not want to look at the HTML source at all; I just want to copy the text on the site.

Recommended answer

I've done a fair bit of screen scraping for applications and have found this library to be invaluable: https://github.com/MindTouch/SGMLReader

There is a bit of sample code on that page, but I've added a bit extra here that will return exactly what you want:

Imports System.Xml
Imports System.IO
Imports System.Net
Imports System.Text

Function FromHtml(ByVal reader As TextReader) As XmlDocument
    '' setup SgmlReader   
    Dim sgmlReader As Sgml.SgmlReader = New Sgml.SgmlReader()
    sgmlReader.DocType = "HTML"
    sgmlReader.WhitespaceHandling = WhitespaceHandling.None
    sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower
    sgmlReader.InputStream = reader
    '' create document 
    Dim doc As XmlDocument = New XmlDocument()
    doc.PreserveWhitespace = True
    doc.XmlResolver = Nothing
    doc.Load(sgmlReader)
    Return doc
End Function

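Function LoadWebText(ByVal URL As String) As String
    ' Dispose the WebClient once the download has finished
    Using objWebClient As New WebClient()
        ' Download the raw bytes and decode them as UTF-8 (assumes the page is UTF-8 encoded)
        Dim html As String = Encoding.UTF8.GetString(objWebClient.DownloadData(URL))

        ' Parse the HTML into an XmlDocument, then return only the text of its nodes
        Dim xml As XmlDocument = FromHtml(New StringReader(html))
        Return xml.InnerText
    End Using
End Function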
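
To see what InnerText gives back without touching the network, the same approach can be tried on an in-memory string. This is only a minimal sketch: the module name Demo and the sample markup are made up for illustration, and it assumes the project already references the SgmlReader assembly from the link above.

```vbnet
Imports System
Imports System.IO
Imports System.Xml

Module Demo
    Sub Main()
        ' Hypothetical sample markup standing in for a downloaded page
        Dim html As String = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"

        ' Same SgmlReader setup as in FromHtml, fed from a string instead of a download
        Dim sgmlReader As New Sgml.SgmlReader()
        sgmlReader.DocType = "HTML"
        sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower
        sgmlReader.InputStream = New StringReader(html)

        Dim doc As New XmlDocument()
        doc.XmlResolver = Nothing
        doc.Load(sgmlReader)

        ' InnerText concatenates the text nodes of the document and leaves the tags out
        Console.WriteLine(doc.InnerText)
    End Sub
End Module
```

Note that InnerText returns every text node, so the contents of script and style elements, if the page has any, would probably need to be removed from the document first to get closer to what a select-all and copy in the browser produces.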
