Getting text off a webpage (not the HTML source)
Question
How would I put the contents of a webpage into a string?
It would be the same as pressing Ctrl+A, then copying and pasting the result.
Is there a way to do this programmatically, without SendKeys?
I do not want to look at the HTML source at all; I just want to copy the text on the site.
Recommended answer
I've done a fair bit of screen scraping for applications and have found SgmlReader invaluable: https://github.com/MindTouch/SGMLReader
There is some sample code on that page, but I've added a bit extra here that should return what you want:
Imports System.IO
Imports System.Net
Imports System.Text
Imports System.Xml

' Place both functions inside a Module or Class.

' Parses HTML from the reader into an XmlDocument using SgmlReader.
Function FromHtml(ByVal reader As TextReader) As XmlDocument
    ' Set up SgmlReader to treat the input as HTML
    Dim sgmlReader As New Sgml.SgmlReader()
    sgmlReader.DocType = "HTML"
    sgmlReader.WhitespaceHandling = WhitespaceHandling.None
    sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower
    sgmlReader.InputStream = reader

    ' Create the document
    Dim doc As New XmlDocument()
    doc.PreserveWhitespace = True
    doc.XmlResolver = Nothing
    doc.Load(sgmlReader)
    Return doc
End Function

' Downloads the page at URL and returns its text content without markup.
Function LoadWebText(ByVal URL As String) As String
    ' Assumes the page is UTF-8 encoded
    Using client As New WebClient()
        Dim html As String = Encoding.UTF8.GetString(client.DownloadData(URL))
        Dim doc As XmlDocument = FromHtml(New StringReader(html))
        Return doc.InnerText
    End Using
End Function
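One caveat with the approach above: InnerText concatenates every text node in the document, so the contents of script and style elements end up in the result alongside the visible text. The following is a minimal sketch of one way to drop them first. The module name VisibleTextSketch and the helper VisibleText are hypothetical names introduced here for illustration, and the sample markup in Main is made up so the example runs without a network connection or SgmlReader.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Xml

Module VisibleTextSketch
    ' Hypothetical helper: removes <script> and <style> elements from the
    ' document, then returns the text that remains.
    Function VisibleText(ByVal doc As XmlDocument) As String
        ' Match on local-name() so the query works whether or not the
        ' elements carry an XML namespace.
        Dim query As String = "//*[local-name()='script' or local-name()='style']"

        ' Copy the matches into a list first, so no node is removed while
        ' the XPath result is still being enumerated.
        Dim toRemove As New List(Of XmlNode)
        For Each node As XmlNode In doc.SelectNodes(query)
            toRemove.Add(node)
        Next

        For Each node As XmlNode In toRemove
            node.ParentNode.RemoveChild(node)
        Next

        Return doc.InnerText
    End Function

    Sub Main()
        ' Made-up, well-formed sample markup standing in for a parsed page.
        Dim doc As New XmlDocument()
        doc.LoadXml("<html><head><style>p { color: red; }</style></head>" &
                    "<body><p>Hello</p><script>var x = 1;</script></body></html>")
        Console.WriteLine(VisibleText(doc))
    End Sub
End Module
```

To combine this with the answer's code, LoadWebText could pass the document returned by FromHtml to VisibleText in place of reading InnerText directly. The result is still not identical to a Ctrl+A copy in a browser, since text hidden by CSS or produced by scripts is not accounted for.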