Cycle through webpages and copy data
Question
I created this script for a friend; it cycles through a real estate website and collects email addresses for her (for promotion). The site offers them freely, but it's inconvenient to grab them one at a time. The first script dumps each page's data into a text file called webdump.txt, and the second extracts the email addresses from that file. Save each script in a separate .vbs file. If you want to test the script, you may want to change the following to a lower number (this is how many pages are processed):
Do while i < 1334
The first one errors out partway through and I'm not totally sure why, and the second one pulls out a little more than just the email addresses; again, I'm not totally sure why. I'm not a highly skilled VBS guy, but those issues aren't related to my question... Question at the bottom...
set ie = createobject("internetexplorer.application")
Set objShell = CreateObject("WScript.Shell")
Dim i
i = 0
Do while i < 1334
i = i + 1
ie.navigate "http://www.reoagents.net/search-3.php?category=1&firmname=&business=&address=&zip=&phone=&fax=&mobile=&im=&manager=&mail=&www=&reserved_1=&reserved_2=&reserved_3=&filterbyday=ANY&loc_one=&loc_two=&loc_three=&loc_four=&location_text=&page="&i
do until ie.readystate = 4 : wscript.sleep 10: loop
pageText = ie.document.body.innertext
set fso = createobject("scripting.filesystemobject")
set ts = fso.opentextfile("c:\webdump.txt",8,true)
ts.write pageText
ts.close
loop
Wscript.Echo "All site data copied!"
Second part:
Const ForReading = 1
Const ForWriting = 8
Set objRegEx = CreateObject("VBScript.RegExp")
objRegEx.Pattern = "@"
Set objFSO = CreateObject("Scripting.FileSystemObject")
'Input file
Set objFileIn = objFSO.OpenTextFile("C:\webdump.txt", ForReading)
strOutputFile = "C:\cleanaddress.txt"
Do Until objFileIn.AtEndOfStream
strSearchString = objFileIn.ReadLine
Set colMatches = objRegEx.Execute(strSearchString)
If colMatches.Count > 0 Then
For Each strMatch in colMatches
' Output File
Set objFileOut = objFSO.OpenTextFile(strOutputFile, ForWriting, True)
IF InStr(strSearchString," ") = 0 THEN
objFileOut.writeline strSearchString
ELSE
objFileOut.writeline Left(strSearchString,InStr(strSearchString," ")-1)
END IF
objFileOut.Close
Set objFileOut = Nothing
Next
End If
Loop
objFileIn.Close
Wscript.Echo "Done!"
I'm able to cycle through the pages on that site easily because of the way the address is built: the last number of the address is sequential. However, now I want to try it with this address:
which seems to be Java based. When I click through each page, the address doesn't change. Is it possible to do something similar to what I've done on the other site in this case?
Recommended answer
Although not complete, not optimal, and not bug-free, this could help:
' VB Script Document
option explicit
Dim strResult: strResult = Wscript.ScriptName
Dim numResult: numResult = 0
Dim ii, IE, pageText, fso, ts, xLink, Links
set fso = createobject("scripting.filesystemobject")
set ts = fso.opentextfile("d:\bat\files\28384650_webdump.txt",8,true)
set IE = createobject("internetexplorer.application")
'read first page
IE.navigate "https://netforum.avectra.com/eweb/DynamicPage.aspx?Site=NEFAR&WebCode=IndResult&FromSearchControl=Yes&FromSearchControl=Yes"
IE.Visible = True
For ii = 1 to 3 '239
ts.writeLine "-----------------" & ii
strResult = strResult & vbNewLine & ii
While IE.Busy
Wscript.Sleep 100
Wend
While IE.ReadyState <> 4
Wscript.Sleep 100
Wend
While IE.document.readystate <> "complete"
wscript.sleep 100
Wend
WScript.Sleep 100
pageText = IE.document.body.innertext
ts.writeLine pageText
' get sublinks and collect them in the 'strResult' variable
Set Links = IE.document.getElementsByTagName("a")
For Each xLink In Links
If InStr(1, xLink.href, "WebCode=PrimaryContactInfo" _
, vbTextCompare) > 0 Then
If InStr(1, strResult, xLink.href, vbTextCompare) > 0 Then
Else
numResult = numResult + 1
strResult = strResult & vbNewLine & xLink.href
End If
End If
Next
' read a page of the 'ii' index
IE.Navigate "javascript:window.__doPostBack('JumpToPage','" & ii+1 & "');"
IE.Visible = True
Next
ts.writeLine "===========" & numResult & vbTab & strResult
ts.close
Wscript.Echo "All site data copied! " _
& numResult & vbNewline & strResult
Wscript.Quit
Explanation:
- navigates to the first page with the usual http(s) address
- navigates to the next pages (of the ii+1 index) with a javascript ...__doPostBack call (the same as if a user filled in the Jump to Page field and clicked the GO button)
- not complete: collects (indirect) links to the Primary Contact Info webpages, where e-mail addresses could be found, without retrieving the addresses themselves
- not optimal: keeps collecting the text of every page visited
- not bug-free:
  - works fine with freshly cleared MSIE temporary files, history and cookies; otherwise it starts at an odd (last visited?) page of netforum.avectra.com
  - navigates to the (ii+1)-th page, so it fails on the last one.
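Since the script above stops short of pulling out the e-mail addresses, the following is a minimal sketch of one way that step could be done once page text has been dumped to a file. It is not part of the original answer. The file paths and the regular expression are assumptions chosen for illustration; the pattern is deliberately simple and will not cover every valid address format.

```vbscript
Option Explicit

' Hypothetical paths, chosen for illustration only.
Const ForReading = 1
Const DumpFile = "C:\webdump.txt"
Const OutFile = "C:\cleanaddress.txt"

Dim fso, re, text, m, seen, outTs
Set fso = CreateObject("Scripting.FileSystemObject")

' Read the whole dump at once (ReadAll raises an error on an empty file).
text = fso.OpenTextFile(DumpFile, ForReading).ReadAll

' Match only the address itself, not the rest of the line.
Set re = CreateObject("VBScript.RegExp")
re.Global = True        ' find every match, not just the first
re.IgnoreCase = True
re.Pattern = "[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}"

' A Dictionary is used to skip addresses that were already written.
Set seen = CreateObject("Scripting.Dictionary")
seen.CompareMode = vbTextCompare

Set outTs = fso.CreateTextFile(OutFile, True)   ' True = overwrite
For Each m In re.Execute(text)
    If Not seen.Exists(m.Value) Then
        seen.Add m.Value, True
        outTs.WriteLine m.Value
    End If
Next
outTs.Close

WScript.Echo seen.Count & " unique addresses written."
```

Compared with matching only "@" and then writing the line up to the first space, this writes just the matched text, which is one likely reason the second script in the question picks up more than the addresses. The output file is opened once, outside the loop, rather than reopened for every match.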