Cycle through webpages and copy data

Problem description

I created this script for a friend; it cycles through a real estate website and snags email addresses for her (for promotion). The site offers them freely, but it's inconvenient to grab them one at a time. The first script dumps each page's data into a txt file called webdump and the second extracts the email addresses from that txt file. Save each of these in a separate .vbs file. If you want to test the script, you may want to change the following to a lower number (this is how many pages are processed):

Do while i < 1334

The first one errors out a ways in and I'm not totally sure why, and the second one pulls out a little more than just the email addresses and, again, I'm not totally sure why. I'm not a highly skilled VBS guy, but those issues aren't related to my question... Question at the bottom...

Set ie = CreateObject("InternetExplorer.Application")
Set objShell = CreateObject("WScript.Shell")
Dim i
i = 0

Do While i < 1334
    i = i + 1

    ' the page number is the last query-string parameter, so just append i
    ie.Navigate "http://www.reoagents.net/search-3.php?category=1&firmname=&business=&address=&zip=&phone=&fax=&mobile=&im=&manager=&mail=&www=&reserved_1=&reserved_2=&reserved_3=&filterbyday=ANY&loc_one=&loc_two=&loc_three=&loc_four=&location_text=&page=" & i
    Do Until ie.ReadyState = 4 : WScript.Sleep 10 : Loop

    pageText = ie.Document.body.innerText

    ' append this page's text to the dump file (mode 8 = ForAppending)
    Set fso = CreateObject("Scripting.FileSystemObject")
    Set ts = fso.OpenTextFile("c:\webdump.txt", 8, True)
    ts.Write pageText
    ts.Close
Loop

WScript.Echo "All site data copied!"

The second part:

Const ForReading = 1
Const ForAppending = 8

Set objRegEx = CreateObject("VBScript.RegExp")
' matches any line that contains an @ sign
objRegEx.Pattern = "@"

Set objFSO = CreateObject("Scripting.FileSystemObject")

' Input file
Set objFileIn = objFSO.OpenTextFile("C:\webdump.txt", ForReading)
strOutputFile = "C:\cleanaddress.txt"

Do Until objFileIn.AtEndOfStream
    strSearchString = objFileIn.ReadLine
    Set colMatches = objRegEx.Execute(strSearchString)
    If colMatches.Count > 0 Then
        For Each strMatch In colMatches
            ' Output file (re-opened for append on every match)
            Set objFileOut = objFSO.OpenTextFile(strOutputFile, ForAppending, True)

            ' write either the whole line or everything up to the first space
            If InStr(strSearchString, " ") = 0 Then
                objFileOut.WriteLine strSearchString
            Else
                objFileOut.WriteLine Left(strSearchString, InStr(strSearchString, " ") - 1)
            End If

            objFileOut.Close
            Set objFileOut = Nothing
        Next
    End If
Loop

objFileIn.Close
WScript.Echo "Done!"

I'm able to cycle through the pages on that site easily because of the way the address is... the last number of the address is sequential. However, now I want to try it with this address:

https://netforum.avectra.com/eweb/DynamicPage.aspx?Site=NEFAR&WebCode=IndResult&FromSearchControl=Yes&FromSearchControl=Yes

which seems to be JavaScript based. When I click through each page, the address doesn't change. Is it possible to do something similar to what I've done on the other site in this case?

Recommended answer

Although not complete, not optimal, not bugfree, this could help:

' VB Script Document
option explicit

Dim strResult: strResult = Wscript.ScriptName
Dim numResult: numResult = 0
Dim ii, IE, pageText, fso, ts, xLink, Links

  set fso = createobject("scripting.filesystemobject") 
  set ts = fso.opentextfile("d:\bat\files\28384650_webdump.txt",8,true) 

  set IE = createobject("internetexplorer.application") 

  'read first page
  IE.navigate "https://netforum.avectra.com/eweb/DynamicPage.aspx?Site=NEFAR&WebCode=IndResult&FromSearchControl=Yes&FromSearchControl=Yes"
  IE.Visible = True

For ii = 1 to 3 '239
  ts.writeLine "-----------------" & ii
  strResult = strResult & vbNewLine & ii

  While IE.Busy
    Wscript.Sleep 100
  Wend
  While IE.ReadyState <> 4
    Wscript.Sleep 100
  Wend
  While IE.document.readystate <> "complete" 
      wscript.sleep 100
  Wend
  WScript.Sleep 100

  pageText = IE.document.body.innertext
  ts.writeLine pageText

  ' get sublinks and collect them in the 'strResult' variable
  Set Links = IE.document.getElementsByTagName("a")
  For Each xLink In Links
    If InStr(1, xLink.href, "WebCode=PrimaryContactInfo", vbTextCompare) > 0 Then
      ' only collect each link once
      If InStr(1, strResult, xLink.href, vbTextCompare) = 0 Then
        numResult = numResult + 1
        strResult = strResult & vbNewLine & xLink.href
      End If
    End If
  Next

  ' navigate to the next page (index ii+1) via the page's own postback
  IE.Navigate "javascript:window.__doPostBack('JumpToPage','" & (ii + 1) & "');"
  IE.Visible = True
Next

  ts.writeLine "===========" & numResult & vbTab & strResult
  ts.close 

Wscript.Echo "All site data copied! " _
    & numResult & vbNewline & strResult
Wscript.Quit

Explanation:

  • navigates to the first page with the usual http(s) address
  • navigates to the next pages (of the ii+1 index) with a javascript ... __doPostBack call (the same as if one filled in the Jump to Page field and clicked the GO button)
  • not complete: collects (indirect) links to the Primary Contact Info webpages where the e-mail addresses could be found, but does not fetch them (a sketch of that step follows this list)
  • not optimal: keeps collecting the text of every page visited
  • not bugfree:

  • works fine with freshly cleared MSIE temporary files, history and cookies; otherwise it starts at an odd (last visited?) page of netforum.avectra.com
  • navigates to the (ii+1)th page, so it fails on the last one.
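To take the "not complete" point one step further, here is a minimal, untested sketch of how the collected Primary Contact Info links could be visited and mined for addresses. It assumes it is appended to the answer script before ts.close, so that IE, strResult and ts are still in scope, and that the addresses appear somewhere in each contact page's body text:

Dim arrLinks, strLink, objMailRegEx, objMatch
Set objMailRegEx = CreateObject("VBScript.RegExp")
objMailRegEx.Global = True
objMailRegEx.Pattern = "[\w.\-]+@[\w.\-]+\.[A-Za-z]{2,}"   ' simplified e-mail pattern

' strResult holds one collected URL per line (plus the script name and page indexes)
arrLinks = Split(strResult, vbNewLine)
For Each strLink In arrLinks
    If InStr(1, strLink, "WebCode=PrimaryContactInfo", vbTextCompare) > 0 Then
        IE.Navigate strLink
        While IE.Busy Or IE.ReadyState <> 4
            WScript.Sleep 100
        Wend
        ' scan the contact page for anything that looks like an address
        For Each objMatch In objMailRegEx.Execute(IE.document.body.innerText)
            ts.WriteLine objMatch.Value
        Next
    End If
Next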
