Extract data from a web page that may not be formatted as a table


Problem description


For starters, I am by no means an expert in VBA. I just know enough to be dangerous 8).

I started out by searching for how to extract a table from a web page and saw that many, many people have asked the same question. Unfortunately, most of what I read was over my head. One article pointed me to a detailed article by Siddharth Rout, but alas I could not follow what was going on, other than that there are two approaches: Internet Explorer or some other method. Since I only have IE11 and MS Office installed, I would prefer to go the IE route.

I have encountered this problem several times in the past and have always dropped the project or done things manually. Today I thought I would try and learn how to do this and make my future life hopefully a little easier. As such I am going to use data from a gaming website since it mimics other things I have encountered in the past.

So today's (this week's..no this month's..I am an optimist!) project is to build a list of every team involved in a tournament and copy their results into Excel. This would be akin to pulling cricket, hockey, baseball, soccer, or football stats. I tried using Excel's built-in Get Data From Web process, but it did not identify the table on the web page.

The address for the web page is: http://worldoftanks.com/en/tournaments/1000000017/

and is in the image below

So my starting point is simply to pull the list of teams from one group and paste it into an Excel sheet with no formatting. Basically the area in yellow in the image above. The image could not fit the whole page, but there are actually 10 teams in this group. However, I would like to make it variable, as sometimes a group may have more or fewer than 10 teams. I am going to assume the number of rows is a minor issue at this point.

Once I get that part figured out I am hoping it will be relatively easy to switch to the next group, grab that list of teams and results and add them to the end of the list I am building in excel. On the web page this would be done by selecting the blue areas.

Once I have those two things figured out, I would need to build the list again from scratch for each stage of the tournament (the areas in green) and put that list on a new sheet. I have some ideas on how to achieve this, but it will really depend on what the previous two steps look like.

I have a bonus task for myself too which is to pull the schedule for each team in a group to see how they did against various other teams. Who beat who type deal. I am hoping I can figure that part out based on the information learned from the task above.

I am pretty sure there are other languages/programs that are better suited to the task at hand, but I would like to stick with what I have...and the little I know so far. So I tried a wee bit of VBA code and commented on what I need to achieve. So far I think I have opened the web page! And I have built a bit of a thought process in comments on how to do some of the things.

Sub GetTeamData()
Dim IE As Object
Dim roundcounter As Integer
Dim groupcounter As Integer
Dim TeamList As Variant
Dim WebAddress As String
Dim Number_of_rounds as Integer
Dim Number_of_Groups as Integer

'set webaddress of site to link to
WebAddress = "http://worldoftanks.com/en/tournaments/1000000017/"

Set IE = CreateObject("InternetExplorer.Application")

With IE
    .Visible = True
    .navigate (WebAddress)
End With

'What does this chunk of code do?  Wait for webpage to finish loading?
While IE.readyState <> 4
    DoEvents
Wend

'set initial parameters for loops.  I am ok with hardcoding this for now.
Number_of_groups = 125
Number_of_rounds = 5

'start pulling teamdata

'For roundcounter = 1 To number_of_rounds
    'select roundcounter on webpage
    'for groupcounter = 1 to number_of_groups
        'select groupcounter on webpage
        'grab table of 6-10 teams (position, team name, battles, wins, losses, ties, and points)
        'add table to TeamList
    'next groupcounter
    'paste TeamList to sheet roundcounter cell A1
    'clear TeamList
'next roundcounter


'Next task
'based on results on how to pull group table date, pull individual team schedule results to build matrix result

Set IE = Nothing

End Sub

One thing I was wondering is whether, instead of using For...Next loops with a counter, it would be easier to loop until an error occurs, such as exceeding the number of groups or rounds. Now I am rambling.

Anyhow if someone would be so kind to get me started on how to pull the yellow area from the image above that would be much appreciated! Please be gentle! I do realize that this question has been asked many a time... I just did not understand what I was reading. Also if this is not possible or extremely difficult to do please let me know. Thank you in advance for your assistance in educating me.

UPDATE 16/03/19 0900

So I tried the Get Data From Web process again this morning with a bit more luck...but not much.

After one error window, which I click Yes to, I get the web page to load.

I got the little yellow arrow to show up once on the page in the very top left corner. So I tried it and it did pull in information.

But I did notice there were no yellow boxes next to the table I want, which makes me wonder whether it is a table at all.

When I did pull in information, it was not the information I was looking for. When I scanned through the results, I could see where the data I am looking for should be, but all the results are missing, just the table column headers show up in about Row 263 or so.

So then I tried a copy-and-paste method, using Select All on the web page for the copy. For the paste I tried different methods. Keeping source formatting resulted in nothing. Keeping destination formatting brought in information. I also tried Paste Special (HTML, Unicode and Text): HTML made things look pretty, and the other two put everything into a single column. More importantly, the results were in the table.

Now if I only needed the round 1, group 1 team list and results, I could work with this. Simply delete all the rows above and below the table and voila! However, since the web address is the same for every group and every round, I have no idea how to "click" on the blue or green areas to update the info. If I knew this, I could automate the process by copying and pasting each page, then trimming the results to just the table, and moving the table to another sheet just below the last results.

To me it seems like there should be a better method.

16/03/19 1600

<!-- ko if: visibleBracketType() === ROUND_ROBIN -->
<table class="tournament-table tournament-table__indent" cellpadding="0" cellspacing="0">
  <tr class="tournament-table_tr">
    <th class="tournament-table_th tournament-table_th__numb">#</th>
    <th class="tournament-table_th">
      <div class="tournament-table_ico-holder">
        <span class="ico-team">Team</span>
      </div>
      <div class="tournament-table_heading-text">
        Team
      </div>
    </th>
    <th class="tournament-table_th">
      <div class="tournament-table_ico-holder">
        <span class="ico-battles">Battles</span>
      </div>
      <div class="tournament-table_heading-text">
        Battles
      </div>
    </th>
    <th class="tournament-table_th">
      <div class="tournament-table_ico-holder">
        <span class="ico-victory">Victories</span>
      </div>
      <div class="tournament-table_heading-text">
        Victories
      </div>
    </th>
    <th class="tournament-table_th tournament-table_th__mobile-hide">
      <div class="tournament-table_ico-holder">
        <span class="ico-flag">Defeats</span>
      </div>
      <div class="tournament-table_heading-text">
        Defeats
      </div>
    </th>
    <th class="tournament-table_th tournament-table_th__mobile-hide">
      <div class="tournament-table_ico-holder">
        <span class="ico-division">Draws</span>
      </div>
      <div class="tournament-table_heading-text">
        Draws
      </div>
    </th>
    <th class="tournament-table_th">
      <div class="tournament-table_ico-holder">
        <span class="ico-points">Points</span>
      </div>
      <div class="tournament-table_heading-text">
        Points
      </div>
    </th>
  </tr>
  <!-- ko foreach: {data: rrBrackets().teams, as: 'team' } -->
  <tr class="tournament-table_tr" data-bind="css: {'tournament-table_tr__my-team': team.team_id === $root.currentUserTeamIdInCurrentGroup()}">
    <td class="tournament-table_td" data-bind="text: team.position"></td>
    <td class="tournament-table_td" data-bind="css: {'tournament-table_td__my-team': team.team_id === $root.currentUserTeamIdInCurrentGroup()}">
      <a class="tournament-table_team tournament-table_team__big" target="_blank" data-bind="text: team.team_title, attr: {href: $root.getTournamentTeamUrl(team.team_id)}"></a>
    </td>
    <td class="tournament-table_td" data-bind="text: team.battle_played"></td>
    <td class="tournament-table_td" data-bind="text: team.wins"></td>
    <td class="tournament-table_td tournament-table_td__mobile-hide" data-bind="text: team.losses"></td>
    <td class="tournament-table_td tournament-table_td__mobile-hide" data-bind="text: team.draws"></td>
    <td class="tournament-table_td" data-bind="text: team.extra_statistics.points"></td>
  </tr>
  <!-- /ko -->
</table>

OK, from what I am gathering from the various posts I have been reading and videos I have been watching, I need to find some critical "tag" in the code of the web page, and from that I can eventually start pulling data. I hit F12 in IE to view the code, and then in the code area I searched for some of the display text in the area I was looking at and found the above chunk of "code". It took a lot of GUESSING, so I am hoping I grabbed the right chunk. Now to figure out what that critical tag is and how to use it. By the way, what code is that web page in?

Solution

I have written a small Sub which I think should solve your problem, if I understood you correctly. Of course you will have to invest some work, since it only reads one stage right now. But it reads the data from every group:

Option Explicit

Private Sub CommandButton1_Click()

    ' Make sure you add references to Microsoft Internet Controls (shdocvw.dll)
    ' and Microsoft HTML Object Library.
    ' The code will NOT run otherwise.

    Dim objIE As SHDocVw.InternetExplorer ' Microsoft Internet Controls (shdocvw.dll)
    Dim htmlDoc As MSHTML.HTMLDocument ' Microsoft HTML Object Library
    Dim htmlInput As MSHTML.HTMLInputElement
    Dim htmlColl As MSHTML.IHTMLElementCollection

    Set objIE = New SHDocVw.InternetExplorer

    Dim htmlCurrentDoc As MSHTML.HTMLDocument ' Microsoft HTML Object Library

    ' Next free row on the sheet; keeps growing across all groups.
    Dim RowNumber As Integer
    RowNumber = 1

    With objIE
        .Navigate "http://worldoftanks.com/en/tournaments/1000000017/" ' Main page
        .Visible = 0
        ' Wait until IE reports the page as fully loaded (readyState 4).
        Do While .readyState <> 4: DoEvents: Loop
        Application.Wait (Now + TimeValue("0:00:01"))

        Set htmlDoc = .document

        ' Links for the tournament stages (not used yet; only one stage is read).
        Dim ButtonRoundData As Variant
        Set ButtonRoundData = htmlDoc.getElementsByClassName("group-stage_link")

        ' Links for the groups within the current stage.
        Dim ButtonData As Variant
        Set ButtonData = htmlDoc.getElementsByClassName("groups_link")

        Dim button As HTMLLinkElement
        For Each button In ButtonData

            Debug.Print button.nodeName

            button.Click

            ' This is to prevent double entries, but it is not clean. You should
            ' definitely check whether the table is still the same and wait if so.
            Application.Wait (Now + TimeValue("0:00:02"))

            Set htmlCurrentDoc = .document
            Dim RawData As HTMLTable
            Set RawData = htmlCurrentDoc.getElementsByClassName("tournament-table tournament-table__indent")(0)

            Dim ColumnNumber As Integer
            ColumnNumber = 1

            ' Copy the table to the sheet, cell by cell.
            Dim hRow As HTMLTableRow
            Dim hCell As HTMLTableCell
            For Each hRow In RawData.Rows

                For Each hCell In hRow.Cells
                    Cells(RowNumber, ColumnNumber).Value = hCell.innerText
                    ColumnNumber = ColumnNumber + 1
                Next hCell
                ColumnNumber = 1
                RowNumber = RowNumber + 1
            Next hRow

            ' Leave a gap of blank rows between groups.
            RowNumber = RowNumber + 3
        Next button
    End With

End Sub

What it does is start an invisible IE, read the data, click the next group button, read the next table, and so on ...
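Since the Sub above only walks the group links of one stage, here is a rough sketch of how an outer loop over the stages might look. It is untested against the live page and rests on assumptions: that the elements with class "group-stage_link" really are the stage links, and that clicking one re-renders the "groups_link" elements. The name VisitAllStagesAndGroups is made up for illustration. It re-queries the links on every pass and stops when it runs out of them, so the number of stages and groups does not need to be hardcoded.

```vba
' Hypothetical sketch (not part of the original answer): visits every stage
' link, then every group link within that stage. The class names are the ones
' used in the Sub above; whether "group-stage_link" really marks the stage
' links is an assumption.
Private Sub VisitAllStagesAndGroups(ByVal ie As Object)
    Dim stageLinks As Object, groupLinks As Object
    Dim stageIndex As Long, groupIndex As Long

    stageIndex = 0
    Do
        ' Re-query on every pass in case the page re-rendered the links.
        Set stageLinks = ie.document.getElementsByClassName("group-stage_link")
        If stageIndex >= stageLinks.Length Then Exit Do
        stageLinks.Item(stageIndex).Click
        Application.Wait Now + TimeValue("0:00:02")

        groupIndex = 0
        Do
            Set groupLinks = ie.document.getElementsByClassName("groups_link")
            If groupIndex >= groupLinks.Length Then Exit Do
            groupLinks.Item(groupIndex).Click
            Application.Wait Now + TimeValue("0:00:02")

            ' Copy the table to the sheet here, as the Sub above does.
            Debug.Print "Stage " & (stageIndex + 1) & ", group " & (groupIndex + 1)

            groupIndex = groupIndex + 1
        Loop
        stageIndex = stageIndex + 1
    Loop
End Sub
```

The fixed two-second waits are kept from the Sub above for simplicity; they carry the same risk of reading a table that has not refreshed yet.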

For debugging I suggest setting .Visible to 1, so you can see what happens.

EDIT 1: If you get a debugging error, try to abort and run it again. It definitely needs some error handling for the case where the website has not loaded correctly.
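One possible starting point for that error handling is to replace the open-ended readyState loop with a wait that gives up after a timeout. This is only a minimal sketch: WaitForPage and the 30-second limit are made-up names and values, not part of the code above.

```vba
' Hypothetical helper (not part of the original answer).
' Waits until IE is idle and reports readyState 4 (complete), or until
' timeoutSeconds have passed. Returns True if the page finished loading.
Private Function WaitForPage(ByVal ie As Object, ByVal timeoutSeconds As Double) As Boolean
    Dim startTime As Double
    startTime = Timer

    Do While ie.Busy Or ie.readyState <> 4
        DoEvents
        ' Timer resets at midnight; this simple check ignores that edge case.
        If Timer - startTime > timeoutSeconds Then
            WaitForPage = False
            Exit Function
        End If
    Loop

    WaitForPage = True
End Function

' Possible use right after .Navigate in the Sub above:
'     If Not WaitForPage(objIE, 30) Then
'         MsgBox "The page did not finish loading."
'         objIE.Quit
'         Exit Sub
'     End If
```

A finished readyState only means the HTML has loaded. The table rows on this page are filled in by script afterwards, so a further check on the table content is still needed.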

EDIT 2: Made it a bit more stable. You should really pay attention here: since the web page takes some time to load, you MUST check whether the data has changed before writing it. If it hasn't changed, wait a second or so and then try again.
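The check described in EDIT 2 could be sketched roughly like this. TableHasChanged is a made-up helper, and comparing the table's innerText against the previous group's text is just one assumed way of detecting that new data has arrived.

```vba
' Hypothetical helper (not part of the original answer).
' Polls the first element matching className until its text differs from
' lastText, waiting one second between tries. On success, lastText is updated
' and the function returns True; after maxTries it gives up and returns False.
Private Function TableHasChanged(ByVal doc As Object, ByVal className As String, _
                                 ByRef lastText As String, ByVal maxTries As Long) As Boolean
    Dim tries As Long
    Dim matches As Object
    Dim currentText As String

    For tries = 1 To maxTries
        Set matches = doc.getElementsByClassName(className)
        If matches.Length > 0 Then
            currentText = matches.Item(0).innerText
            If currentText <> lastText Then
                lastText = currentText
                TableHasChanged = True
                Exit Function
            End If
        End If
        Application.Wait Now + TimeValue("0:00:01")
    Next tries

    TableHasChanged = False
End Function
```

In the loop over the group buttons, a call such as TableHasChanged(.document, "tournament-table tournament-table__indent", lastTableText, 10) could stand in for the fixed two-second wait, with lastTableText declared as a String before the loop. Two caveats: if two consecutive groups happened to show identical tables, this would wrongly report no change, and on the very first call the header row alone already differs from an empty lastText.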

Here is some sample data I got in Excel:
