How to get at this data

Question

I am looking to scrape the three items that are highlighted and bordered from the html sample below. I've also highlighted a few markers that look useful.

How would you do this?

A Solution

Ok so this wasn't a great question and I'm actually surprised it didn't get down-voted more! Oh well, here are some bread crumbs for someone else.

Three of the four items of info I want are the inner text of a span element with a known id (i.e., $0.83 for "yfs_l10_gm150220c00036500"), so the following helper function seems to be a decent and direct shot:

''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
' GetSpanTextForId
'
' Returns the inner text of the span element with the passed id,
' implicitly converted to a Double
'
' param doc:     the source HTMLDocument
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
Function GetSpanTextForId(ByRef doc As HTMLDocument, ByVal spanId As String) As Double
'   Error Handling
    On Error GoTo ErrHandler
    Dim sRoutine        As String
    sRoutine = cModule & ".GetSpanTextForId"
     
    CheckArgNotNothing doc, "doc"
    CheckArgNotBadString spanId, "spanId"
'   Procedure
    Dim oSpan As HTMLSpanElement
    Set oSpan = doc.getElementById(spanId)
    Check Not oSpan Is Nothing, "Could not find span with id: " & Bracket(spanId)
    GetSpanTextForId = oSpan.innerText
    
    Exit Function

ErrHandler:
    Select Case DspErrMsg(sRoutine)
         Case Is = vbAbort:  Stop: Resume    'Debug mode - Trace
         Case Is = vbRetry:  Resume          'Try again
         Case Is = vbIgnore:                 'End routine
     End Select


End Function

The only item that cannot be read directly from a span with a known id is the Open Interest, which sits in a table that is the 2nd child of an element with an id. The following methods find the cell containing the label text (i.e., "Open Interest:") and return the text of the cell that immediately follows it:

''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
' GetOpenInterest
'
' The latest available Open Interest.
'
' param doc:     the source HTMLDocument
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
Function GetOpenInterest(ByRef doc As HTMLDocument) As Long
    Dim tbl As IHTMLTable
    Set tbl = GetSummaryDataTable(doc, 1)
    Dim k As Integer
    k = mWebScrapeHelpers.GetCellNumberForTextStartingWith(tbl, "Open Interest:")
    Check k >= 0, "Could not find the Open Interest label cell"
    ' Long/CLng rather than Integer/CInt: open interest can exceed 32,767
    GetOpenInterest = CLng(mWebScrapeHelpers.GetCellTextFromCellNumber(tbl, k + 1))
End Function


Function GetCellNumberForTextStartingWith(ByRef tbl As IHTMLTable, ByRef s As String) As Integer
'   Error Handling
    On Error GoTo ErrHandler
    Dim sRoutine        As String
    sRoutine = cModule & ".GetCellNumberForTextStartingWith"
    
    CheckArgNotNothing tbl, "tbl"
    
'   Procedure
    Dim tblCell As HTMLTableCell
    Dim k As Integer

    For Each tblCell In tbl.Cells
        ' "*" & s matches any cell whose text ends with s (leading text is ignored)
        If tblCell.innerText Like ("*" & s) Then
            GetCellNumberForTextStartingWith = k
            Exit Function
        End If
        k = k + 1
    Next
    
    ' if we got here it was not found so
    GetCellNumberForTextStartingWith = -1
    Exit Function

ErrHandler:
    Select Case DspErrMsg(sRoutine)
         Case Is = vbAbort:  Stop: Resume    'Debug mode - Trace
         Case Is = vbRetry:  Resume          'Try again
         Case Is = vbIgnore:                 'End routine
     End Select
     
End Function

Function GetCellTextFromCellNumber(ByRef tbl As IHTMLTable, ByRef nbr As Integer) As String
'   Error Handling
    On Error GoTo ErrHandler
    Dim sRoutine        As String
    sRoutine = cModule & ".GetCellTextFromCellNumber"
    
    CheckArgNotNothing tbl, "tbl"
    Check tbl.Cells.Length > 0, "table is empty"
    ' Cells is zero-based, so nbr must be strictly less than Length
    Check tbl.Cells.Length > nbr, "table only has " & tbl.Cells.Length & " cells; can't get cell number " & nbr
    
'   Procedure
    GetCellTextFromCellNumber = tbl.Cells(nbr).innerText
    Exit Function

ErrHandler:
    Select Case DspErrMsg(sRoutine)
         Case Is = vbAbort:  Stop: Resume    'Debug mode - Trace
         Case Is = vbRetry:  Resume          'Try again
         Case Is = vbIgnore:                 'End routine
     End Select


End Function

These methods work fine, but there are clearly many different approaches that would work, including the regex parsing approach suggested as an answer. The excellent link by RedShift went further into analyzing the HTML and coming up with a strategy.
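For comparison, the label-then-next-cell lookup used by the VBA helpers can be sketched in JavaScript. This is only an illustration under assumptions: `findCellIndex` and `getOpenInterest` are hypothetical names, and a plain array of cell texts stands in for `tbl.Cells`.

```javascript
// Hypothetical stand-ins for the VBA helpers; "cells" is a plain array
// of cell texts in document order (an assumption, not a real DOM table).
function findCellIndex(cells, label) {
  for (var i = 0; i < cells.length; i++) {
    // Same idea as VBA's Like "*" & s, applied after trimming whitespace:
    // the cell text must end with the label.
    if (cells[i].trim().endsWith(label)) return i;
  }
  return -1; // label not found
}

function getOpenInterest(cells) {
  var k = findCellIndex(cells, "Open Interest:");
  if (k === -1 || k + 1 >= cells.length) {
    throw new Error("Open Interest cell not found");
  }
  // Strip thousands separators before converting, e.g. "11,579".
  return parseInt(cells[k + 1].replace(/,/g, ""), 10);
}

console.log(getOpenInterest(["Open Interest:", "11,579"]));
```

Throwing when the label is missing avoids silently reading the first cell, which is what adding 1 to a not-found index of -1 would do.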

Cheers

Solution

I would probably use an XML parser to get the text content first (or this: xmlString.replace(/<[^>]+>/g, "") to replace all tags with empty strings), then use the following regexes to extract the information you need:

/-OPR\s+(\d+\.\d+)/
/Bid:\s+(\d+\.\d+)/
/Ask:\s+(\d+\.\d+)/
/Open Interest:\s+(\d+,\d+)/

This process can easily be done in nodejs or in any other language that supports regex.
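A minimal sketch of that process, runnable in nodejs, might look like the following. The function name `extractQuote`, its field names, and the shortened `sample` string are assumptions made for illustration; the four patterns are the ones listed above. It relies on whitespace existing between the label and the value once the tags are gone, as in the sample markup.

```javascript
// Sketch only: extractQuote and its field names are hypothetical.
function extractQuote(html) {
  // Drop every tag; whitespace between tags survives and feeds \s+.
  var text = html.replace(/<[^>]+>/g, "");

  var patterns = {
    last: /-OPR\s+(\d+\.\d+)/,
    bid: /Bid:\s+(\d+\.\d+)/,
    ask: /Ask:\s+(\d+\.\d+)/,
    openInterest: /Open Interest:\s+(\d+,\d+)/
  };

  var result = {};
  Object.keys(patterns).forEach(function (name) {
    var match = text.match(patterns[name]);
    // null marks a field whose pattern was not found.
    result[name] = match ? match[1] : null;
  });
  return result;
}

// Shortened, made-up stand-in for the page markup.
var sample =
  '<span class="rtq_exch"><span class="rtq_dash">-</span>OPR</span>\n' +
  '<span id="yfs_l10_gm150220c00036500">0.83</span>\n' +
  '<th>Bid:</th>\n<td><span>0.76</span></td>\n' +
  '<th>Ask:</th>\n<td><span>0.90</span></td>\n' +
  '<th>Open Interest:</th>\n<td>11,579</td>';

console.log(extractQuote(sample));
```

The values come back as strings exactly as matched. Note that the Open Interest pattern, as given, only matches numbers with exactly one thousands separator.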


Live demo:

  • Waits 1 second, then removes tags.
  • Waits another second, then finds all patterns and creates a table.

var wait = true; // Set to false to execute instantly.

var elem = document.getElementById("parsingStuff");
var str = elem.textContent;

var keywords = ["-OPR", "Bid:", "Ask:", "Open Interest:"];
var output = {};
var timeout = 0;

if (wait) timeout = 1000;

setTimeout(function() { // Removing tags.
  elem.innerHTML = elem.textContent;
}, timeout);

if (wait) timeout = 2000;

setTimeout(function() { // Looking for patterns.
  for (var i = 0; i < keywords.length; i++) {
    output[keywords[i]] = str.match(RegExp(keywords[i] + "\\s+(\\d+[\\.,]\\d+)"))[1];
  }

  // Creating basic table of found data.
  elem.innerHTML = "";
  var table = document.createElement("table");
  for (var k in output) {
    var tr = document.createElement("tr");
    var th = document.createElement("th");
    var td = document.createElement("td");

    th.style.border = "1px solid gray";
    td.style.border = "1px solid gray";

    th.textContent = k;
    td.textContent = output[k];

    tr.appendChild(th);
    tr.appendChild(td);

    table.appendChild(tr);
  }
  elem.appendChild(table);
}, timeout);

<div id="parsingStuff">
  <div class="yfi_rt_quote_summary" id="yfi_rt_quote_summary">
    <div class="hd">
      <div class="title">
        <h2>GM Feb 2015 36.500 call (GM150220C00036500)</h2>
        <span class="rtq_exch">
        <span class="rtq_dash">-</span>OPR
        </span>
        <span class="wl_sign"></span>
      </div>
    </div>
    <div class="yfi_rt_quote_summary_rt_top sigfig_promo_1">
      <div>
        <span class="time_rtq_ticker">

        <span id="yfs_110_gm150220c00036500">0.83</span>
        </span>
      </div>
    </div>
  </div>
  <div class="yui-u first yfi-start-content">
    <div class="yfi_quote_summary">
      <div id="yfi_quote_summary_data" class="rtq_table">
        <table id="table1">
          <tr>
            <th scope="row" width="48%">Bid:</th>
            <td class="yfnc_tabledata1">

              <span id="yfs_b00_gm150220c00036500">0.76</span>
            </td>
          </tr>
          <tr>
            <th scope="row" width="48%">Ask:</th>
            <td class="yfnc_tabledata1">

              <span id="yfs_a00_gm150220c00036500">0.90</span>
            </td>
          </tr>
        </table>
        <table id="table2">
          <tr>
            <th scope="row" width="48%">Open Interest:</th>

            <td class="yfnc_tabledata1">11,579</td>
          </tr>
        </table>
      </div>
    </div>
  </div>
</div>
