Displaying contents of web scrape


Problem description


The code below displays all the fields on the screen. Is there a way I could get the fields alongside each other, as they would appear in a database or a spreadsheet? In the source markup the fields track, date, datetime, grade, distance and prizes are found in the resultsBlockHeader div class, and Fin (finishing position), Greyhound, Trap, SP, timeSec and timeDistance are found in the resultsBlock div. I am trying to get them displayed like this: track, date, datetime, grade, distance, prizes, fin, greyhound, trap, sp, timeSec, timeDistance, all on one line. Any help appreciated.

from urllib import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.gbgb.org.uk/resultsMeeting.aspx?id=135754")
bsObj = BeautifulSoup(html, 'lxml')

nameList = bsObj.findAll("div", {"class": "track"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("div", {"class": "date"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("div", {"class": "datetime"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("div", {"class": "grade"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("div", {"class": "distance"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("div", {"class": "prizes"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("li", {"class": "first essential fin"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("li", {"class": "essential greyhound"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("li", {"class": "trap"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("li", {"class": "sp"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("li", {"class": "timeSec"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("li", {"class": "timeDistance"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("li", {"class": "essential trainer"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("li", {"class": "first essential comment"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("div", {"class": "resultsBlockFooter"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("li", {"class": "first essential"})
for name in nameList:
    print(name.get_text())

Solution

First of all, make sure you are not violating the website's Terms of Use - stay on the legal side.

The markup is not very easy to scrape, but what I would do is iterate over the race headers and, for every header, get the desired information about the race. Then get the sibling results block and extract the rows. Sample code to get you started, extracting the track and the greyhound:

from pprint import pprint
from urllib2 import urlopen

from bs4 import BeautifulSoup


html = urlopen("http://www.gbgb.org.uk/resultsMeeting.aspx?id=135754")
soup = BeautifulSoup(html, 'lxml')

rows = []
for header in soup.find_all("div", class_="resultsBlockHeader"):
    track = header.find("div", class_="track").get_text(strip=True)

    results = header.find_next_sibling("div", class_="resultsBlock").find_all("ul", class_="line1")
    for result in results:
        greyhound = result.find("li", class_="greyhound").get_text(strip=True)

        rows.append({
            "track": track,
            "greyhound": greyhound
        })

pprint(rows)
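Since the goal was spreadsheet-like output, one way to finish the job is to feed the collected rows to csv.DictWriter. A minimal sketch, using made-up row values and only the two keys gathered above; extend fieldnames as you collect more fields:

```python
import csv
import io

# Hypothetical rows in the shape produced by the scraper above
rows = [
    {"track": "Monmore", "greyhound": "Swift Lad"},
    {"track": "Monmore", "greyhound": "Fast Dancer"},
]

# Write one header line, then one comma-separated line per row
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["track", "greyhound"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing to a real file instead of a StringIO buffer works the same way, and the resulting .csv opens directly in a spreadsheet.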

Note that every row you see in the tables is actually represented by 3 lines in the markup:

<ul class="contents line1">
   ...
</ul>
<ul class="contents line2">
   ...
</ul>
<ul class="contents line3">
   ...
</ul>

The greyhound value is inside the first ul (the one with the line1 class); to reach the line2 and line3 fields you may need result.find_next_sibling("ul", class_="line2") and result.find_next_sibling("ul", class_="line3").
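To illustrate that sibling lookup, here is a self-contained sketch against hypothetical markup modeled on the three-ul structure described above (class names like trainer and comment, and all values, are invented for the example):

```python
from bs4 import BeautifulSoup

# Toy markup: one table row represented as three sibling uls
html = """
<div class="resultsBlock">
  <ul class="contents line1"><li class="greyhound">Swift Lad</li></ul>
  <ul class="contents line2"><li class="trainer">J. Smith</li></ul>
  <ul class="contents line3"><li class="comment">Led early</li></ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for line1 in soup.find_all("ul", class_="line1"):
    # The two other uls for this row are following siblings of line1
    line2 = line1.find_next_sibling("ul", class_="line2")
    line3 = line1.find_next_sibling("ul", class_="line3")
    rows.append({
        "greyhound": line1.find("li", class_="greyhound").get_text(strip=True),
        "trainer": line2.find("li", class_="trainer").get_text(strip=True),
        "comment": line3.find("li", class_="comment").get_text(strip=True),
    })
print(rows)
```

Each dict merges fields from all three sibling uls into one row, which is the one-line-per-result shape the question asked for.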
