Unable to crawl some href in a webpage using Python and BeautifulSoup

Problem description

I am currently crawling a web page with Python 3.4 and bs4 to collect the results of the matches Serbia played at Rio2016. The team page (the URL requested in the code further down) contains links to all the match results, for example the Serbia vs. Italy match.

I then found that the link appears in the HTML source like this:

<a href="/en/volleyball/women/7168-serbia-italy/post" ng-href="/en/volleyball/women/7168-serbia-italy/post">
    <span class="score ng-binding">3 - 0</span>
</a>

But after several attempts, this href="/en/volleyball/women/7168-serbia-italy/post" never shows up. I then ran the following code to get every href from the URL:

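from bs4 import BeautifulSoup
import requests

# Fetch the team page; the response is the HTML as sent by the server, before any JavaScript runs.
Countryr = requests.get('http://rio2016.fivb.com/en/volleyball/women/teams/srb-serbia#wcbody_0_wcgridpadgridpad1_1_wcmenucontent_3_Schedule')
# Name the parser explicitly so bs4 does not have to guess one.
countrySoup = BeautifulSoup(Countryr.text, 'html.parser')

# Print the href of every anchor found in that static HTML.
for link in countrySoup.find_all('a'):
    print(link.get('href'))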

Then a strange thing happened: href="/en/volleyball/women/7168-serbia-italy/post" is not included in the output at all.

I found that this href sits in one of the page's tabs, the one opened by href="#scheduldedOver", and that the tabs are controlled by the following HTML:

<nav class="tabnav">
    <a href="#schedulded" ng-class="{selected: chosenStatus == 'Pre' }" ng-click="setStatus('Pre')" ng-href="#schedulded">Scheduled</a>
    <a href="#scheduldedLive" ng-class="{selected: chosenStatus == 'Live' }" ng-click="setStatus('Live')" ng-href="#scheduldedLive">Live</a>
    <a href="#scheduldedOver" class="selected" ng-class="{selected: chosenStatus == 'Over' }" ng-click="setStatus('Over')" ng-href="#scheduldedOver">Complete</a>
</nav>

So how should I get the href inside a tab page using BeautifulSoup?

Recommended answer

The data is created dynamically. If you look at the actual page source (not the DOM shown in the browser inspector), you can see AngularJS templating instead of the rendered links.
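To illustrate why the link is missing, here is a minimal offline sketch of what a client that does not run JavaScript sees. The markup in RAW_TEMPLATE is hypothetical: the ng-repeat expression and the {{ match.Url }} placeholder are assumptions about how such a template might look, not the site's real source. It uses the standard library's HTMLParser so it runs without network access or extra installs; BeautifulSoup would report the same attributes.

```python
from html.parser import HTMLParser

# Hypothetical raw markup, as a server might send it before AngularJS runs.
# The ng-repeat expression and the {{ ... }} placeholders are assumptions.
RAW_TEMPLATE = """
<nav class="tabnav">
    <a href="#scheduldedOver" ng-click="setStatus('Over')">Complete</a>
</nav>
<div ng-repeat="match in matches">
    <a ng-href="{{ match.Url }}">
        <span class="score ng-binding">{{ match.MatchPointsA }} - {{ match.MatchPointsB }}</span>
    </a>
</div>
"""


class AnchorCollector(HTMLParser):
    """Collect the attributes of every <a> tag as a dict."""

    def __init__(self):
        super().__init__()
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchors.append(dict(attrs))


collector = AnchorCollector()
collector.feed(RAW_TEMPLATE)

# The tab link has a real href; the match link only has an unrendered placeholder.
hrefs = [a.get("href") for a in collector.anchors]
ng_hrefs = [a.get("ng-href") for a in collector.anchors]
print(hrefs)
print(ng_hrefs)
```

In this sketch the match anchor has no href at all until AngularJS fills it in inside a browser, which is consistent with the match link never appearing in the output of find_all('a').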

You can still get all the info in JSON format by mimicking the AJAX call the page makes. In the source you can also see a div like:

<div id="AngularPanel" class="main-wrapper" ng-app="fivb"
data-servicematchcenterbar="/en/api/volley/matches/341/en/user/lives"
data-serviceteammatches="/en/api/volley/matches/WOG2016/en/user/team/3017"
data-servicelabels="/en/api/labels/Volley/en" 
data-servicelive="/en/api/volley/matches/WOG2016/en/user/live/">

Requesting the path stored in the data-serviceteammatches attribute will give you all the info for the team's matches:

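from urllib.parse import urljoin  # Python 3; on Python 2 use "from urlparse import urljoin"

from bs4 import BeautifulSoup
import requests

r = requests.get('http://rio2016.fivb.com/en/volleyball/women/teams/srb-serbia#wcbody_0_wcgridpadgridpad1_1_wcmenucontent_3_Schedule')
soup = BeautifulSoup(r.content, 'html.parser')

base = "http://rio2016.fivb.com/"

# The #AngularPanel div carries the API paths in its data-* attributes.
api_path = soup.select_one("#AngularPanel")["data-serviceteammatches"]
# Request the team-matches endpoint and decode the JSON body.
matches = requests.get(urljoin(base, api_path)).json()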

The JSON response contains entries like:

{"Id": 7168, "MatchNumber": "006", "TournamentCode": "WOG2016", "TournamentName": "Women's Olympic Games 2016",
        "TournamentGroupName": "", "Gender": "", "LocalDateTime": "2016-08-06T22:35:00",
        "UtcDateTime": "2016-08-07T01:35:00+00:00", "CalculatedMatchDate": "2016-08-07T03:35:00+02:00",
        "CalculatedMatchDateType": "user", "LocalDateTimeText": "August 06 2016",
        "Pool": {"Code": "B", "Name": "Pool B", "Url": "/en/volleyball/women/results and ranking/round1#anchorB"},
        "Round": 68,
        "Location": {"Arena": "Maracanãzinho", "City": "Maracanãzinho", "CityUrl": "", "Country": "Brazil"},
        "TeamA": {"Code": "SRB", "Name": "Serbia", "Url": "/en/volleyball/women/teams/srb-serbia",
                  "FlagUrl": "/~/media/flags/flag_SRB.png?h=60&w=60"},
        "TeamB": {"Code": "ITA", "Name": "Italy", "Url": "/en/volleyball/women/teams/ita-italy",
                  "FlagUrl": "/~/media/flags/flag_ITA.png?h=60&w=60"},
        "Url": "/en/volleyball/women/7168-serbia-italy/post", "TicketUrl": "", "Status": "Over", "MatchPointsA": 3,
        "MatchPointsB": 0, "Sets": [{"Number": 1, "PointsA": 27, "PointsB": 25, "Hours": 0, "Minutes": "28"},
                                    {"Number": 2, "PointsA": 25, "PointsB": 20, "Hours": 0, "Minutes": "25"},
                                    {"Number": 3, "PointsA": 25, "PointsB": 23, "Hours": 0, "Minutes": "27"}],
        "PoolRoundName": "Preliminary Round", "DayInfo": "Weekend Day",
        "WeekInfo": {"Number": 31, "Start": 7, "End": 13}, "LiveStreamUri": ""},

You can parse whatever you need from those entries.
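As a rough sketch of that parsing step, the snippet below pulls the match URL and score out of entries shaped like the output above and keeps only finished matches, which corresponds to the "Complete" tab. It assumes the response is a JSON list of such entries. SAMPLE is a trimmed copy of the Serbia vs. Italy entry plus one clearly hypothetical placeholder entry added only to show the filter; the helper name completed_matches is also made up for illustration.

```python
import json
from urllib.parse import urljoin

BASE = "http://rio2016.fivb.com/"

# First entry: trimmed from the output shown above.
# Second entry: a hypothetical placeholder (not real data) to exercise the filter.
SAMPLE = """
[
  {"Id": 7168, "Url": "/en/volleyball/women/7168-serbia-italy/post",
   "Status": "Over", "MatchPointsA": 3, "MatchPointsB": 0,
   "TeamA": {"Code": "SRB", "Name": "Serbia"},
   "TeamB": {"Code": "ITA", "Name": "Italy"}},
  {"Id": 0, "Url": "/en/placeholder",
   "Status": "Pre", "MatchPointsA": 0, "MatchPointsB": 0,
   "TeamA": {"Code": "AAA", "Name": "Placeholder A"},
   "TeamB": {"Code": "BBB", "Name": "Placeholder B"}}
]
"""


def completed_matches(matches, base=BASE):
    """Return (absolute_url, score) for every match whose Status is "Over"."""
    results = []
    for match in matches:
        if match.get("Status") != "Over":
            continue  # skip scheduled and live matches
        score = "{} - {}".format(match["MatchPointsA"], match["MatchPointsB"])
        results.append((urljoin(base, match["Url"]), score))
    return results


results = completed_matches(json.loads(SAMPLE))
for url, score in results:
    print(url, score)
```

With a live response, the list returned by .json() on the team-matches request could be passed to completed_matches in place of json.loads(SAMPLE), provided the endpoint still returns the same fields.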
