How to get only the first class's data between two identical classes
Question
On the https://www.hltv.org/matches page, matches are divided by date, but the classes are the same. I mean,
This is today's match class.
<div class="match-day"><div class="standard-headline">2018-05-01</div>
This is tomorrow's match class.
<div class="match-day"><div class="standard-headline">2018-05-02</div>
What I'm trying to do is get the links under the "standard-headline" class, but only for today's matches. In other words, I want only the first block.
Here is my code.
import urllib.request
from bs4 import BeautifulSoup
headers = {}  # Headers describe the client, e.g. operating system and browser.
headers['User-Agent'] = 'Mozilla/5.0'  # Set a user agent because HLTV otherwise treats the connection as a bot.
hltv = urllib.request.Request('https://www.hltv.org/matches', headers=headers)  # Build the request for the page.
session = urllib.request.urlopen(hltv)
sauce = session.read()  # Get the page source.
soup = BeautifulSoup(sauce, 'lxml')
matchlinks = []
# Collect the match pages' links.
for section in soup.find_all('div', class_='upcoming-matches'):  # Each "upcoming-matches" block in the source.
    for link in section.find_all('a'):  # Search inside this block, not the whole soup.
        clearlink = link.get('href')  # Value of the href attribute (None if the tag has none).
        if clearlink and clearlink.startswith('/matches/'):  # Keep only links that start with "/matches/".
            matchlinks.append('https://hltv.org' + clearlink)  # Add the absolute URL to the list.
Recommended answer
Actually, the website shows today's matches first (at the top), followed by the next days'. So, if you want today's matches, you can simply use find(), which returns the first matching element.
Using this will give you what you want:
today = soup.find('div', class_='match-day')
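To show how the result of find() can then be narrowed to today's links only, here is a minimal sketch. It runs against a small hand-written HTML sample rather than the live site; the sample markup, the helper name links_for_first_day, and the use of the built-in 'html.parser' are assumptions made for illustration, not something taken from the HLTV page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modelled on the snippets in the question.
SAMPLE_HTML = """
<div class="upcoming-matches">
  <div class="match-day"><span class="standard-headline">2018-05-01</span>
    <a href="/matches/1/team-a-vs-team-b">A vs B</a>
    <a href="/events/9/some-event">Event</a>
  </div>
  <div class="match-day"><span class="standard-headline">2018-05-02</span>
    <a href="/matches/2/team-c-vs-team-d">C vs D</a>
  </div>
</div>
"""


def links_for_first_day(html):
    """Return absolute match links found inside the first match-day block."""
    soup = BeautifulSoup(html, "html.parser")
    today = soup.find("div", class_="match-day")  # first block only
    if today is None:  # no match-day block on the page
        return []
    return [
        "https://hltv.org" + a["href"]
        for a in today.find_all("a", href=True)  # skip <a> tags without href
        if a["href"].startswith("/matches/")
    ]


first_day_links = links_for_first_day(SAMPLE_HTML)
print(first_day_links)
```

The key point is that find_all('a') is called on the today tag, not on soup, so links from later days are never visited.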
But if you want to specify the date explicitly, you can find the tag containing today's date by passing text='2018-05-02' as a parameter to the find() method. Note that in the page source the tag is <span class="standard-headline">2018-05-02</span> and not a <div> tag. After getting this tag, use .parent to get the enclosing <div class="match-day"> tag.
today = soup.find('span', text='2018-05-02').parent
Again, if you want to make the solution more generic, you can use datetime.date.today() instead of the hard-coded date.
today = soup.find('span', text=datetime.date.today()).parent
You have to import the datetime module for this.
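The date-based lookup can fail when no headline matches (for example, if there are no matches on that day), in which case find() returns None and .parent raises an AttributeError. The sketch below guards against that. The sample HTML and the helper name match_day_for are hypothetical; string= is the newer name for the text= argument in Beautiful Soup 4.4.0 and later, and isoformat() is used to make the YYYY-MM-DD conversion explicit.

```python
import datetime

from bs4 import BeautifulSoup

# Hypothetical markup: the date sits in a <span>, as the answer points out.
SAMPLE_HTML = """
<div class="match-day"><span class="standard-headline">2018-05-01</span>
  <a href="/matches/1/team-a-vs-team-b">A vs B</a>
</div>
<div class="match-day"><span class="standard-headline">2018-05-02</span>
  <a href="/matches/2/team-c-vs-team-d">C vs D</a>
</div>
"""


def match_day_for(soup, day):
    """Return the match-day <div> whose headline equals `day`, or None."""
    headline = soup.find(
        "span", class_="standard-headline", string=day.isoformat()
    )
    return headline.parent if headline is not None else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
block = match_day_for(soup, datetime.date(2018, 5, 2))
hrefs = [a["href"] for a in block.find_all("a", href=True)]
missing = match_day_for(soup, datetime.date(2018, 5, 3))
print(hrefs, missing)
```

On the live page you would pass datetime.date.today() as the second argument; keep in mind that the local date on your machine may differ from the date the site displays.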