Scraping based on date with BeautifulSoup

Problem Description

I am very new to Python programming. Emphasis on VERY. I am trying to set up my first web scraping project (for news article curation).

I have already managed to scrape the news site and to create a loop that organizes the results how I want them. My issue is that I plan on scraping the web page once a day, but only for the publications that were published that same day. I don't want all of them because that would mean I would get a lot of repetition.

I know that it has something to do with converting the date via the datetime module (with an if statement), but for the life of me I couldn't find a way to make it work.

In the HTML, this is an example of how the date is displayed:

<time datetime="2019-02-24T10:30:46+00:00">Feb 24, 2019 at 10:30</time>

This is what I have so far:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
from datetime import datetime

my_url = "https://www.coindesk.com/category/business-news/legal"

# Opening up the website, grabbing the page
uFeedOne = uReq(my_url, timeout=5)
page_one = uFeedOne.read()
uFeedOne.close()

# html parser
page_soup1 = soup(page_one, "html.parser")

# grabs each publication block
containers = page_soup1.findAll("a", {"class": "stream-article"} )

for container in containers:
  link = container.attrs['href']
  publication_date = "published on " + container.time.text
  title = container.h3.text
  description = "(CoinDesk)-- " +  container.p.text

  print("link: " + link)
  print("publication_date: " + publication_date)
  print("title: " + title)
  print("description: " + description)  

Answer

Your time tag has a datetime attribute that is giving a much better datetime representation than the text. Use that.
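
For example, the same time tag exposes both forms (a minimal sketch, assuming container is one of the a.stream-article elements found earlier):

# The visible text is for humans; the datetime attribute is a machine-readable ISO 8601 string.
print(container.time.text)           # e.g. "Feb 24, 2019 at 10:30"
print(container.time['datetime'])    # e.g. "2019-02-24T10:30:46+00:00"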

You can use the dateutil package to parse the string. Following is a sample code:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
from datetime import datetime, timedelta
from dateutil import parser
import pytz

my_url = "https://www.coindesk.com/category/business-news/legal"

# Opening up the website, grabbing the page
uFeedOne = uReq(my_url, timeout=5)
page_one = uFeedOne.read()
uFeedOne.close()

# html parser
page_soup1 = soup(page_one, "html.parser")

# grabs each publication block
containers = page_soup1.findAll("a", {"class": "stream-article"} )

for container in containers:
  ## Get today's date.
  ## I have taken an offset of a few days, as the site has articles older than today.
  today = datetime.now() - timedelta(days=5)
  link = container.attrs['href']

  ## The actual datetime string is in the datetime attribute of the time tag.
  date_time = container.time['datetime']

  ## we will use the dateutil package to parse the ISO-formatted date.
  date = parser.parse(date_time)

  ## This date is UTC localised but the datetime.now() gives a "naive" date
  ## So we have to localize before comparison
  utc=pytz.UTC
  today = utc.localize(today)

  ## simple comparison
  if date >= today:
      print("article date", date)
      print("yesterday", today," \n")
      publication_date = "published on " + container.time.text
      title = container.h3.text
      description = "(CoinDesk)-- " +  container.p.text

      print("link: " + link)
      print("publication_date: " + publication_date)
      print("title: ", title)
      print("description: " + description)
