Extract date from multiple webpages with Python


Question

I want to extract the date on which a news article was published on a website. For some websites I know the exact HTML element that holds the date/time (div, p, time), but for others I do not:

These are links to some of the websites (German-language sites):

(3 Nov 2020) http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226

(Dec. 1, 2020) http://www.reutigen.ch/de/aktuelles/aktuellesinformationen/welcome.php?action=showinfo&info_id=1066837&ls=0&sq=&kategorie_id=&date_from=&date_to=

(10/22/2020) http://buchholterberg.ch/de/Gemeinde/Information/News/Newsmeldung?filterCategory=22&newsid=905

I have tried three different approaches with Python libraries (requests, htmldate and date_guesser), but I always get None or, in the case of htmldate, the same date every time (2020.1.1).

from bs4 import BeautifulSoup
import requests
from htmldate import find_date
from date_guesser import guess_date, Accuracy

# Lib find_date
url = "http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226"
response = requests.get(url)
my_date = find_date(response.content, extensive_search=True)
print(my_date, '\n')


# Lib guess_date
url = "http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226"
my_date = guess_date(url=url, html=requests.get(url).text)
print(my_date.date, '\n')


# Lib Requests # I DO NOT GET last modified TAG
my_date = requests.head('http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226')
print(my_date.headers, '\n')

Am I doing something wrong?

Can you please tell me whether there is a way to extract the publication date from websites like these, where I do not have a specific div, p or datetime element to target?

IMPORTANT! I want the date extraction to be universal, so that I can put these links in a for loop and run the same function on each of them.

Recommended answer

I have never had much success with some of the date parsing libraries, so I usually go another route. I believe that the best method to extract the date strings from these sites in your question is with regular expressions.

Website: linden.ch

import requests
import re as regex
from bs4 import BeautifulSoup
from datetime import datetime

url = "http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
page_body = soup.find('body')
find_date = regex.search(r'(Datum der Neuigkeit)\s(\d{1,2}\W\s\w+\W\s\d{4})', str(page_body))
reformatted_timestamp = datetime.strptime(find_date.groups()[1], '%d. %b. %Y').strftime('%d-%m-%Y')
print(reformatted_timestamp)
# print output
# 03-11-2020
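
One caveat with the `%b` directive used above: it matches month abbreviations for the active locale, so under an English locale it handles "Nov." but not German abbreviations such as "Dez." or "Okt.". A minimal, locale-independent sketch follows, assuming dates of the form "3. Nov. 2020". The `GERMAN_MONTHS` table and the `parse_german_date` helper are illustrative names, not part of any library.

```python
import re
from datetime import date

# Map the first three letters of a German month name to its month number.
# "mär" and "mrz" are both used for März; "mae" covers the spelling "Maerz".
GERMAN_MONTHS = {
    'jan': 1, 'feb': 2, 'mär': 3, 'mrz': 3, 'mae': 3, 'apr': 4,
    'mai': 5, 'jun': 6, 'jul': 7, 'aug': 8, 'sep': 9,
    'okt': 10, 'nov': 11, 'dez': 12,
}

# day, dot, month name (optionally followed by a dot), four-digit year
DATE_PATTERN = re.compile(r'(\d{1,2})\.\s*([A-Za-zäÄ]+)\.?\s*(\d{4})')


def parse_german_date(text):
    """Return a datetime.date for text like '3. Nov. 2020', or None."""
    match = DATE_PATTERN.search(text)
    if match is None:
        return None
    day, month_name, year = match.groups()
    month = GERMAN_MONTHS.get(month_name[:3].lower())
    if month is None:
        return None
    try:
        return date(int(year), month, int(day))
    except ValueError:
        # The day does not exist in that month, e.g. "31. Feb. 2020"
        return None
```

Calling `parse_german_date('1. Dez. 2020').strftime('%d-%m-%Y')` then yields the same `dd-mm-yyyy` layout as the snippet above, without depending on the machine's locale settings.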

Website: buchholterberg.ch

import requests
import re as regex
from bs4 import BeautifulSoup
from datetime import datetime

url = "http://buchholterberg.ch/de/Gemeinde/Information/News/Newsmeldung?filterCategory=22&newsid=905"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
page_body = soup.find('body')
find_date = regex.search(r'(Veröffentlicht)\s\w+:\s(\d{1,2}:\d{1,2}:\d{1,2})\s(\d{1,2}\.\d{1,2}\.\d{4})', str(page_body))
reformatted_timestamp = datetime.strptime(find_date.groups()[2], '%d.%m.%Y').strftime('%d-%m-%Y')
print(reformatted_timestamp)
# print output
# 22-10-2020

Update 12-04-2020

I looked at the source code of the two Python libraries you mentioned, htmldate and date_guesser. Neither of them can currently extract the date from the 3 sources listed in your question. The primary reason is the date formats and the language (German) of these target sites.

I had some free time, so I put this together for you. The answer below can be modified to extract from other websites and refined as needed based on the format of your target sources. It currently extracts the date from every link in the urls set.

All URLs

import requests
import re as regex
from bs4 import BeautifulSoup

def extract_date(can_of_soup):
    """Return the publication date string found in the page, or None."""
    page_body = can_of_soup.find('body')
    if page_body is None:
        return None
    clean_body = str(page_body).replace('\n', '')
    if 'Datum der Neuigkeit' in clean_body or 'Veröffentlicht' in clean_body:
        # Labelled dates: "Datum der Neuigkeit 3. Nov. 2020" or
        # "Veröffentlicht am: HH:MM:SS DD.MM.YYYY"
        date_formats = (r'(Datum der Neuigkeit)\s(\d{1,2}\W\s\w+\W\s\d{4})'
                        r'|(Veröffentlicht am: \d{2}:\d{2}:\d{2} )(\d{1,2}\.\d{1,2}\.\d{4})')
        find_date = regex.search(date_formats, clean_body, regex.IGNORECASE)
        if find_date:
            # Only one alternative matches; drop the empty groups of the other
            clean_tuples = [i for i in find_date.groups() if i]
            return clean_tuples[1]
    else:
        # No label: look inside the div classes these sites use for the date
        tags = ['extra', 'elementStandard elementText', 'icms-block icms-information-date icms-text-gemeinde-color']
        for tag in tags:
            date_tag = page_body.find('div', {'class': tag})
            if date_tag is not None:
                children = date_tag.findChildren()
                if children:
                    find_date = regex.search(r'(\d{1,2}\.\d{1,2}\.\d{4})', str(children))
                    if find_date:
                        return find_date.group(1)
                else:
                    return ''.join(date_tag.contents)
    return None


def get_soup(target_url):
   response = requests.get(target_url)
   soup = BeautifulSoup(response.content, 'html.parser')
   return soup


urls = {'http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226',
    'http://www.reutigen.ch/de/aktuelles/aktuellesinformationen/welcome.php?action=showinfo&info_id=1066837&ls=0'
    '&sq=&kategorie_id=&date_from=&date_to=',
    'http://buchholterberg.ch/de/Gemeinde/Information/News/Newsmeldung?filterCategory=22&newsid=905',
    'https://www.steffisburg.ch/de/aktuelles/meldungen/Hochwasserschutz-und-Laengsvernetzung-Zulg.php',
    'https://www.wallisellen.ch/aktuellesinformationen/924227',
    'http://www.winkel.ch/de/aktuellesre/aktuelles/aktuellesinformationen/welcome.php?action=showinfo&info_id'
    '=1093910&ls=0&sq=&kategorie_id=&date_from=&date_to=',
    'https://www.aeschi.ch/de/aktuelles/mitteilungen/artikel/?tx_news_pi1%5Bnews%5D=87&tx_news_pi1%5Bcontroller%5D=News&tx_news_pi1%5Baction%5D=detail&cHash=ab4d329e2f1529d6e3343094b416baed'}


for url in urls:
   html = get_soup(url)
   article_date = extract_date(html)
   print(article_date)
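
Because the loop makes one network request per URL, a single timeout or connection error would stop the whole run. A hedged sketch of a more forgiving loop is shown here. `collect_dates` is a hypothetical helper, and its `fetch` and `extract` parameters stand in for `get_soup` and `extract_date` above; they are passed in as arguments so the sketch stays self-contained.

```python
def collect_dates(urls, fetch, extract):
    """Map each URL to its extracted date, or to None if a step failed."""
    results = {}
    for url in urls:
        try:
            results[url] = extract(fetch(url))
        except Exception as error:
            # Broad on purpose: covers network, parsing and regex failures
            print(f'Skipping {url}: {error}')
            results[url] = None
    return results
```

With the functions from the answer this would be called as `collect_dates(urls, get_soup, extract_date)`. Passing a `timeout` argument to `requests.get` inside `get_soup` keeps a slow server from blocking the loop indefinitely.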
