Extract date from multiple webpages with Python
Question
I want to extract the date on which a news article was published on a website. For some websites I know the exact HTML element that holds the date/time (div, p, time), but for others I do not.
These are links to some of the websites (German-language sites):
(3 Nov 2020) http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226
(Dec. 1, 2020) http://www.reutigen.ch/de/aktuelles/aktuellesinformationen/welcome.php?action=showinfo&info_id=1066837&ls=0&sq=&kategorie_id=&date_from=&date_to=
(10/22/2020) http://buchholterberg.ch/de/Gemeinde/Information/News/Newsmeldung?filterCategory=22&newsid=905
I have tried three different approaches with the Python libraries requests, htmldate and date_guesser, but I always get None or, in the case of htmldate, always the same date (2020.1.1).
from bs4 import BeautifulSoup
import requests
from htmldate import find_date
from date_guesser import guess_date, Accuracy
# Lib find_date
url = "http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226"
response = requests.get(url)
my_date = find_date(response.content, extensive_search=True)
print(my_date, '\n')
# Lib guess_date
url = "http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226"
my_date = guess_date(url=url, html=requests.get(url).text)
print(my_date.date, '\n')
# Lib requests: the response headers do not contain a Last-Modified entry
my_date = requests.head('http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226')
print(my_date.headers, '\n')
Am I doing something wrong?
Is there a way to extract the publication date from websites like these, where I do not have specific div, p, or datetime elements?
IMPORTANT! I want a universal date extraction, so that I can put these links in a for loop and run the same function on each of them.
Recommended answer
I have never had much success with some of the date parsing libraries, so I usually go another route. I believe the best way to extract the date strings from the sites in your question is with regular expressions.
Website: linden.ch
import requests
import re as regex
from bs4 import BeautifulSoup
from datetime import datetime
url = "http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
page_body = soup.find('body')
find_date = regex.search(r'(Datum der Neuigkeit)\s(\d{1,2}\W\s\w+\W\s\d{4})', str(page_body))
reformatted_timestamp = datetime.strptime(find_date.groups()[1], '%d. %b. %Y').strftime('%d-%m-%Y')
print(reformatted_timestamp)
# print output
03-11-2020
Website: buchholterberg.ch
import requests
import re as regex
from bs4 import BeautifulSoup
from datetime import datetime
url = "http://buchholterberg.ch/de/Gemeinde/Information/News/Newsmeldung?filterCategory=22&newsid=905"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
page_body = soup.find('body')
find_date = regex.search(r'(Veröffentlicht)\s\w+:\s(\d{1,2}:\d{1,2}:\d{1,2})\s(\d{1,2}\.\d{1,2}\.\d{4})', str(page_body))
reformatted_timestamp = datetime.strptime(find_date.groups()[2], '%d.%m.%Y').strftime('%d-%m-%Y')
print(reformatted_timestamp)
# print output
22-10-2020
Update 12-04-2020
I looked at the source code of the two Python libraries you mentioned, htmldate and date_guesser. Neither library can currently extract the date from the 3 sources listed in your question. The main reason is the date formats and the language (German) of these target sites.
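The language issue also affects datetime.strptime: the %b directive follows the active locale, so German abbreviations such as "Dez." or "Okt." may fail to parse under an English locale. Below is a minimal sketch of a locale-independent parser, assuming day-month-year order and German month names. GERMAN_MONTHS, GERMAN_DATE and parse_german_date are hypothetical names introduced here for illustration; they are not part of htmldate or date_guesser.

```python
import re
from datetime import date

# Assumed lookup table: first three letters of a German month name -> month number.
GERMAN_MONTHS = {
    'jan': 1, 'feb': 2, 'mär': 3, 'mar': 3, 'mrz': 3, 'apr': 4,
    'mai': 5, 'jun': 6, 'jul': 7, 'aug': 8, 'sep': 9, 'okt': 10,
    'nov': 11, 'dez': 12,
}

# Day, optional dot, month name (abbreviated or full), optional dot, 4-digit year.
GERMAN_DATE = re.compile(r'(\d{1,2})\.?\s+([A-Za-zÄÖÜäöü]+)\.?\s+(\d{4})')


def parse_german_date(text):
    """Return a datetime.date for text such as '3. Nov. 2020', or None."""
    match = GERMAN_DATE.search(text)
    if match is None:
        return None
    day, month_name, year = match.groups()
    month = GERMAN_MONTHS.get(month_name.lower()[:3])
    if month is None:
        return None
    try:
        return date(int(year), month, int(day))
    except ValueError:
        # Impossible calendar dates, e.g. '31. Feb. 2020'
        return None


print(parse_german_date('Datum der Neuigkeit 3. Nov. 2020'))
print(parse_german_date('1. Dezember 2020'))
```

Returning a date object instead of a string leaves the output format (strftime or isoformat) to the caller.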
I had some free time, so I put this together for you. The answer below can easily be modified to extract from other websites and can be refined as needed based on the format of your target sources. It currently extracts the date from every link contained in urls.
All URLs
import requests
import re as regex
from bs4 import BeautifulSoup


def extract_date(can_of_soup):
    """Return the publication date string found in the page, or None."""
    page_body = can_of_soup.find('body')
    clean_body = str(page_body).replace('\n', '')
    if 'Datum der Neuigkeit' in clean_body or 'Veröffentlicht' in clean_body:
        # Label-based formats: "Datum der Neuigkeit 3. Nov. 2020" or
        # "Veröffentlicht am: HH:MM:SS DD.MM.YYYY"
        date_formats = (r'(Datum der Neuigkeit)\s(\d{1,2}\W\s\w+\W\s\d{4})'
                        r'|(Veröffentlicht am: \d{2}:\d{2}:\d{2} )(\d{1,2}\.\d{1,2}\.\d{4})')
        find_date = regex.search(date_formats, clean_body, regex.IGNORECASE)
        if find_date:
            # Drop the empty groups of the alternative that did not match;
            # the date is the second remaining group
            clean_tuples = [i for i in find_date.groups() if i]
            return clean_tuples[1]
    else:
        # Fall back to div classes that hold the date on the other sites
        tags = ['extra', 'elementStandard elementText', 'icms-block icms-information-date icms-text-gemeinde-color']
        for tag in tags:
            date_tag = page_body.find('div', {'class': tag})
            if date_tag is not None:
                children = date_tag.findChildren()
                if children:
                    find_date = regex.search(r'(\d{1,2}\.\d{1,2}\.\d{4})', str(children))
                    if find_date:
                        return find_date.group(1)
                else:
                    return ''.join(date_tag.contents)
    return None


def get_soup(target_url):
    response = requests.get(target_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    return soup


# A list (not a set) so the results print in the same order as the links
urls = ['http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226',
        'http://www.reutigen.ch/de/aktuelles/aktuellesinformationen/welcome.php?action=showinfo&info_id=1066837&ls=0'
        '&sq=&kategorie_id=&date_from=&date_to=',
        'http://buchholterberg.ch/de/Gemeinde/Information/News/Newsmeldung?filterCategory=22&newsid=905',
        'https://www.steffisburg.ch/de/aktuelles/meldungen/Hochwasserschutz-und-Laengsvernetzung-Zulg.php',
        'https://www.wallisellen.ch/aktuellesinformationen/924227',
        'http://www.winkel.ch/de/aktuellesre/aktuelles/aktuellesinformationen/welcome.php?action=showinfo&info_id'
        '=1093910&ls=0&sq=&kategorie_id=&date_from=&date_to=',
        'https://www.aeschi.ch/de/aktuelles/mitteilungen/artikel/?tx_news_pi1%5Bnews%5D=87&tx_news_pi1%5Bcontroller%5D=News&tx_news_pi1%5Baction%5D=detail&cHash=ab4d329e2f1529d6e3343094b416baed']

for url in urls:
    html = get_soup(url)
    article_date = extract_date(html)
    print(article_date)
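One possible way to make extract_date easier to refine for new sources is to keep the regular expressions in an ordered list and return the first match. The sketch below uses only the standard library and works on plain text. DATE_PATTERNS and first_date_string are hypothetical names, and the sample strings are assumptions about what the pages contain, not copies of the real markup.

```python
import re

# Hypothetical pattern table, tried in order. Each pattern has exactly one
# capture group that holds the date text. Add new formats as new entries.
DATE_PATTERNS = [
    # Labelled date with a month name or number, e.g. "Datum der Neuigkeit 3. Nov. 2020"
    re.compile(r'Datum der Neuigkeit\s+(\d{1,2}\.\s*\w+\.?\s*\d{4})', re.IGNORECASE),
    # Labelled date with an optional time, e.g. "Veröffentlicht am: 10:15:00 22.10.2020"
    re.compile(r'Veröffentlicht am:\s*(?:\d{2}:\d{2}:\d{2}\s+)?(\d{1,2}\.\d{1,2}\.\d{4})',
               re.IGNORECASE),
    # Generic fallback: the first numeric DD.MM.YYYY date anywhere in the text
    re.compile(r'\b(\d{1,2}\.\d{1,2}\.\d{4})\b'),
]


def first_date_string(text):
    """Return the date text captured by the first matching pattern, or None."""
    for pattern in DATE_PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group(1)
    return None


samples = [
    '<p>Datum der Neuigkeit 3. Nov. 2020</p>',
    'Veröffentlicht am: 10:15:00 22.10.2020',
    '<div class="extra">01.12.2020</div>',
]
for sample in samples:
    print(first_date_string(sample))
```

The generic fallback can pick up an unrelated date (an event date, for example), so the more specific labelled patterns are listed first.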