Limited number of scraped data?
Question
I am scraping a website, and everything seems to work fine from today's news back to news published in 2015/2016. For news older than that, I am no longer able to scrape anything. Could you please tell me if anything has changed? I should get 672 pages, collecting titles and snippets from this page:
https://catania.liveuniversity.it/attualita/
but I only get approx. 158.
The code I am using is:
import bs4, requests
import pandas as pd
import re
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
page_num=1
website="https://catania.liveuniversity.it/attualita/"
while True:
    r = requests.get(website, headers=headers)
    soup = bs4.BeautifulSoup(r.text, 'html')
    title=soup.find_all('h2')
    date=soup.find_all('span', attrs={'class':'updated'})
    if soup.find_all('a', attrs={'class':'page-numbers'}):
        website = f"https://catania.liveuniversity.it/attualita/page/{page_num}"
        page_num +=1
        print(page_num)
    else:
        break

df = pd.DataFrame(list(zip(dates, titles)),
                  columns =['Date', 'Titles'])
I think there have been some changes in the tags (for example in the next-page button, or just in the date/title tags).
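One way to test that hypothesis is to count, page by page, how many elements each of the three selectors from the loop actually matches. The following is only a minimal sketch: it uses the standard library's `HTMLParser` in place of BeautifulSoup so that it runs without extra dependencies, and the names `TagCounter` and `count_tags` as well as the sample markup are made up for illustration, not taken from the real site.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count the elements the scraping loop relies on in one HTML page."""

    def __init__(self):
        super().__init__()
        self.counts = {"h2": 0, "span.updated": 0, "a.page-numbers": 0}

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h2":
            self.counts["h2"] += 1
        elif tag == "span" and "updated" in classes:
            self.counts["span.updated"] += 1
        elif tag == "a" and "page-numbers" in classes:
            self.counts["a.page-numbers"] += 1


def count_tags(html):
    """Return a dict with the number of matches per selector."""
    parser = TagCounter()
    parser.feed(html)
    return parser.counts


# Hypothetical sample markup, not copied from the real site.
sample = """
<h2><a href="#">First headline</a></h2>
<span class="updated">2016-01-01</span>
<a class="page-numbers" href="/attualita/page/2/">2</a>
"""
print(count_tags(sample))
```

Feeding `r.text` from a page close to where the scrape stops into `count_tags` would show whether the `a.page-numbers` count drops to zero there, which is the condition that makes the loop above hit `break`.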
Recommended answer
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
import pandas as pd


def main(req, num):
    # Request page `num` directly instead of following the pagination links.
    r = req.get(
        "https://catania.liveuniversity.it/attualita/page/{}/".format(num))
    soup = BeautifulSoup(r.content, 'html.parser')
    try:
        # One (date, title, content) tuple per article block on the page.
        data = [(x.select_one("span.updated").text, x.find_all("a")[1].text, x.select_one("div.entry-content").get_text(strip=True)) for x in soup.select(
            "div.col-lg-8.col-md-8.col-sm-8")]
        return data
    except AttributeError:
        # select_one() returned None: report the page whose markup differs.
        print(r.url)
        return False


with ThreadPoolExecutor(max_workers=30) as executor:
    with requests.Session() as req:
        # Submit all 672 pages at once; the futures stay in page order.
        fs = [executor.submit(main, req, num) for num in range(1, 673)]
        allin = []
        for f in fs:
            f = f.result()
            if f:
                allin.extend(f)
        df = pd.DataFrame.from_records(
            allin, columns=["Date", "Title", "Content"])
        print(df)
        df.to_csv("result.csv", index=False)
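The answer's code requests every page by its number and runs the requests through a thread pool, collecting the results in page order. The same submit-then-collect pattern can be tried without touching the network, as in the rough sketch below. Here `fetch_page`, `collect`, and the rule that every fourth page fails are hypothetical stand-ins, and the list of failed page numbers is an addition for illustration (the answer only prints the failing URL).

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_page(num):
    """Hypothetical stand-in for the network call: returns rows or False."""
    if num % 4 == 0:  # pretend every fourth page fails to parse
        return False
    return [(f"date-{num}", f"title-{num}")]


def collect(page_numbers, max_workers=4):
    """Fetch pages concurrently; return (rows in page order, failed pages)."""
    rows, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # submit() returns immediately, so all pages are queued up front
        futures = [(num, executor.submit(fetch_page, num))
                   for num in page_numbers]
        for num, future in futures:
            result = future.result()  # blocks until this page is done
            if result:
                rows.extend(result)
            else:
                failed.append(num)
    return rows, failed


rows, failed = collect(range(1, 9))
print(len(rows), failed)
```

Keeping the page numbers that returned nothing makes it possible to retry just those pages, or to inspect their markup, rather than rerunning the whole range.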