Pythonic BeautifulSoup4: How to get the remaining titles from the next-page link of a Wikipedia category
Question
I wrote the following code to get the titles of a Wikipedia category. The category contains more than 400 titles, but my output file only has 200 titles/pages. How can I extend my code to follow the category's "next page" link and collect all of the remaining titles?
Command: python3 getCATpages.py
Code of getCATpages.py:
from bs4 import BeautifulSoup
import requests
import csv

# getting all the contents of a url
url = 'https://en.wikipedia.org/wiki/Category:Free software'
content = requests.get(url).content
soup = BeautifulSoup(content, 'lxml')

# showing the category-pages summary
catPageSummaryTag = soup.find(id='mw-pages')
catPageSummary = catPageSummaryTag.find('p')
print(catPageSummary.text)

# showing the category pages only
tag = soup.find(id='mw-pages')
links = tag.findAll('a')

# numbering the output and limiting the print to three
counter = 1
for link in links[:3]:
    print(' ' + str(counter) + " " + link.text)
    counter = counter + 1

# getting the category pages
catpages = soup.find(id='mw-pages')
whatlinksherelist = catpages.find_all('li')
things_to_write = []
for titles in whatlinksherelist:
    things_to_write.append(titles.find('a').get('title'))

# writing the category pages to an output file
with open('001-catPages.csv', 'a') as csvfile:
    writer = csv.writer(csvfile, delimiter="\n")
    writer.writerow(things_to_write)
Answer
The idea is to follow the "next page" link until there is no such link left on the page (MediaWiki lists at most 200 category members per page, which is why the original script stops at 200). We'll maintain a web-scraping session while making multiple requests, collecting the desired link titles in a list:
from pprint import pprint
from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests

base_url = 'https://en.wikipedia.org/wiki/Category:Free software'

def get_next_link(soup):
    return soup.find("a", text="next page")

def extract_links(soup):
    return [a['title'] for a in soup.select("#mw-pages li a")]

with requests.Session() as session:
    content = session.get(base_url).content
    soup = BeautifulSoup(content, 'lxml')

    links = extract_links(soup)
    next_link = get_next_link(soup)
    while next_link is not None:  # while there is a "next page" link
        url = urljoin(base_url, next_link['href'])
        content = session.get(url).content
        soup = BeautifulSoup(content, 'lxml')

        links += extract_links(soup)
        next_link = get_next_link(soup)

pprint(links)
Prints:
['Free software',
'Open-source model',
'Outline of free software',
'Adoption of free and open-source software by public institutions',
...
'ZK Spreadsheet',
'Zulip',
'Portal:Free and open-source software']
The irrelevant CSV-writing part is omitted.
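If you do want to write the collected titles to a file as in the original script, a simple approach is one title per row with the standard csv module. This is a minimal sketch; the filename 001-catPages.csv is taken from the question's own script, and the short links list here is hypothetical stand-in data for the scraped result:

```python
import csv

# hypothetical stand-in for the scraped `links` list from the answer above
links = ['Free software', 'Open-source model', 'Zulip']

# write one title per row (filename taken from the question's script)
with open('001-catPages.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    for title in links:
        writer.writerow([title])
```

Writing one title per row avoids the unusual delimiter="\n" trick in the original code, which packs all titles into a single logical CSV record.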