Python - Issue Scraping with BeautifulSoup

Question
I'm trying to scrape the Stack Overflow jobs page using Beautiful Soup 4 and urllib as a personal project. I'm facing an issue where I'm trying to scrape all the links to the 50 jobs listed on each page. I'm using a regex to identify these links. Even though I reference the tag properly, I am facing these two specific issues:
Instead of the 50 links clearly visible in the source code, I get only 25 results each time as my output (after accounting for and removing an initial irrelevant link).
The links are ordered differently in the source code than in my output.
Here's my code. Any help on this will be greatly appreciated:
import bs4
import urllib.request
import re
# Obtaining source code to parse
sauce = urllib.request.urlopen('https://stackoverflow.com/jobs?med=site-ui&ref=jobs-tab&sort=p&pg=0').read()
soup = bs4.BeautifulSoup(sauce, 'html.parser')

snippet = soup.find_all("script", type="application/ld+json")
strsnippet = str(snippet)
print(strsnippet)

joburls = re.findall('https://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', strsnippet)
print("Urls: ", joburls)
print(len(joburls))
Answer

Disclaimer: I did some asking of my own for a part of this answer.
from bs4 import BeautifulSoup
import requests
import json
# note: link is slightly different; yours just redirects here
link = 'https://stackoverflow.com/jobs?med=site-ui&ref=jobs-tab&sort=p'
r = requests.get(link)
soup = BeautifulSoup(r.text, 'html.parser')
s = soup.find('script', type='application/ld+json')
urls = [el['url'] for el in json.loads(s.text)['itemListElement']]
print(len(urls))
# 50
Process:

- Use soup.find rather than soup.find_all. This will give a single bs4.element.Tag whose text is the JSON.
- json.loads(s.text) is a nested dict. Access the value for the itemListElement key to get a list of url-containing dicts, and convert those urls to a list.
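To make the nested-dict structure concrete, here is a minimal, self-contained sketch of the json.loads step. The inline snippet below is an assumption that mimics the shape of a schema.org ItemList payload (the real Stack Overflow ld+json block is much larger); only the itemListElement/url access pattern is taken from the answer above.

```python
import json

# Hypothetical ld+json payload, shaped like a schema.org ItemList
# (for illustration only; not the actual Stack Overflow data).
snippet = '''
{
  "@context": "https://schema.org",
  "@type": "ItemList",
  "itemListElement": [
    {"@type": "ListItem", "position": 1, "url": "https://stackoverflow.com/jobs/1"},
    {"@type": "ListItem", "position": 2, "url": "https://stackoverflow.com/jobs/2"}
  ]
}
'''

data = json.loads(snippet)   # top level is a dict, not a string
items = data['itemListElement']  # a list of dicts, one per job
urls = [el['url'] for el in items]
print(urls)
```

Because the urls come straight out of the parsed list, they keep the source order and there is no 25-vs-50 mismatch, which is exactly what the regex-over-str(ResultSet) approach could not guarantee.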