How to speed up parsing using BeautifulSoup?


Problem description

I want to compile a list of music festivals in Korea, so I tried scraping a website that sells festival tickets:

import requests
from bs4 import BeautifulSoup

INTERPARK_BASE_URL = 'http://ticket.interpark.com'

# Festival list page
req = requests.get('http://ticket.interpark.com/TPGoodsList.asp?Ca=Liv&SubCa=Fes')
html = req.text
soup = BeautifulSoup(html, 'lxml')
for title_raw in soup.find_all('span', class_='fw_bold'):
    title = str(title_raw.find('a').text)
    url_raw = str(title_raw.find('a').get('href'))
    url = INTERPARK_BASE_URL + url_raw

    # Detail page
    req_detail = requests.get(url)
    html_detail = req_detail.text
    soup_detail = BeautifulSoup(html_detail, 'lxml')
    details_1 = soup_detail.find('table', class_='table_goods_info')
    details_2 = soup_detail.find('ul', class_='info_Lst')
    image = soup_detail.find('div', class_='poster')

    singers = str(details_1.find_all('td')[4].text)
    place = str(details_1.find_all('td')[5].text)
    date_text = str(details_2.find('span').text)
    image_url = str(image.find('img').get('src'))

    print(title)
    print(url)
    print(singers)
    print(place)
    print(date_text)
    print(image_url)

I use a for loop to visit every detail page linked from the list page, but loading each detail page is too slow.

How can I speed up my code?

Solution


import requests
from bs4 import BeautifulSoup
import json
from datetime import datetime as dt
import csv


def Soup(content):
    soup = BeautifulSoup(content, 'html.parser')
    return soup


def Main(url):
    r = requests.get(url)
    soup = Soup(r.content)
    spans = soup.find_all('span', class_='fw_bold')
    # url[:27] is the site root, "http://ticket.interpark.com"
    links = [f"{url[:27]}{span.a['href']}" for span in spans]
    return links


def Parent():
    links = Main(
        "http://ticket.interpark.com/TPGoodsList.asp?Ca=Liv&SubCa=Fes")
    with open("result.csv", 'w', newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Name", "Singers", "Location", "Date", "ImageUrl"])
        with requests.Session() as req:
            for link in links:
                r = req.get(link)
                soup = Soup(r.content)
                script = json.loads(
                    soup.find("script", type="application/ld+json").text)
                name = script["name"]
                print(f"Extracting: {name}")
                singers = script["performer"]["name"]
                location = script["location"]["name"]
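                # Takes the 4th and 5th JSON-LD values, presumably the start and
                # end dates; this relies on the key order of the page's JSON.
                datelist = list(script.values())[3:5]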
                datest = []
                image = script["image"]
                for date in datelist:
                    date = dt.strptime(date,
                                       '%Y%m%d').strftime('%d-%m-%Y')
                    datest.append(date)
                writer.writerow(
                    [name, singers, location, " : ".join(datest), *image])


Parent()
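
The code above reads every field from the page's JSON-LD `<script>` block, so the rest of the HTML does not need to become a parse tree. A minimal sketch, assuming that holds for all detail pages: BeautifulSoup's `SoupStrainer`, passed as `parse_only`, limits parsing to that one tag. `extract_ld_json` and `SAMPLE_HTML` are hypothetical names, and the sample markup is made up rather than copied from Interpark.

```python
import json

from bs4 import BeautifulSoup, SoupStrainer

# Only <script type="application/ld+json"> elements are kept while parsing.
LD_JSON_ONLY = SoupStrainer("script", attrs={"type": "application/ld+json"})


def extract_ld_json(html):
    """Return the page's first JSON-LD block as a dict, or None if absent."""
    soup = BeautifulSoup(html, "html.parser", parse_only=LD_JSON_ONLY)
    tag = soup.find("script")
    if tag is None or not tag.string:
        return None
    return json.loads(tag.string)


# Made-up sample page for illustration; not Interpark's real markup.
SAMPLE_HTML = """
<html><head>
<script type="text/javascript">var x = 1;</script>
<script type="application/ld+json">
{"name": "Example Festival", "location": {"name": "Example Hall"}}
</script>
</head><body><div class="poster"><img src="/p.jpg"></div></body></html>
"""

if __name__ == "__main__":
    data = extract_ld_json(SAMPLE_HTML)
    print(data["name"], "/", data["location"]["name"])
```

How much this saves depends on page size. If the network round trip dominates, the gain from parsing less may be small.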

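In a loop like this, much of the time often goes to waiting on HTTP responses rather than parsing, so fetching the detail pages concurrently may help more than a faster parser. The sketch below uses the standard library's `ThreadPoolExecutor`. `fetch_all` and its `fetch` parameter are hypothetical names, not part of the original answer, and the default of 8 workers is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Call fetch(url) for every URL on a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of `urls`, not completion order.
        return list(pool.map(fetch, urls))


# Hypothetical usage with the functions from the solution (needs network access):
#
#   import requests
#   links = Main("http://ticket.interpark.com/TPGoodsList.asp?Ca=Liv&SubCa=Fes")
#   pages = fetch_all(links, lambda url: requests.get(url, timeout=10).content)
#   for html in pages:
#       soup = Soup(html)
```

The commented usage calls `requests.get` per URL rather than sharing one `Session` across threads, since sharing a session between threads may not be safe. Keeping `max_workers` small also avoids putting heavy load on the ticket site.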
