加快美丽汤 [英] Speeding up beautifulsoup

查看:63
本文介绍了加快美丽汤的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在运行此课程网站的抓取工具,我想知道一旦将它放入beautifulsoup,是否有更快的方法来抓取该页面.它花费的时间比我预期的更长.

I'm running a scraper of this course website and I'm wondering whether there's a faster way to scrape the page once I have it put into beautifulsoup. It takes way longer than I would have expected.

提示?

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support import expected_conditions as EC

from bs4 import BeautifulSoup

driver = webdriver.PhantomJS()
driver.implicitly_wait(10) # seconds
driver.get("https://acadinfo.wustl.edu/Courselistings/Semester/Search.aspx")
select = Select(driver.find_element_by_name("ctl00$Body$ddlSchool"))

parsedClasses = {}

for i in range(len(select.options)):
    print i
    select = Select(driver.find_element_by_name("ctl00$Body$ddlSchool"))
    select.options[i].click()
    upperLevelClassButton = driver.find_element_by_id("Body_Level500")
    upperLevelClassButton.click()
    driver.find_element_by_name("ctl00$Body$ctl15").click()

    soup = BeautifulSoup(driver.page_source, "lxml")

    courses = soup.select(".CrsOpen")
    for course in courses:
        courseName = course.find_next(class_="ResultTable")["id"][13:]
        parsedClasses[courseName] = []
        print courseName
        for section in course.select(".SecOpen"):
            classInfo = section.find_all_next(class_="ItemRowCenter")
            parsedClasses[courseName].append((int(classInfo[0].string), int(classInfo[1].string), int(classInfo[2].string)))

print parsedClasses
print parsedClasses['FL2014' + 'A46' + '3284']

driver.quit()

推荐答案

好的,您可以通过以下方法真正加快速度:

Okay, you can really speed this up by:

  • go down to the low-level - see what underlying requests are being made and simulate them
  • let BeautifulSoup use lxml parser
  • use SoupStrainer for parsing only relevant parts of a page

由于这是ASP.NET生成的形式,并且由于其安全功能,因此情况变得有些复杂.这是完整的代码,请不要担心-我已经添加了注释并可以提出疑问:

Since this is ASP.NET generated form and due to it's security features, things get a bit more complicated. Here's the complete code, don't be afraid of it - I've added comments and open to questions:

import re
from bs4 import BeautifulSoup, SoupStrainer
import requests

# start session and get the search page
session = requests.Session()
response = session.get('https://acadinfo.wustl.edu/Courselistings/Semester/Search.aspx')

# parse the search page using SoupStrainer and lxml
strainer = SoupStrainer('form', attrs={'id': 'form1'})
soup = BeautifulSoup(response.content, 'lxml', parse_only=strainer)

# get the view state, event target and validation values
viewstate = soup.find('input', id='__VIEWSTATE').get('value')
eventvalidation = soup.find('input', id='__EVENTVALIDATION').get('value')
search_button = soup.find('input', value='Search')
event_target = re.search(r"__doPostBack\('(.*?)'", search_button.get('onclick')).group(1)

# configure post request parameters
data = {
    '__EVENTTARGET': event_target,
    '__EVENTARGUMENT': '',
    '__LASTFOCUS': '',
    '__VIEWSTATE': viewstate,
    '__EVENTVALIDATION': eventvalidation,
    'ctl00$Body$ddlSemester': '201405',
    'ctl00$Body$ddlSession': '',
    'ctl00$Body$ddlDept': '%',
    'ctl00$Body$ddlAttributes': '0',
    'ctl00$Body$Days': 'rbAnyDay',
    'ctl00$Body$Time': 'rbAnyTime',
    'ctl00$Body$cbMorning': 'on',
    'ctl00$Body$cbAfternoon': 'on',
    'ctl00$Body$cbEvening': 'on',
    'ctl00$Body$tbStart': '9:00am',
    'ctl00$Body$tbEnds': '5:00pm',
    'ctl00$Body$ddlUnits': '0',
    'ctl00$Body$cbHideIStudy': 'on',
    'ctl00$Body$courseList$hidHoverShow': 'Y',
    'ctl00$Body$courseList$hidDeptBarCnt': '',
    'ctl00$Body$courseList$hidSiteURL': 'https://acadinfo.wustl.edu/Courselistings',
    'ctl00$Body$courseList$hidExpandDetail': '',
    'ctl00$Body$hidDay': ',1,2,3,4,5,6,7',
    'ctl00$Body$hidLevel': '1234',
    'ctl00$Body$hidDefLevel': ''
}

# get the list of options
strainer = SoupStrainer('div', attrs={'id': 'Body_courseList_tabSelect'})
options = soup.select('#Body_ddlSchool > option')
for option in options:
    print "Processing {option} ...".format(option=option.text)

    data['ctl00$Body$ddlSchool'] = option.get('value')

    # make the search post request for a particular option
    response = session.post('https://acadinfo.wustl.edu/Courselistings/Semester/Search.aspx',
                            data=data)
    result_soup = BeautifulSoup(response.content, parse_only=strainer)
    print [item.text[:20].replace('&nbsp', ' ') + '...' for item in result_soup.select('div.CrsOpen')]

打印:

Processing Architecture ...
[u'A46 ARCH 100...', u'A46 ARCH 111...', u'A46 ARCH 209...', u'A46 ARCH 211...', u'A46 ARCH 266...', u'A46 ARCH 305...', u'A46 ARCH 311...', u'A46 ARCH 323...', u'A46 ARCH 328...', u'A46 ARCH 336...', u'A46 ARCH 343...', u'A46 ARCH 350...', u'A46 ARCH 355...', u'A46 ARCH 411...', u'A46 ARCH 422...', u'A46 ARCH 428...', u'A46 ARCH 436...', u'A46 ARCH 445...', u'A46 ARCH 447...', u'A46 ARCH 465...', u'A48 LAND 451...', u'A48 LAND 453...', u'A48 LAND 461...']
Processing Art ...
[u'F10 ART 1052...', u'F10 ART 1073...', u'F10 ART 213A...', u'F10 ART 215A...', u'F10 ART 217B...', u'F10 ART 221A...', u'F10 ART 231I...', u'F10 ART 241D...', u'F10 ART 283T...', u'F10 ART 301A...', u'F10 ART 311E...', u'F10 ART 313D...', u'F10 ART 315B...', u'F10 ART 317H...', u'F10 ART 323A...', u'F10 ART 323B...', u'F10 ART 323C...', u'F10 ART 329C...', u'F10 ART 337E...', u'F10 ART 337F...', u'F10 ART 337H...', u'F10 ART 385A...', u'F10 ART 391M...', u'F10 ART 401A...', u'F10 ART 411E...', u'F10 ART 413D...', u'F10 ART 415B...', u'F10 ART 417H...', u'F10 ART 423A...', u'F10 ART 423B...', u'F10 ART 423C...', u'F10 ART 429C...', u'F10 ART 433C...', u'F10 ART 433D...', u'F10 ART 433E...', u'F10 ART 433K...', u'F10 ART 461C...', u'F10 ART 485A...', u'F20 ART 111P...', u'F20 ART 115P...', u'F20 ART 1186...', u'F20 ART 119C...', u'F20 ART 127A...', u'F20 ART 133B...', u'F20 ART 135G...', u'F20 ART 135I...', u'F20 ART 135J...', u'F20 ART 1361...', u'F20 ART 1363...', u'F20 ART 1713...', u'F20 ART 219C...', u'F20 ART 2363...', u'F20 ART 2661...', u'F20 ART 281S...', u'F20 ART 311P...', u'F20 ART 315P...', u'F20 ART 3183...', u'F20 ART 333B...', u'F20 ART 335A...', u'F20 ART 335J...', u'F20 ART 3713...', u'F20 ART 381S...', u'F20 ART 415P...', u'F20 ART 435I...']
...

当然,这里还有一些需要改进的地方,例如,我已经对其他表单值进行了硬编码-您可能应该解析可能的值并适当地设置它们.

There are certainly things to improve here, like, I've hardcoded the other form values - you should probably parse the possible values and set them appropriately.

另一个改进是将其与 grequests :

Another improvement would be to tie this up to grequests:

GRequests允许您将Requests与Gevent一起使用以进行异步 轻松进行HTTP请求.

GRequests allows you to use Requests with Gevent to make asynchronous HTTP Requests easily.


如您所见,当您处于较高级别并通过webdriver与浏览器进行交互时,您不必担心服务器实际请求来获取数据.这使得自动化很容易,但是会很慢.当您使用低级自动化时,您可以使用更多选项来加快处理速度,但是实现的复杂性会非常快速地增长.另外,请考虑这种解决方案的可靠性如何.因此,可以坚持使用黑匣子"解决方案,而继续使用selenium吗?

我还尝试使用以下方法解决问题:

I've also tried to solve the problem using:

,但是由于不同的原因而失败(可以为您提供相关的错误消息).不过,所有这三个工具都应该有助于简化解决方案.

but failed because of different reasons (can provide you with the relevant error messages). Though, all these 3 tools should have helped to simplify the solution.

也可以看到类似的主题:

Also see similar threads:

  • post request using python to asp.net page
  • how to submit query to .aspx page in python
  • Submitting a post request to an aspx page
  • How to get along with an ASP webpage

这篇关于加快美丽汤的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆