How to scrape images from an aspx page?


Problem description

I am trying to scrape images from an aspx page. I have this code that scrapes images from a normal webpage, but it can't scrape the aspx page because I need to send HTTP POST requests to it, and I can't figure out how to do that even after reading a few threads. This is the original code:

from bs4 import BeautifulSoup as bs
import urlparse
import urllib2
from urllib import urlretrieve
import os
import sys
import subprocess
import re


def thefunc(url, out_folder):

    c = False

I have defined the headers for the aspx page and an if statement that distinguishes a normal page from an aspx page:

    select =  raw_input('Is this a .net  aspx page ? y/n : ')
    if select.lower().startswith('y'):
        usin = raw_input('Specify origin of .net page : ')
        usaspx = raw_input('Specify aspx page url : ')

Headers for the aspx page:

        headdic = {
            'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Origin': usin,
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko)  Chrome/24.0.1312.57 Safari/537.17',
            'Content-Type': 'application/x-www-form-urlencoded',
            'Referer': usaspx,
            'Accept-Encoding': 'gzip,deflate,sdch',
            'Accept-Language': 'en-US,en;q=0.8',
            'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
        }
        c = True

    if c:
        req = urllib2.Request(url, headers=headdic)
    else:
        req = urllib2.Request(url, headers={'User-Agent' : "Magic Browser"})
    resp = urllib2.urlopen(req)
    
    soup = bs(resp, 'lxml')
    
    parsed = list(urlparse.urlparse(url))

    print '\n',len(soup.findAll('img')), 'images are about to be downloaded'

    for image in soup.findAll("img"):
        
        print "Image: %(src)s" % image
        
        filename = image["src"].split("/")[-1]
        
        parsed[2] = image["src"]
        
        outpath = os.path.join(out_folder, filename)

        try:
        
            if image["src"].lower().startswith("http"):
                urlretrieve(image["src"], outpath)
            else:
                urlretrieve(urlparse.urlunparse(parsed), outpath)
        except Exception as e:
            print 'OOPS missed one: %s' % e


put = raw_input('Please enter the page url : ')
if not re.match(r'^https?://', put, re.IGNORECASE):
    print('Type the url carefully !!')
    sys.exit()
fol = raw_input('Enter the foldername to save the images : ')
if os.path.isdir(fol):
    thefunc(put, fol)
else:
    os.makedirs(fol)  # subprocess.call('mkdir', fol) passed fol as bufsize; makedirs is simpler and portable
    thefunc(put, fol)

I made some modifications for aspx detection and for building the aspx page's headers, but how do I go on from there? This is where I'm stuck.

***Here is the aspx page link:*** http://www.foxrun.com.au/Products/Cylinders_with_Gadgets.aspx

Sorry if I wasn't clear; as you can see, I'm new to programming. What I'm asking is how to get the images from the aspx page that appear when I click the next-page button in the browser. Right now I can only scrape one page, because the URL doesn't change: I would have to send an HTTP POST that somehow tells the page to show the next page with new pictures, since the URL stays the same. I hope that's clear.
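For context on why a plain GET is not enough: ASP.NET pages drive paging through hidden form fields (`__VIEWSTATE` and friends) that must be posted back to the same URL. A quick way to see what a given page expects is to list its hidden inputs. This is a minimal illustrative sketch, assuming BeautifulSoup is installed; `hidden_fields` and the sample markup are made up for demonstration, not part of the original code:

```python
from bs4 import BeautifulSoup

def hidden_fields(html):
    """Return a name -> value mapping of every hidden <input> in the page."""
    soup = BeautifulSoup(html, "html.parser")
    return {inp["name"]: inp.get("value", "")
            for inp in soup.select("input[type=hidden]") if inp.get("name")}

# Typical ASP.NET markup carries state tokens like these:
sample = """
<form action="Cylinders_with_Gadgets.aspx" method="post">
  <input type="hidden" name="__VIEWSTATE" value="abc" />
  <input type="hidden" name="__EVENTVALIDATION" value="def" />
</form>
"""
print(hidden_fields(sample))  # {'__VIEWSTATE': 'abc', '__EVENTVALIDATION': 'def'}
```

Whatever shows up here has to be echoed back in the POST body, which is exactly what the answer below does.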

Recommended answer

You can do it using requests by posting to the URL with the correct data, which you can parse from the initial page:

import requests
from bs4 import BeautifulSoup
from urlparse import urljoin
from itertools import chain

url = "http://www.foxrun.com.au/Products/Cylinders_with_Gadgets.aspx"


def validate(soup):
    return {"__VIEWSTATE": soup.select_one("#__VIEWSTATE")["value"],
            "__VIEWSTATEGENERATOR": soup.select_one("#__VIEWSTATEGENERATOR")["value"],
            "__EVENTVALIDATION": soup.select_one("#__EVENTVALIDATION")["value"]}


def parse(base, url):
    data = {"__ASYNCPOST": "true"}
    h = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko)  Chrome/24.0.1312.57 Safari/537.17'}
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    # add the hidden-field tokens (__VIEWSTATE etc.) to the post data
    data.update(validate(soup))
    # gets links for < 1,2,3,4,5,6>
    pages = [a["id"] for a in soup.select("a[id^=ctl01_ctl00_pbsc1_pbPagerBottom_btnP]")][2:]
    # get images from initial page
    yield [img["src"] for img in soup.select("img")]
    for p in pages:
        # we need $ in place of _ for the form data
        data["__EVENTTARGET"] = p.replace("_", "$")
        data["RadScriptManager1"] = "ctl01$ctl00$pbsc1$ctl01$ctl00$pbsc1$ajaxPanel1Panel|{}".format(p.replace("_", "$"))
        r = requests.post(url, data=data, headers=h).text
        soup = BeautifulSoup(r, "html.parser")
        yield [urljoin(base, img["src"]) for img in soup.select("img")]


for url in chain.from_iterable(parse("http://www.foxrun.com.au/", url)):
    print(url)

That will give you the links; you just have to download the content and write it to file. Normally we could create a Session and go from one page to the next, but in this case what is posted is ctl01$ctl00$pbsc1$pbPagerBottom$btnNext, which would work fine going from the initial page to the second, but there is no concept of going from the second to the third and so on, as we have no page number in the form data.
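To finish the job, the yielded links can be fetched and written to disk. A minimal sketch, assuming requests is available; `download_images` and `filename_from_url` are hypothetical helper names, not part of the answer, and the code uses Python 3's `urllib.parse` (the Python 3 home of the `urlparse` module used above):

```python
import os
import requests
from urllib.parse import urlparse

def filename_from_url(u):
    """Derive a filename from the last path segment of an image URL."""
    return os.path.basename(urlparse(u).path) or "unnamed"

def download_images(urls, out_folder):
    """Fetch each URL and write the raw bytes to out_folder; return saved paths."""
    os.makedirs(out_folder, exist_ok=True)
    saved = []
    for u in urls:
        outpath = os.path.join(out_folder, filename_from_url(u))
        try:
            r = requests.get(u, timeout=10)
            r.raise_for_status()
            with open(outpath, "wb") as f:
                f.write(r.content)  # images are binary, so use .content, not .text
            saved.append(outpath)
        except requests.RequestException as e:
            print("skipped %s: %s" % (u, e))
    return saved
```

Passing the generator from the answer straight in, e.g. `download_images(chain.from_iterable(parse(base, url)), fol)`, keeps the whole flow lazy.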

