How to scrape images from an aspx page?


Problem description

I am trying to scrape images from an aspx page. I have code that scrapes images from a normal web page, but it can't scrape an aspx page, because I need to send an HTTP POST request to the aspx page and I can't figure out how to do that, even after reading a few threads. This is the original code:

from bs4 import BeautifulSoup as bs
import urlparse
import urllib2
from urllib import urlretrieve
import os
import sys
import subprocess
import re


def thefunc(url, out_folder):

    c = False

I have already defined headers for the aspx page and an if statement that distinguishes between a normal page and an aspx page:

    select =  raw_input('Is this a .net  aspx page ? y/n : ')
    if select.lower().startswith('y'):
        usin = raw_input('Specify origin of .net page : ')
        usaspx = raw_input('Specify aspx page url : ')

the header for the aspx page:

        headdic = {
            'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Origin': usin,
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko)  Chrome/24.0.1312.57 Safari/537.17',
            'Content-Type': 'application/x-www-form-urlencoded',
            'Referer': usaspx,
            'Accept-Encoding': 'gzip,deflate,sdch',
            'Accept-Language': 'en-US,en;q=0.8',
            'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
        }
        c = True

    if c:
        req = urllib2.Request(url, headers=headdic)
    else:
        req = urllib2.Request(url, headers={'User-Agent' : "Magic Browser"})
    resp = urllib2.urlopen(req)

    soup = bs(resp, 'lxml')

    parsed = list(urlparse.urlparse(url))

    print '\n',len(soup.findAll('img')), 'images are about to be downloaded'

    for image in soup.findAll("img"):

        print "Image: %(src)s" % image

        filename = image["src"].split("/")[-1]

        parsed[2] = image["src"]

        outpath = os.path.join(out_folder, filename)

        try:

            if image["src"].lower().startswith("http"):
                urlretrieve(image["src"], outpath)
            else:
                urlretrieve(urlparse.urlunparse(parsed), outpath)
        except:
            print 'OOPS missed one for some reason !!'
            pass


try:
    put = raw_input('Please enter the page url : ')
    reg1 = re.compile('^https?://', re.IGNORECASE)
    if not reg1.match(put):
        raise ValueError('not a http(s) url')
except:
    print('Type the url carefully !!')
    sys.exit()
fol = raw_input('Enter the foldername to save the images : ')
if os.path.isdir(fol):
    thefunc(put, fol)
else:
    subprocess.call(['mkdir', fol])
    thefunc(put, fol)

I have made a few modifications for the aspx detection and for creating the header for the aspx page, but I don't know what to modify next; I am stuck here.

***Here is the aspx page link:*** http://www.foxrun.com.au/Products/Cylinders_with_Gadgets.aspx

Sorry if I am not being clear; as you can see, I am new to programming. What I am asking is: how can I get the images that appear on the aspx page when I click the next-page button in the browser? Right now I can only scrape one page, because the URL does not change unless I somehow send an HTTP POST that tells the page to show the next page with new pictures. Since the URL stays the same, I hope I am clear.

Solution

You can do it using requests by posting to the url with the correct data, which you can parse from the initial page:

import requests
from bs4 import BeautifulSoup
from urlparse import urljoin
from itertools import chain

url = "http://www.foxrun.com.au/Products/Cylinders_with_Gadgets.aspx"


def validate(soup):
    return {"__VIEWSTATE": soup.select_one("#__VIEWSTATE")["value"],
            "__VIEWSTATEGENERATOR": soup.select_one("#__VIEWSTATEGENERATOR")["value"],
            "__EVENTVALIDATION": soup.select_one("#__EVENTVALIDATION")["value"]}


def parse(base, url):
    data = {"__ASYNCPOST": "true"}
    h = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko)  Chrome/24.0.1312.57 Safari/537.17'}
    soup = BeautifulSoup(requests.get(url).text, "lxml")
    # add the hidden-field tokens needed for the post
    data.update(validate(soup))
    # gets links for < 1,2,3,4,5,6 >
    pages = [a["id"] for a in soup.select("a[id^=ctl01_ctl00_pbsc1_pbPagerBottom_btnP]")][2:]
    # get images from the initial page
    yield [urljoin(base, img["src"]) for img in soup.select("img")]
    for p in pages:
        # the form data needs $ in place of _
        data["__EVENTTARGET"] = p.replace("_", "$")
        data["RadScriptManager1"] = "ctl01$ctl00$pbsc1$ctl01$ctl00$pbsc1$ajaxPanel1Panel|{}".format(p.replace("_", "$"))
        r = requests.post(url, data=data, headers=h).text
        soup = BeautifulSoup(r, "lxml")
        yield [urljoin(base, img["src"]) for img in soup.select("img")]


for url in chain.from_iterable(parse("http://www.foxrun.com.au/", url)):
    print(url)

That will give you the links; you just have to download the content and write it to file. Normally we could create a Session and go from one page to the next, but in this case what is posted is ctl01$ctl00$pbsc1$pbPagerBottom$btnNext, which would work fine going from the initial page to the second, but there is no concept of going from the second to the third and so on, as we have no page number in the form data.
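For the download step itself, a minimal sketch could look like the following. (The thread's code is Python 2; this sketch is Python 3, so urlparse becomes urllib.parse. The helper name and sample URL are made up for illustration; it mirrors the image["src"].split("/")[-1] idea from the question.)

```python
import os
import posixpath
from urllib.parse import urlparse

def filename_from_url(url):
    # Take the last path segment as the local filename,
    # falling back to a placeholder when the path ends in "/".
    name = posixpath.basename(urlparse(url).path)
    return name or "index"

def download_all(urls, out_folder):
    # requests is assumed to be installed, as in the answer above.
    import requests
    os.makedirs(out_folder, exist_ok=True)
    for url in urls:
        path = os.path.join(out_folder, filename_from_url(url))
        r = requests.get(url, stream=True, timeout=30)
        r.raise_for_status()
        with open(path, "wb") as f:
            for chunk in r.iter_content(8192):
                f.write(chunk)

print(filename_from_url("http://www.foxrun.com.au/images/cyl1.jpg"))  # cyl1.jpg
```

You would feed it the flattened list from parse, e.g. download_all(chain.from_iterable(parse(base, url)), "images").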
