Python urllib2 + Beautifulsoup

Problem description

So I'm struggling to implement BeautifulSoup into my current Python project. To keep this plain and simple, I'll reduce the complexity of my current script.

Script without BeautifulSoup -

import urllib2

    def check(self, name, proxy):
        # Route all requests through the supplied proxy
        urllib2.install_opener(
            urllib2.build_opener(
                urllib2.ProxyHandler({'http': 'http://%s' % proxy}),
                urllib2.HTTPHandler()
            )
        )

        # The data argument ("param=1") makes this a POST request
        req = urllib2.Request('http://example.com', "param=1")
        try:
            resp = urllib2.urlopen(req)
        except:
            self.insert()

        if 'example text' in resp.read():
            print 'success'

Now of course the indentation is wrong; this is just a sketch of what I have going on. In simple terms, I'm sending a POST request to "example.com", and if example.com contains "example text" in resp.read(), I print success.

But what I actually want is to check

if ' example ' in resp.read():

and then output the text inside the right-aligned td from the example.com response, using

soup.find_all('td', {'align':'right'})[4]

Now the way I'm implementing BeautifulSoup isn't working. An example of this -

import urllib2
from bs4 import BeautifulSoup as soup   # added

main_div = soup.find_all('td', {'align':'right'})[4]   # added

    def check(self, name, proxy):
        urllib2.install_opener(
            urllib2.build_opener(
                urllib2.ProxyHandler({'http': 'http://%s' % proxy}),
                urllib2.HTTPHandler()
            )
        )

        req = urllib2.Request('http://example.com', "param=1")
        try:
            resp = urllib2.urlopen(req)
            web_soup = soup(urllib2.urlopen(req), 'html.parser')   # added
        except:
            self.insert()

        if 'example text' in resp.read():
            print 'success' + main_div   # adjusted

Now you see I added 4 new lines/adjustments:

from bs4 import BeautifulSoup as soup

web_soup = soup(urllib2.urlopen(url), 'html.parser')

main_div = soup.find_all('td', {'align':'right'})[4]

as well as " + main_div " on the print line.

However, it just doesn't seem to be working. I've had a few errors whilst adjusting, some of which said "local variable referenced before assignment" and "unbound method find_all must be called with BeautifulSoup instance as first argument".

Answer

Regarding the last code snippet:

from bs4 import BeautifulSoup as soup

web_soup = soup(urllib2.urlopen(url), 'html.parser')
main_div = soup.find_all('td', {'align':'right'})[4]

You should call find_all on the web_soup instance. Also be sure to define the url variable before you use it:

from bs4 import BeautifulSoup as soup

url = "url to be opened"
web_soup = soup(urllib2.urlopen(url), 'html.parser')
main_div = web_soup.find_all('td', {'align':'right'})[4]
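
To tie this back into the question's check() method, here is a minimal sketch of how it could fit together. This is only an illustration under the question's own assumptions (example.com, "param=1", the proxy handling and self.insert() are placeholders carried over from the question), and get_text() is used so that a string, rather than the Tag object itself, is concatenated onto 'success':

import urllib2
from bs4 import BeautifulSoup as soup

    def check(self, name, proxy):
        # Route requests through the supplied proxy, as in the question
        urllib2.install_opener(
            urllib2.build_opener(
                urllib2.ProxyHandler({'http': 'http://%s' % proxy}),
                urllib2.HTTPHandler()
            )
        )

        req = urllib2.Request('http://example.com', "param=1")
        try:
            html = urllib2.urlopen(req).read()   # read the body once and reuse it
        except:
            self.insert()
            return

        if 'example text' in html:
            web_soup = soup(html, 'html.parser')
            cells = web_soup.find_all('td', {'align': 'right'})
            if len(cells) > 4:
                # print the text of the fifth right-aligned cell
                print 'success ' + cells[4].get_text()

Returning after self.insert() also avoids the "local variable referenced before assignment" error mentioned in the question, since html is only bound when urlopen() succeeds, and the length check guards against the page having fewer than five matching cells.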
