Simple web crawler
Question
I wrote the program below in Python as a very simple web crawler, but when I run it, it fails with the error "'NoneType' object is not callable". Could you please help me?
import BeautifulSoup
import urllib2

def union(p, q):
    for e in q:
        if e not in p:
            p.append(e)

def crawler(SeedUrl):
    tocrawl = [SeedUrl]
    crawled = []
    while tocrawl:
        page = tocrawl.pop()
        pagesource = urllib2.urlopen(page)
        s = pagesource.read()
        soup = BeautifulSoup.BeautifulSoup(s)
        links = soup('a')
        if page not in crawled:
            union(tocrawl, links)
            crawled.append(page)
    return crawled

crawler('http://www.princeton.edu/main/')
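soup('a') returns the complete HTML tags, not just their URLs. For example:
<a href="http://itunes.apple.com/us/store">Buy Music Now</a>
Because tocrawl then holds tag objects rather than URL strings, urlopen fails with "'NoneType' object is not callable". You need to extract only the URL, that is, the href attribute of each tag.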
links = soup.findAll('a', href=True)
for l in links:
    print(l['href'])
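For readers without BeautifulSoup installed, the same idea (keep only the href strings, never the tag objects) can be sketched with the standard library. This is a minimal illustration under assumptions, not the answer's method: it targets Python 3, it swaps in html.parser for BeautifulSoup, the names HrefCollector and extract_hrefs are hypothetical, and resolving relative links with urljoin is an extra step the original answer does not take.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag as a plain string."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names in lower case.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Assumption: relative links are resolved against the page URL.
                self.hrefs.append(urljoin(self.base_url, value))


def extract_hrefs(html, base_url):
    """Return the list of link targets found in an HTML string."""
    parser = HrefCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.hrefs


sample = (
    '<a href="http://itunes.apple.com/us/store">Buy Music Now</a>'
    '<a name="top">no link here</a>'
    '<a href="/about">About</a>'
)
print(extract_hrefs(sample, "http://www.princeton.edu/main/"))
```

The anchor without an href is skipped, which mirrors what href=True does in the findAll call above.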
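You also need to validate the URL. Refer to the following answers:
- How do you validate a URL with a regular expression in Python? (http://stackoverflow.com/questions/827557/how-do-you-validate-a-url-with-a-regular-expression-in-python)
- Python - How to validate a url in python? (Malformed or not) (http://stackoverflow.com/questions/7160737/python-how-to-validate-a-url-in-python-malformed-or-not)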
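As a lighter alternative to a regular expression, a URL can be checked structurally by parsing it. The sketch below is only an assumption about what "valid" should mean for a crawler (an accepted scheme plus a non-empty host); it targets Python 3, where urlparse lives in urllib.parse, and the function name is_probably_valid_url is hypothetical.

```python
from urllib.parse import urlparse

# Assumption: only these schemes are worth crawling.
ALLOWED_SCHEMES = ("http", "https", "ftp", "ftps")


def is_probably_valid_url(url):
    """Return True if url has an allowed scheme and a non-empty host part."""
    try:
        parts = urlparse(url)
    except ValueError:
        # urlparse raises ValueError for some malformed inputs,
        # such as an unclosed IPv6 bracket.
        return False
    return parts.scheme in ALLOWED_SCHEMES and bool(parts.netloc)


for candidate in (
    "http://www.princeton.edu/main/",
    "/main/index.html",
    "mailto:someone@example.com",
    "http://",
):
    print(candidate, is_probably_valid_url(candidate))
```

This check is looser than a full regular expression: it rejects relative paths and non-web schemes but does not verify that the host name itself is well formed.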
Again, I would suggest using Python sets instead of lists, since a set makes it easy to add URLs while omitting duplicates.
Try the following code:
import re
import urllib2
import BeautifulSoup

# URL pattern: scheme, then a domain name, localhost, or an IPv4 address,
# followed by an optional port and path.
regex = re.compile(
    r'^(?:http|ftp)s?://'  # http://, https://, ftp:// or ftps://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'  # domain...
    r'localhost|'  # localhost...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # ...or ip
    r'(?::\d+)?'  # optional port
    r'(?:/?|[/?]\S+)$', re.IGNORECASE)

def isValidUrl(url):
    # True only if the whole string matches the URL pattern
    return regex.match(url) is not None

def crawler(SeedUrl):
    tocrawl = [SeedUrl]
    crawled = []
    while tocrawl:
        page = tocrawl.pop()
        print 'Crawled:' + page
        pagesource = urllib2.urlopen(page)
        s = pagesource.read()
        soup = BeautifulSoup.BeautifulSoup(s)
        # Only <a> tags that actually carry an href attribute
        links = soup.findAll('a', href=True)
        if page not in crawled:
            for l in links:
                # Queue the href string, not the tag object
                if isValidUrl(l['href']):
                    tocrawl.append(l['href'])
            crawled.append(page)
    return crawled

crawler('http://www.princeton.edu/main/')
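The suggestion to use sets can be sketched as follows. This is an illustration under assumptions rather than a drop-in replacement: it targets Python 3, it visits pages breadth-first with a deque where the code above pops from the end of a list, the fetching step is replaced by a hypothetical get_links callback, and FAKE_SITE is an invented in-memory link graph so the example runs without network access.

```python
from collections import deque


def crawl(seed_url, get_links):
    """Breadth-first crawl; get_links(url) returns the URLs found on a page."""
    to_crawl = deque([seed_url])
    seen = {seed_url}  # every URL ever queued, so nothing is queued twice
    crawled = []       # visit order, kept as a list for the return value
    while to_crawl:
        page = to_crawl.popleft()
        crawled.append(page)
        for link in get_links(page):
            if link not in seen:  # set membership is a constant-time check
                seen.add(link)
                to_crawl.append(link)
    return crawled


# Hypothetical in-memory "site" standing in for real HTTP fetches.
FAKE_SITE = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/", "http://example.com/b"],
    "http://example.com/b": ["http://example.com/a"],
}

print(crawl("http://example.com/", lambda url: FAKE_SITE.get(url, [])))
```

Because each URL is recorded in seen at the moment it is queued, every page is fetched at most once even when pages link back to each other, and the loop ends once the reachable pages run out.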