Beautifulsoup findall gets stuck without processing
Question
I'm trying to understand BeautifulSoup: I want to find all the links within facebook.com and then iterate over each link within those pages...
Here is my code... It works fine, but once it finds linkedin.com and iterates over it, it gets stuck at a point after this URL - http://www.linkedin.com/redir/redirect?url=http%3A%2F%2Fbusiness%2Elinkedin%2Ecom%2Ftalent-solutions%3Fsrc%3Dli-footer&urlhash=f9Nj
When I run linkedin.com separately, I don't have any problem...
Could this be a limitation of my operating system? I'm using Ubuntu Linux...
import urllib2
import BeautifulSoup
import re

def main_process(response):
    print "Main process started"
    soup = BeautifulSoup.BeautifulSoup(response)
    limit = '5'
    count = 0
    main_link = valid_link = re.search("^(https?://(?:\w+.)+\.com)(?:/.*)?$", "http://www.facebook.com")
    if main_link:
        main_link = main_link.group(1)
    print 'main_link = ', main_link
    result = {}
    result[main_link] = {'incoming': [], 'outgoing': []}
    print 'result = ', result
    for link in soup.findAll('a', href=True):
        if count < 10:
            valid_link = re.search("^(https?://(?:\w+.)+\.com)(?:/.*)?$", link.get('href'))
            if valid_link:
                #print 'Main link = ', link.get('href')
                print 'Links object = ', valid_link.group(1)
                connecting_link = valid_link.group(1)
                connecting_link = connecting_link.encode('ascii')
                if main_link <> connecting_link:
                    print 'outgoing link = ', connecting_link
                    result = add_new_link(connecting_link, result)
                    # Check if the outgoing is already added, if it is then don't add it
                    populate_result(result, main_link, connecting_link)
                    print 'result = ', result
                    print 'connecting'
                    request = urllib2.Request(connecting_link)
                    response = urllib2.urlopen(request)
                    soup = BeautifulSoup.BeautifulSoup(response)
                    for sublink in soup.findAll('a', href=True):
                        print 'sublink = ', sublink.get('href')
                        valid_link = re.search("^(https?://(?:\w+.)+\.com)(?:/.*)?$", sublink.get('href'))
                        if valid_link:
                            print 'valid_link = ', valid_link.group(1)
                            valid_link = valid_link.group(1)
                            if valid_link <> connecting_link:
                                populate_result(result, connecting_link, valid_link)
        count += 1
    print 'final result = ', result
    # print 'found a url with national-park in the link'

def add_new_link(connecting_link, result):
    result[connecting_link] = {'incoming': [], 'outgoing': []}
    return result

def populate_result(result, link, dest_link):
    if len(result[link]['outgoing']) == 0:
        result[link]['outgoing'].append(dest_link)
    else:
        found_in_list = 'Y'
        try:
            result[link]['outgoing'].index(dest_link)
            found_in_list = 'Y'
        except ValueError:
            found_in_list = 'N'
        if found_in_list == 'N':
            result[link]['outgoing'].append(dest_link)
    return result

if __name__ == "__main__":
    request = urllib2.Request("http://facebook.com")
    print 'process start'
    try:
        response = urllib2.urlopen(request)
        main_process(response)
    except urllib2.URLError, e:
        print "URLERROR"
    print "program ended"
Answer
The problem is that re.search() hangs on certain URLs at this line:
valid_link = re.search("^(https?://(?:\w+.)+\.com)(?:/.*)?$", sublink.get('href'))
For example, it hangs on the URL https://www.facebook.com/campaign/landing.php?placement=pflo&campaign_id=402047449186&extra_1=auto:
>>> import re
>>> s = "https://www.facebook.com/campaign/landing.php?placement=pflo&campaign_id=402047449186&extra_1=auto"
>>> re.search("^(https?://(?:\w+.)+\.com)(?:/.*)?$", s)
hanging "forever"...
It looks like the pattern introduces a catastrophic backtracking case that causes the regex search to hang.
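The ambiguity comes from the unescaped dot in `(?:\w+.)+`: there, `.` matches any character, including the word characters `\w+` already consumes, so the engine can split the host name in exponentially many ways before the trailing `\.com` fails to line up. One possible fix (a sketch, not the poster's code) is to escape the dot and drop the now-redundant `\.` before `com`, which makes each repetition unambiguous:

```python
import re

# Unambiguous version of the pattern: the dot is escaped, so every
# (?:\w+\.) chunk must end in a literal '.', and 'com' no longer needs
# a separate '\.' in front of it (the escaped dot already supplies it).
SAFE_URL_RE = re.compile(r"^(https?://(?:\w+\.)+com)(?:/.*)?$")

# The URL that made the original pattern hang now matches instantly.
s = ("https://www.facebook.com/campaign/landing.php"
     "?placement=pflo&campaign_id=402047449186&extra_1=auto")
m = SAFE_URL_RE.search(s)
print(m.group(1))  # -> https://www.facebook.com
```

Because `\w` can never match a literal `.`, the repeated group has exactly one way to consume the host, so the search runs in linear time even on inputs that don't match.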
One solution would be to use a different regex for validating the URL; there are plenty of options for that.
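Alternatively, you can skip the regex entirely and let the standard-library URL parser split the link. The helper below is a hypothetical sketch, not part of the original code; it uses Python 3's `urllib.parse` (in Python 2, as in the question's code, the same `urlparse` function lives in the `urlparse` module):

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2

def is_dotcom_link(url):
    """Return the scheme://host part if url is an http(s) .com link, else None."""
    parts = urlparse(url)
    if parts.scheme in ("http", "https") and parts.hostname \
            and parts.hostname.endswith(".com"):
        return "%s://%s" % (parts.scheme, parts.netloc)
    return None

print(is_dotcom_link("https://www.facebook.com/campaign/landing.php?placement=pflo"))
# -> https://www.facebook.com
print(is_dotcom_link("mailto:someone@example.com"))
# -> None (not an http(s) URL, so it is skipped instead of hanging)
```

Parsing is always linear in the length of the URL, so no input can trigger the exponential blow-up that a backtracking regex allows.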
Hope it helps.