Clicking links with Python BeautifulSoup
Question
So I'm new to Python (I come from a PHP/JavaScript background), but I just wanted to write a quick script that crawls a website and all of its child pages to find all a tags with href attributes, counts how many there are, and then clicks each link. I can count all of the links, but I can't figure out how to "click" the links and then return the response codes.
    from bs4 import BeautifulSoup
    import urllib2
    import re

    def getLinks(url):
        html_page = urllib2.urlopen(url)
        soup = BeautifulSoup(html_page, "html.parser")
        links = []

        for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
            links.append(link.get('href'))
        return links

    anchors = getLinks("http://madisonmemorial.org/")
    # Click on links and return responses
    countMe = len(anchors)

    for anchor in anchors:
        i = getLinks(anchor)
        countMe += len(i)
        # Click on links and return responses

    print countMe
Is this even possible with BeautifulSoup?

Also, I'm not looking for exact code; all I'm really looking for is a point in the right direction, such as which function calls to use. Thanks!
Recommended answer
So with help from the comments, I decided to just use urlopen like this:
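    from bs4 import BeautifulSoup
    import urllib.request
    import urllib.error
    import re

    def getLinks(url):
        html_page = urllib.request.urlopen(url)
        soup = BeautifulSoup(html_page, "html.parser")
        links = []
        # Only absolute http:// links match this pattern.
        for link in soup.find_all('a', attrs={'href': re.compile("^http://")}):
            links.append(link.get('href'))
        return links

    def getStatus(url):
        # urlopen raises HTTPError for 4xx/5xx responses,
        # so the status code has to be read from the exception.
        try:
            return urllib.request.urlopen(url).getcode()
        except urllib.error.HTTPError as e:
            return e.code

    anchors = getLinks("http://madisonmemorial.org/")

    # "Click" each link on the start page and inspect the response code.
    for anchor in anchors:
        if getStatus(anchor) == 404:  # getcode() returns an int, not "404"
            pass  # Do stuff

    # Follow each link one level down, counting and checking its links too.
    countMe = len(anchors)
    for anchor in anchors:
        try:
            children = getLinks(anchor)
        except urllib.error.URLError:
            continue  # Skip pages that cannot be fetched
        countMe += len(children)
        for child in children:  # urlopen takes one URL, not a list
            if getStatus(child) == 404:
                pass  # Do some stuff

    print(countMe)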
I've got my own arguments in the if statements
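As a possible refinement of the urlopen approach above, here is a minimal sketch that only asks for response codes, assuming the page bodies are not needed. It sends HEAD requests, treats unreachable hosts as "no status", and tallies how often each code occurs. The names head_status and tally_statuses, the opener parameter, and the 10-second timeout are hypothetical choices for illustration and are not part of the original answer.

```python
import urllib.error
import urllib.request
from collections import Counter


def head_status(url, opener=urllib.request.urlopen, timeout=10):
    """Return the HTTP status code for url, or None if no response arrived.

    `opener` is a hypothetical injection point (an assumption, not from the
    original answer) so the function can be exercised without network access.
    """
    # A HEAD request asks for the status line and headers only, no body.
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return response.getcode()
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx; the code is on the exception.
        return err.code
    except OSError:
        # URLError, timeouts, and other socket-level failures: no status code.
        return None


def tally_statuses(urls, opener=urllib.request.urlopen):
    """Check each distinct URL once and count how often each status occurs."""
    return Counter(head_status(url, opener=opener) for url in set(urls))
```

With the getLinks function from the answer, tally_statuses(getLinks("http://madisonmemorial.org/")) would return a Counter keyed by status code. Some servers reject HEAD requests (for example with 405), so falling back to a normal GET for those URLs may be necessary.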