Clicking links with Python BeautifulSoup

Question

I'm new to Python (I come from a PHP/JavaScript background), and I want to write a quick script that crawls a website and all of its child pages, finds every a tag with an href attribute, counts them, and then "clicks" each link. I can count all of the links, but I can't figure out how to "click" them and get back the response codes.

from bs4 import BeautifulSoup
import urllib2
import re

def getLinks(url):
    html_page = urllib2.urlopen(url)
    soup = BeautifulSoup(html_page, "html.parser")
    links = []

    for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
        links.append(link.get('href'))
    return links

anchors = getLinks("http://madisonmemorial.org/")
# Click on links and return responses
countMe = len(anchors)
for anchor in anchors:
    i = getLinks(anchor)
    countMe += len(i)
    # Click on links and return responses

print countMe

Is this even possible with BeautifulSoup?
I'm not looking for exact code, just a pointer in the right direction, such as which function calls to use. Thanks!

Recommended answer

BeautifulSoup only parses HTML; it does not make requests. So, with help from the comments, I decided to "click" each link by requesting it with urlopen, like this:

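from bs4 import BeautifulSoup
import urllib.request
import urllib.error
import re

def getLinks(url):
    # Collect the href of every absolute http:// link on the page
    html_page = urllib.request.urlopen(url)
    soup = BeautifulSoup(html_page, "html.parser")
    links = []

    for link in soup.find_all('a', attrs={'href': re.compile("^http://")}):
        links.append(link.get('href'))
    return links

def getStatus(url):
    # "Click" a link: request it and return the HTTP status code (an int).
    # urlopen raises HTTPError for 4xx/5xx responses, so read the code from the exception.
    try:
        return urllib.request.urlopen(url).getcode()
    except urllib.error.HTTPError as e:
        return e.code

anchors = getLinks("http://madisonmemorial.org/")
countMe = len(anchors)
for anchor in anchors:
    if getStatus(anchor) == 404:
        # Do stuff
        continue  # a missing page has no links to follow
    # Count the links on the child page and click each of them too
    children = getLinks(anchor)
    countMe += len(children)
    for child in children:
        if getStatus(child) == 404:
            pass  # Do some stuff

print(countMe)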

I've put my own logic inside the if statements.
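
For anyone who only needs the response codes, here is a minimal sketch that goes one step beyond the answer above. It is not the original poster's code: the names `fetch_status` and `check_links` are made up for illustration, and it assumes the target server answers HEAD requests (some servers reject them, in which case a plain GET is the fallback). It relies on the fact that `urlopen` raises `HTTPError` for 4xx/5xx responses and that the status code is an integer.

```python
import urllib.error
import urllib.request
from collections import Counter


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response arrived."""
    # A HEAD request asks for headers only, so the page body is not downloaded.
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises HTTPError for 4xx/5xx; the code is on the exception.
        # This clause must come first because HTTPError subclasses URLError.
        return error.code
    except urllib.error.URLError:
        # DNS failure, refused connection, and similar: no status to report.
        return None


def check_links(urls, fetch=fetch_status):
    """Map each distinct URL to its status code and tally the codes.

    The fetch parameter is a hypothetical hook so the function can be
    exercised without network access.
    """
    # dict.fromkeys drops duplicate URLs while keeping their original order.
    statuses = {url: fetch(url) for url in dict.fromkeys(urls)}
    return statuses, Counter(statuses.values())
```

With the list returned by getLinks, this could be called as `statuses, tally = check_links(anchors)`, after which `tally[404]` is the number of distinct links that came back as not found.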
