Parsing Robots.txt in Python


Question

I want to parse a robots.txt file in Python. I have explored robotParser and robotExclusionParser, but nothing really satisfies my criteria. I want to fetch all the disallowed and allowed URLs in a single shot rather than manually checking whether each URL is allowed. Is there any library to do this?
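For reference, the per-URL approach I am trying to avoid looks roughly like this (a minimal sketch with urllib.robotparser; the URL list is made up purely for illustration):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()

# Checking every URL one by one is exactly what I would like to avoid.
for u in ["http://example.com/", "http://example.com/private/page"]:
    print(u, rp.can_fetch("*", u))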

Answer

Why do you have to check your URLs manually? You can use urllib.robotparser in Python 3 and do something like this:

import urllib.robotparser as urobot
import urllib.request
import urllib.error
from bs4 import BeautifulSoup


url = "http://example.com"
rp = urobot.RobotFileParser()
rp.set_url(url + "/robots.txt")
rp.read()
if rp.can_fetch("*", url):
    site = urllib.request.urlopen(url)
    sauce = site.read()
    soup = BeautifulSoup(sauce, "html.parser")
    # base of the URL actually served (after any redirect), without the last path segment
    actual_url = site.geturl()[:site.geturl().rfind('/')]

    my_list = soup.find_all("a", href=True)
    for i in my_list:
        href = i["href"]
        # rather than != "#" you can filter your list before looping over it
        if href != "#":
            newurl = actual_url + "/" + href
            try:
                if rp.can_fetch("*", newurl):
                    site = urllib.request.urlopen(newurl)
                    # do what you want on each authorized webpage
            except urllib.error.URLError:
                pass
else:
    print("cannot scrape")
