Automating a Python Web Crawler - How to prevent raw_input all the time?


Problem description

I have been trying to create a Python web crawler that finds a web page, reads its list of links, returns the link at a pre-specified position, and repeats that a certain number of times (defined by the count variable). My issue is that I have not found a way to automate the process: I have to keep typing in the link that the code finds.

Here is my code. The first URL is http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Brenae.html, count_1 is equal to 7, and position is equal to 8:

import urllib
from bs4 import BeautifulSoup

count_1 = raw_input('Enter count: ')
position = raw_input('Enter position: ')
count = int(count_1)

while count > 0:
    list_of_tags = list()
    url = raw_input("Enter URL: ")  # this prompt runs on every pass of the loop
    fhand = urllib.urlopen(url).read()
    soup = BeautifulSoup(fhand, "lxml")
    tags = soup("a")
    for tag in tags:
        list_of_tags.append(tag.get("href", None))
    print list_of_tags[int(position)]
    count -= 1

All help is appreciated.
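
The prompt inside the loop is what forces the manual entry on every pass. The goal, which the accepted answer below also implements, is to ask for the starting URL once and then reassign url from the link the code just found. A minimal sketch in the question's own Python 2 / urllib / BeautifulSoup style, keeping its zero-based indexing (the start URL and loop shape come from the question; the rest is an assumption about the intended behavior):

import urllib
from bs4 import BeautifulSoup

url = raw_input("Enter starting URL: ")        # asked only once
count = int(raw_input('Enter count: '))
position = int(raw_input('Enter position: '))

while count > 0:
    fhand = urllib.urlopen(url).read()
    soup = BeautifulSoup(fhand, "lxml")
    list_of_tags = [tag.get("href", None) for tag in soup("a")]
    url = list_of_tags[position]               # follow the found link on the next pass
    print url
    count -= 1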

Solution

I've prepared some code with comments. Let me know if you have any doubts or further questions.

Here you go:

import requests
from lxml import html


def searchRecordInSpecificPosition(url, position):
    ## Making request to the specified URL
    response = requests.get(url)

    ## Parsing the DOM to a tree
    tree = html.fromstring(response.content)

    ## Creating a dict of links.
    links_dict = dict()

    ## Format of the dictionary:
    ##
    ##  {
    ##      1: {
    ##          'href': "http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Medina.html",
    ##          'text': "Medina"
    ##      },
    ##      
    ##      2: {
    ##          'href': "http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Chiara.html",
    ##          'text': "Chiara"
    ##      },
    ##  
    ##      ... and so on...
    ## }

    counter = 1

    ## For each <a> tag found, extract its text and link (href) and insert it into links_dict
    for link in tree.xpath('//ul/li/a'):
        href = link.xpath('.//@href')[0]
        text = link.xpath('.//text()')[0]
        links_dict[counter] = dict(href=href, text=text)
        counter += 1

    return links_dict[position]['text'], links_dict[position]['href']


times_to_search = int(raw_input("Enter the amount of times to search: "))
position = int(raw_input('Enter position: '))

count = 0

print ""

while count < times_to_search:
    if count == 0:
        name, url = searchRecordInSpecificPosition("http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Brenae.html", position)
    else:
        name, url = searchRecordInSpecificPosition(url, position)
    print "[*] Name: {}".format(name)
    print "[*] URL: {}".format(url)
    print ""
    count += 1

Sample output:

➜  python scraper.py
Enter the amount of times to search: 4
Enter position: 1

[*] Name: Medina
[*] URL: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Medina.html

[*] Name: Darrius
[*] URL: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Darrius.html

[*] Name: Caydence
[*] URL: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Caydence.html

[*] Name: Peaches
[*] URL: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Peaches.html

➜ 
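
A side note: both code samples above are Python 2 (raw_input and print statements). On Python 3, the same self-driving idea, where each call's returned URL is fed straight back in as the next request, might look like the following sketch; the function name link_at_position is made up here, and requests plus BeautifulSoup 4 are assumed to be installed:

import requests
from bs4 import BeautifulSoup

def link_at_position(url, position):
    # Fetch the page and return (text, href) of the link at the given 1-based position.
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "lxml")
    link = soup.select("ul li a")[position - 1]
    return link.get_text(), link["href"]

times_to_search = int(input("Enter the amount of times to search: "))
position = int(input("Enter position: "))

url = "http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Brenae.html"
for _ in range(times_to_search):
    name, url = link_at_position(url, position)  # feed the found URL into the next pass
    print("[*] Name: {}".format(name))
    print("[*] URL: {}".format(url))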
