Finding specific URLs from a list of URLs using Python


Question

I want to find whether specific links exist in a list of URLs by crawling through them. I have written the following program and it works, but I am stuck at 2 places.

  1. How can I read the links from a text file instead of hard-coding them in a list?
  2. The scraper takes almost 4 minutes to crawl 100 web pages.

Is there a way I can make it faster?

from bs4 import BeautifulSoup
import urllib2
import threading
import time

start = time.time()
#Links I want to find
urls_to_find = ["example.com/one", "example.com/two", "example.com/three"]

#Links I want to find the above links in...
url_list = ["example.com/1000", "example.com/1001", "example.com/1002",
            "example.com/1003", "example.com/1004"]

print_lock = threading.Lock()
#with open("links.txt") as f:
#    url_list = [url.strip() for url in f.readlines()]

def fetch_url(page_url):
    #Each thread crawls a single page; the original looped over the
    #whole url_list inside every worker, repeating all the work.
    with print_lock:
        print "Crawled " + page_url
    try:
        html_page = urllib2.urlopen(page_url)
        soup = BeautifulSoup(html_page)
        links = soup.findAll(href=True)
    except urllib2.HTTPError:
        return
    for link in links:
        href = link.get("href")
        for target in urls_to_find:
            if target in href:
                with print_lock:
                    print "Found " + target + " in " + page_url

threads = [threading.Thread(target=fetch_url, args=(page,)) for page in url_list]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print "Entire job took:", time.time() - start

Answer

If you want to read from a text file, use the code you commented out.
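A minimal sketch of that file-reading step (Python 3 here; `links.txt` with one URL per line is an assumption carried over from the commented-out code):

```python
def load_urls(path):
    # Read one URL per line, stripping whitespace and skipping blanks.
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]
```

The blank-line check avoids crawling empty strings if the file ends with a trailing newline.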

As for the "performance" problem: your code blocks at the read operation urlopen until the content from the website is returned. Ideally you want to run those requests in parallel, so you need a parallelized solution, for example using threads.
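A sketch of that thread-based parallelization using the standard library's `concurrent.futures` (a pool-based alternative to raw `threading`; `fetch` here is a self-contained stand-in for the real urlopen-and-parse step):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Placeholder for the real download + BeautifulSoup work;
    # it just echoes the URL so the example needs no network.
    return "Crawled " + url

def crawl_all(urls, max_workers=10):
    # Each URL is fetched on a worker thread; pool.map returns the
    # results in the same order as the input list.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

Because the workload is I/O-bound (waiting on sockets), threads overlap the waiting even under the GIL.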

Here's an example using a different approach, using gevent (non-standard).
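The original gevent example is not preserved here; a minimal sketch of the pattern might look like this (assumes gevent is installed; `fetch` is a stub for the real request so the sketch has no network dependency):

```python
import gevent

def fetch(url):
    # Stand-in for the real download; gevent.sleep(0) yields to
    # other greenlets the way real socket I/O would.
    gevent.sleep(0)
    return "Crawled " + url

url_list = ["example.com/1000", "example.com/1001", "example.com/1002"]
jobs = [gevent.spawn(fetch, u) for u in url_list]
gevent.joinall(jobs)
results = [job.value for job in jobs]
```

In real use one would also call `gevent.monkey.patch_all()` before importing the HTTP library, so its blocking socket calls become cooperative.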
