How to scrape multiple HTML pages in parallel with BeautifulSoup in Python?

Problem description

I'm building a web scraping app in Python with the Django web framework. I need to scrape the pages for multiple queries with the BeautifulSoup library. Here is a snippet of the code I have written:

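import requests
from bs4 import BeautifulSoup

# `websites` is the list of URLs to scrape; pages are fetched one after another
for url in websites:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")  # name the parser explicitly
    links = soup.find_all("a", {"class": "dev-link"})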

As written, the pages are scraped sequentially, and I want to scrape them in parallel. I don't know much about threading in Python. Can someone tell me how to scrape in parallel? Any help would be appreciated.

Recommended answer

When it comes to Python and scraping, Scrapy is probably the way to go.

Scrapy is built on Twisted (from Twisted Matrix Labs), an asynchronous networking library, so it issues requests concurrently and you don't have to worry about threading or the Python GIL.

If you must use BeautifulSoup, check out this library.
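Whichever library the answer had in mind, one way to keep `requests` and BeautifulSoup and still fetch pages concurrently is a thread pool from the standard library. The following is a sketch under assumptions: the function names, the 10-second timeout, and the worker count are illustrative choices, and `fetcher` is a hypothetical hook added only so the download step can be swapped out.

```python
from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup


def fetch(url):
    """Download one page and return its raw bytes (a blocking call)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.content


def extract_links(html):
    """Return the href of every <a> element carrying the dev-link class."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.get("href") for a in soup.find_all("a", class_="dev-link")]


def scrape(url, fetcher=fetch):
    """Fetch and parse a single URL; this runs inside a worker thread."""
    return url, extract_links(fetcher(url))


def scrape_all(urls, max_workers=8, fetcher=fetch):
    """Scrape all URLs concurrently and return a {url: [hrefs]} dict."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order and re-raises any
        # exception from a worker when its result is consumed.
        return dict(pool.map(lambda u: scrape(u, fetcher), urls))
```

With the question's variable, usage would look like `results = scrape_all(websites)`. Threads are a reasonable fit here because the work is dominated by waiting on the network, during which the GIL is released; the HTML parsing itself still runs one thread at a time.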
