How to generically crawl different websites using Python?


Question

I want to extract comments on any article from Dawn.com as well as from Tribune.com.

The way I'm extracting comments is to target the class <div class="comment__body cf"> on Dawn.com, and class="content" on Tribune.com.

How can I do it generically? There is no common pattern across these websites that would let this be achieved with one class.

Shall I write separate code for each website?

Answer

It is not so easy to write an algorithm that can generically grab the wanted content from a website, because, as you've mentioned, there is no common pattern here. One site can put its comments in an element with a class name like comments or site_comments, another can put them somewhere else under a completely different class name, and so on. So what I think is that you need to figure out the class names (or whatever selectors you want to use) for each website you want to scrape.
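One per-site approach is to keep a mapping from site to selector and share the rest of the scraping code. The sketch below assumes hypothetical URLs and selectors (based on the classes mentioned in the question, not verified against the live sites):

```python
from bs4 import BeautifulSoup

# hypothetical site -> CSS selector mapping; fill in whatever
# selector each site actually uses for its comments
SITE_SELECTORS = {
    "https://www.dawn.com": "div.comment__body",
    "https://tribune.com.pk": "div.content",
}

def extract_comments(html, selector):
    # parse the page and return the text of every element
    # matching the site's comment selector
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.select(selector)]
```

This way only the one-line selector entry differs per site, while fetching and parsing stay shared.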

Nevertheless, in your case, if you don't want to write separate code for each site, I think you can use BeautifulSoup's regex functionality.

For example, you can do something like this:

from bs4 import BeautifulSoup
import re
import requests

site_urls = [first_site, second_site]
for site in site_urls:
    # this is just an example and in real life situations
    # you should do some error checking
    site_content = requests.get(site).text
    soup = BeautifulSoup(site_content, 'html5lib')
    # this is the list of html tags with the current site's comments
    # and you can do whatever you want with them
    comments = soup.find_all(class_=re.compile("(comment)|(content)"))
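To see what the regex filter does without fetching anything, here is a small self-contained demonstration on made-up HTML: find_all matches any tag that has a class containing "comment" or "content" as a substring.

```python
import re
from bs4 import BeautifulSoup

# made-up markup imitating the two sites' comment containers
html = """
<div class="comment__body cf">From Dawn</div>
<div class="content">From Tribune</div>
<div class="sidebar">Not a comment</div>
"""
soup = BeautifulSoup(html, "html.parser")
# the regex is searched against each individual class value,
# so "comment__body" and "content" both match, "sidebar" does not
matches = soup.find_all(class_=re.compile("(comment)|(content)"))
texts = [m.get_text(strip=True) for m in matches]
```

Note that substring matching is loose: any unrelated class containing "content" (e.g. a page wrapper) would also be picked up, so you may need to tighten the pattern per site.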

BeautifulSoup has very nice documentation. You should check it out.

