Accessing all elements from main website page with Beautiful Soup


Question

I want to scrape news from this website:

https://www.bbc.com/news

As you can see, the website has categories such as Home, US Election, Coronavirus, etc.

For example, if I go to a specific news article such as https://www.bbc.com/news/election-us-2020-54912611,

I can write a scraper that gives me the headline. This is the code:

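import requests
from bs4 import BeautifulSoup

# headers was not defined in the original snippet; this User-Agent value is an assumption
headers = {"User-Agent": "Mozilla/5.0"}

response = requests.get("https://www.bbc.com/news/election-us-2020-54912611", headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

# select() returns a list of every <h1> found inside a <header>
title = soup.select("header h1")
print(title)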

There are hundreds of news articles on this website, so my question is: is there a way to access every news article on the website (in all categories) from the home page URL? On the home page I can't see all the news articles, only some of them. Is there a way to load the HTML for the whole website, so that I can easily get all the news headlines with:

soup.select("header h1")

Recommended answer

OK. After getting the headlines from a page, you can also collect the other links on that page. Then you open each of those links and fetch information from them in the same way. It can look like this:

visited = set()
headlines = []
links = [....]
while links:
    # take the next link first, then skip it if it was already processed
    link_for_fetch = links.pop()
    if link_for_fetch in visited:
        continue
    visited.add(link_for_fetch)
    content = get_contents(link_for_fetch)
    headlines += parse_headlines(content)
    links += parse_links(content)

This is just pseudocode; you can write it in any programming language. But parsing the whole site this way can take a lot of time :( and the site may detect your crawler as a bot and block your IP address.
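
As a rough illustration, the loop above could be turned into runnable Python along the following lines. This is only a sketch under several assumptions, not a definitive crawler. To keep it free of third-party installs, it swaps in the standard library's html.parser and urllib in place of Beautiful Soup and requests. It collects every h1 on a page instead of applying the "header h1" selector, and it only follows links whose URL begins with the start URL. The names PageParser, fetch and crawl, the User-Agent value, and the max_pages and delay parameters are all hypothetical choices made for this example.

```python
import time
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class PageParser(HTMLParser):
    """Collects href values of <a> tags and the text of <h1> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.headlines = []
        self._in_h1 = False
        self._h1_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "h1":
            self._in_h1 = True
            self._h1_parts = []

    def handle_data(self, data):
        if self._in_h1:
            self._h1_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_h1:
            self._in_h1 = False
            text = "".join(self._h1_parts).strip()
            if text:
                self.headlines.append(text)


def fetch(url):
    """Download one page as text (the User-Agent value is an assumption)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, get_contents=fetch, max_pages=50, delay=1.0):
    """Follow links that begin with start_url and gather <h1> headlines."""
    visited = set()
    headlines = []
    links = [start_url]
    # max_pages caps the number of pages fetched so the crawl always ends
    while links and len(visited) < max_pages:
        link_for_fetch = links.pop()
        if link_for_fetch in visited:
            continue
        visited.add(link_for_fetch)
        parser = PageParser()
        parser.feed(get_contents(link_for_fetch))
        headlines += parser.headlines
        for href in parser.links:
            # resolve relative links and drop any #fragment
            absolute = urljoin(link_for_fetch, href).split("#")[0]
            if absolute.startswith(start_url) and absolute not in visited:
                links.append(absolute)
        # pause between requests to reduce the chance of being blocked
        time.sleep(delay)
    return headlines
```

Calling crawl("https://www.bbc.com/news", max_pages=20) would then fetch at most 20 pages under that URL and return the headlines it found. The page limit and the delay are there because of the concerns mentioned above: crawling a whole site is slow, and rapid requests make a block more likely.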
