Finding the correct elements for scraping a website


Question


I am trying to scrape only certain articles from the ECB main page. To be more specific, I am trying to scrape only articles from the Media sub-page and from the sub-sub-pages Press releases, Governing Council decisions, Press conferences, Monetary policy accounts, Speeches and Interviews, and also just those which are in English.


I managed (based on some tutorials and other SE:overflow answers) to put together a code that scrapes absolutely everything from the website, because my original idea was to scrape everything first and only clean up the output later in a data frame, but the website includes so much that it always freezes after some time.

Getting the sub-links:

import requests
import re
from bs4 import BeautifulSoup
master_request = requests.get("https://www.ecb.europa.eu/")
base_url = "https://www.ecb.europa.eu"
master_soup = BeautifulSoup(master_request.content, 'html.parser')
master_atags = master_soup.find_all("a", href=True)
master_links = [ ] 
sub_links = {}
for master_atag in master_atags:
    master_href = master_atag.get('href')
    master_href = base_url + master_href
    print(master_href)
    master_links.append(master_href)
    sub_request = requests.get(master_href)
    sub_soup = BeautifulSoup(sub_request.content, 'html.parser')
    sub_atags = sub_soup.find_all("a", href=True)
    sub_links[master_href] = []
    for sub_atag in sub_atags:
        sub_href = sub_atag.get('href')
        sub_links[master_href].append(sub_href)
        print("\t"+sub_href)


Some things I tried were to change the base link to the sub-links - my idea was that maybe I could just do it separately for every sub-page and later put the links together, but that did not work. Another thing I tried was to replace the 17th line of the code above (the sub_atags assignment) with the following:

sub_atags = sub_soup.find_all("a", {'class': ['doc-title']}, href=True)


This seemed to partially solve my problem, because even though it did not get only links from the sub-pages, it at least ignored links that are not 'doc-title', which are all the links with text on the website; but it was still too much and some links were not retrieved correctly.

I also tried the following:

for master_atag in master_atags:
    master_href = master_atag.get('href')
    for href in master_href:
        master_href = [base_url + master_href if str(master_href).find(".en") in master_herf
    print(master_href)


I thought that, because all hrefs of English documents have .en somewhere in them, this would give me only the links where .en occurs somewhere in the href, but this code gives me a syntax error at print(master_href), which I don't understand because the previous print(master_href) worked.
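
For reference, the syntax error comes from the list comprehension that is never closed (and master_herf is a typo for master_href). A minimal sketch of the filtering this seems to aim for, keeping only hrefs that contain .en and reusing the variable names from the code above, could look like this:

english_links = []
for master_atag in master_atags:
    master_href = master_atag.get('href')
    if ".en" in master_href:
        # keep only links that point to English pages
        english_links.append(base_url + master_href)
        print(base_url + master_href)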


Next, I want to extract the following information from the sublinks. This part of the code works when I test it on a single link, but I never had the chance to try it on top of the code above since it won't finish running. Will this work once I manage to get the proper list of all links?

import pandas as pd

for link in sublinks:
    resp = requests.get(link)
    soup = BeautifulSoup(resp.content, 'html5lib')
    article = soup.find('article')
    title = soup.find('title')
    textdate = soup.find('h2')
    paragraphs = article.find_all('p')
    matches = re.findall(r'(\d{2}[\/ ](\d{2}|January|Jan|February|Feb|March|Mar|April|Apr|May|June|Jun|July|Jul|August|Aug|September|Sep|October|Oct|November|Nov|December|Dec)[\/ ]\d{2,4})', str(textdate))
    for match in matches:
        print(match[0])
        datadate = match[0]

ecbdf = pd.DataFrame({"Article": [article], "Title": [title], "Text": [paragraphs], "date": [datadate]})


Also, going back to the scraping: since the first approach with Beautiful Soup did not work for me, I also tried to approach the problem differently. The website has RSS feeds, so I wanted to use the following code:

import feedparser
from pandas.io.json import json_normalize
import pandas as pd
import requests
rss_url='https://www.ecb.europa.eu/home/html/rss.en.html'
ecb_feed = feedparser.parse(rss_url) 
df_ecb_feed=json_normalize(ecb_feed.entries)
df_ecb_feed.head()


Here I ran into the problem of not even being able to find the RSS feed URL in the first place. I tried the following: I viewed the page source and searched for "RSS", and tried all URLs that I could find that way, but I always get an empty data frame.
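
As a quick sanity check (a minimal sketch; the URL is the one from the code above, which appears to be the HTML page listing the feeds rather than a feed itself), feedparser exposes whether parsing failed and how many entries it found, which makes it easy to tell a real feed URL from an ordinary page:

import feedparser

rss_url = 'https://www.ecb.europa.eu/home/html/rss.en.html'
feed = feedparser.parse(rss_url)

# bozo is set to 1 when the document could not be parsed as a feed,
# and an empty entries list explains the empty data frame
print('bozo:', feed.bozo)
print('entries:', len(feed.entries))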


I am a beginner at web scraping and at this point I don't know how to proceed or how to approach this problem. In the end, what I want to accomplish is to collect all articles from the subpages with their titles, dates and authors, and put them into one data frame.

Answer


The biggest problem you have with scraping this site is probably the lazy loading: Using JavaScript, they load the articles from several html pages and merge them into the list. For details, look out for index_include in the source code. This is problematic for scraping with only requests and BeautifulSoup because what your soup instance gets from the request content is just the basic skeleton without the list of articles. Now you have two options:

  1. Instead of the main article list page (Press Releases, Interviews, etc.), use the lazy-loaded lists of articles, e.g. /press/pr/date/2019/html/index_include.en.html. This will probably be the easier option, but you have to do it for each year you're interested in (see the sketch after this list).
  2. Use a client that can execute JavaScript like Selenium to obtain the HTML instead of requests.
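
For option 1, a minimal sketch with plain requests (assuming the index_include fragments use the same span.doc-title markup that the Selenium example further down relies on) could look like this:

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.ecb.europa.eu'
# one lazy-loaded fragment per section and year, here the 2019 press releases
index_url = f'{base_url}/press/pr/date/2019/html/index_include.en.html'

resp = requests.get(index_url)
soup = BeautifulSoup(resp.content, 'html.parser')

# collect the article URLs listed in the fragment
article_urls = [base_url + a['href'] for a in soup.select('span.doc-title > a[href]')]
print(len(article_urls), article_urls[:3])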


Apart from that, I would suggest using CSS selectors for extracting information from the HTML code. This way, you only need a few lines for the article handling. Also, I don't think you have to filter for English articles if you use the index.en.html page for scraping, because it shows English by default and, additionally, other languages if available.


Here's an example I quickly put together; it can certainly be optimized, but it shows how to load the pages with Selenium and extract the article URLs and article contents:

from bs4 import BeautifulSoup
from selenium import webdriver

base_url = 'https://www.ecb.europa.eu'
urls = [
    f'{base_url}/press/pr/html/index.en.html',
    f'{base_url}/press/govcdec/html/index.en.html'
]
driver = webdriver.Chrome()

for url in urls:
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')

    for anchor in soup.select('span.doc-title > a[href]'):
        driver.get(f'{base_url}{anchor["href"]}')
        article_soup = BeautifulSoup(driver.page_source, 'html.parser')

        title = article_soup.select_one('h1.ecb-pressContentTitle').text
        date = article_soup.select_one('p.ecb-publicationDate').text
        paragraphs = article_soup.select('div.ecb-pressContent > article > p:not([class])')
        content = '\n\n'.join(p.text for p in paragraphs)

        print(f'title: {title}')
        print(f'date: {date}')
        print(f'content: {content[0:80]}...')


I get the following output for the Press Releases page:

title: ECB appoints Petra Senkovic as Director General Secretariat and Pedro Gustavo Teixeira as Director General Secretariat to the Supervisory Board                         
date: 20 December 2019                                    
content: The European Central Bank (ECB) today announced the appointments of Petra Senkov...

title: Monetary policy decisions                          
date: 12 December 2019                                    
content: At today’s meeting the Governing Council of the European Central Bank (ECB) deci...
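
Since the stated goal is one data frame, a minimal sketch that condenses the Selenium loop above and appends one row per article before building the frame could look like this (an author column could be added the same way for pages that provide one):

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

base_url = 'https://www.ecb.europa.eu'
urls = [f'{base_url}/press/pr/html/index.en.html']

driver = webdriver.Chrome()
rows = []

for url in urls:
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for anchor in soup.select('span.doc-title > a[href]'):
        driver.get(f'{base_url}{anchor["href"]}')
        article_soup = BeautifulSoup(driver.page_source, 'html.parser')
        paragraphs = article_soup.select('div.ecb-pressContent > article > p:not([class])')
        # one row per article, using the same selectors as in the example above
        rows.append({
            'title': article_soup.select_one('h1.ecb-pressContentTitle').text,
            'date': article_soup.select_one('p.ecb-publicationDate').text,
            'text': '\n\n'.join(p.text for p in paragraphs),
        })

ecb_df = pd.DataFrame(rows)
print(ecb_df.head())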
