CNN Scraper sporadically working in Python

Question

I've tried to create a web scraper for CNN. My goal is to scrape all news articles returned by a search query. Sometimes I get output for some of the scraped pages, and sometimes it doesn't work at all.

I am using the selenium and BeautifulSoup packages in a Jupyter Notebook. I iterate over the result pages via the URL parameters &page={}&from={}. Before that, I tried locating the next button at the end of the page with By.XPATH and simply clicking it, but that gave me the same results.

Here's the code I'm using:

#0 ------------import libraries
import requests
from bs4 import BeautifulSoup
from bs4.element import Tag
import feedparser
import urllib
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pickle
import pandas as pd

#3 ------------CNN SCRAPER
#3.1 ----------Define function
def CNN_Scraper(max_pages):
    base = "https://edition.cnn.com/"
    browser = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe')
    # implicitly_wait() returns None; it only sets the timeout used by find_element* calls
    browser.implicitly_wait(30)
    base_url = 'https://edition.cnn.com/search?q=coronavirus&sort=newest&category=business,us,politics,world,opinion,health&size=100'

    #-------------Define empty lists to be scraped
    CNN_title   = []
    CNN_date    = []
    CNN_article = []
    CNN_link    = []
    article_count = 0

    #-------------iterate over pages and extract
    for page in range(1, max_pages + 1):
        print("Page %d" % page)

        url = base_url + "&page=%d&from=%d" % (page, article_count)
        browser.get(url)
        soup = BeautifulSoup(browser.page_source, 'lxml')
        search_results = soup.find('div', {'class':'cnn-search__results-list'})
        contents = search_results.find_all('div', {'class':'cnn-search__result-contents'})

        for content in contents:
            try:
                title = content.find('h3').text
                print(title)
                link = content.find('a')
                link_url = link['href']

                date = content.find('div',{'class':'cnn-search__result-publish-date'}).text.strip()
                article = content.find('div',{'class':'cnn-search__result-body'}).text
            except (AttributeError, TypeError, KeyError):
                # a result block is missing one of the expected tags; skip it
                print("loser")
                continue
            CNN_title.append(title)
            CNN_date.append(date)
            CNN_article.append(article)
            CNN_link.append(link_url)

        article_count += 100
        print("-----")

    # close the browser before returning (a statement after return never runs)
    browser.quit()

    #-------------Save in DF
    df = pd.DataFrame()
    df['title'] = CNN_title
    df['date'] = CNN_date
    df['article'] = CNN_article
    df['link'] = CNN_link
    return df

#3.2 ----------Call Function - Scrape CNN and save pickled data
CNN_data = CNN_Scraper(2)
#CNN_data.to_pickle("CNN_data")

Recommended answer

Call the back-end API directly. For more details, check my previous answer.

import requests


def main(url):
    # reuse one HTTP session for all requests
    with requests.Session() as req:
        # "from" is the result offset: 1, 101, 201, ... in steps of size=100
        for item in range(1, 1000, 100):
            r = req.get(url.format(item)).json()
            for a in r['result']:
                print("Headline: {}, Url: {}".format(
                    a['headline'], a['url']))


main("https://search.api.cnn.io/content?q=coronavirus&sort=newest&category=business,us,politics,world,opinion,health&size=100&from={}")
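Since the question wanted the results collected rather than printed, here is a minimal sketch of how the same API call could gather rows for later use. It is only an illustration under stated assumptions: the helper names build_url, fetch_json and collect are hypothetical, the standard-library urllib is swapped in for requests so the block has no third-party dependencies, and only the result, headline and url fields shown in the answer are read. The offset starts at 1 to match the answer; whether the API expects that is not confirmed here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the answer above.
API = "https://search.api.cnn.io/content"


def build_url(query, start, size=100):
    """Build the search URL for one batch of results beginning at offset `start`."""
    params = {
        "q": query,
        "sort": "newest",
        "category": "business,us,politics,world,opinion,health",
        "size": size,
        "from": start,
    }
    # safe="," keeps the commas in the category list unescaped
    return API + "?" + urlencode(params, safe=",")


def fetch_json(url):
    """Download `url` and decode the JSON body (hypothetical helper)."""
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)


def collect(query, max_pages, size=100, fetch=fetch_json):
    """Collect headline/link rows from up to `max_pages` batches.

    `fetch` can be replaced with a stub, which makes the loop testable offline.
    """
    rows = []
    for page in range(max_pages):
        start = 1 + page * size  # 1, 101, 201, ... as in the answer
        data = fetch(build_url(query, start, size))
        results = data.get("result", [])
        if not results:
            break  # stop at the first empty batch
        for item in results:
            rows.append({"title": item.get("headline"), "link": item.get("url")})
    return rows
```

Calling collect("coronavirus", max_pages=2) would return a list of dicts, which pd.DataFrame(rows) turns into a table with title and link columns. Further columns such as a date or body would depend on field names in the API response that are not shown in the answer.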
