CNN Scraper sporadically working in python

Views: 58 Published: 2020/9/20 8:10:43 Tags: python selenium web-scraping beautifulsoup selenium-chromedriver

Problem description

I've tried to create a web scraper for CNN. My goal is to scrape all news articles returned by a search query. Sometimes I get output for some of the scraped pages, and sometimes it doesn't work at all.

I am using the selenium and BeautifulSoup packages in a Jupyter Notebook. I am iterating over the pages via the URL parameters &page={}&from={}. Before that I tried By.XPATH to simply click the next button at the end of the page, but it gave me the same results.
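
For reference, the paging scheme described above can be sketched on its own. This is only an illustration: the helper name build_search_url is made up, the category filter is left out for brevity, and it assumes that `from` is the zero-based offset of the first result on the page, which matches the values the code below passes (0 for page 1, 100 for page 2).

```python
from urllib.parse import urlencode

# Search endpoint taken from the question; everything else here is illustrative.
SEARCH_URL = "https://edition.cnn.com/search"


def build_search_url(query, page, size=100):
    """Return the search URL for a 1-based page number (hypothetical helper)."""
    params = {
        "q": query,
        "sort": "newest",
        "size": size,
        "page": page,
        # Assumption: `from` is the offset of the first result on this page.
        "from": (page - 1) * size,
    }
    return SEARCH_URL + "?" + urlencode(params)


for page in range(1, 3):
    print(build_search_url("coronavirus", page))
```

Deriving the offset from the page number keeps `page` and `from` in step, instead of tracking a separate counter.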

Here's the code I'm using:

#0 ------------import libraries
import requests
from bs4 import BeautifulSoup
from bs4.element import Tag
import feedparser
import urllib
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pickle
import pandas as pd

#3 ------------CNN SCRAPER
#3.1 ----------Define function
def CNN_Scraper(max_pages):
    base = "https://edition.cnn.com/"
    browser = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe')
    # implicitly_wait() returns None; it only sets the timeout used by
    # find_element calls and has no effect on reading browser.page_source
    browser.implicitly_wait(30)
    base_url = 'https://edition.cnn.com/search?q=coronavirus&sort=newest&category=business,us,politics,world,opinion,health&size=100'

    #-------------Define empty lists to be filled
    CNN_title = []
    CNN_date = []
    CNN_article = []
    CNN_link = []
    article_count = 0

    try:
        #-------------iterate over pages and extract
        for page in range(1, max_pages + 1):
            print("Page %d" % page)

            url = base_url + "&page=%d&from=%d" % (page, article_count)
            browser.get(url)
            soup = BeautifulSoup(browser.page_source, 'lxml')
            # soup.find() returns None if the results list is not in page_source
            search_results = soup.find('div', {'class': 'cnn-search__results-list'})
            contents = search_results.find_all('div', {'class': 'cnn-search__result-contents'})

            for content in contents:
                try:
                    title = content.find('h3').text
                    print(title)
                    link = content.find('a')
                    link_url = link['href']

                    date = content.find('div', {'class': 'cnn-search__result-publish-date'}).text.strip()
                    article = content.find('div', {'class': 'cnn-search__result-body'}).text
                except (AttributeError, TypeError, KeyError):
                    # this result is missing one of the expected elements
                    print("loser")
                    continue
                CNN_title.append(title)
                CNN_date.append(date)
                CNN_article.append(article)
                CNN_link.append(link_url)

            article_count += 100
            print("-----")
    finally:
        # always close the browser, even if a page fails to parse
        browser.quit()

    #-------------Save in DF
    df = pd.DataFrame()
    df['title'] = CNN_title
    df['date'] = CNN_date
    df['article'] = CNN_article
    df['link'] = CNN_link
    return df

#3.2 ----------Call Function - Scrape CNN and save pickled data
CNN_data = CNN_Scraper(2)
#CNN_data.to_pickle("CNN_data")
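
One plausible cause of the sporadic behaviour, assuming the search results are filled in by JavaScript after the initial page load, is that browser.page_source is read before the results exist. The sketch below polls until a marker string shows up in the page source. The helpers wait_until and results_present and the FakeBrowser stand-in are hypothetical names used only for this illustration; FakeBrowser simulates a page whose results appear on the third read, so the example runs without a real browser.

```python
import time


def wait_until(predicate, timeout=30.0, interval=0.5):
    """Call predicate() until it is truthy or `timeout` seconds have passed.

    Returns the truthy value; raises TimeoutError otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)


def results_present(page_source, marker="cnn-search__result-contents"):
    """True once at least one search result container is in the HTML."""
    return marker in page_source


class FakeBrowser:
    """Stand-in for a webdriver: the results appear on the third read."""

    def __init__(self):
        self.reads = 0

    @property
    def page_source(self):
        self.reads += 1
        if self.reads < 3:
            return "<div class='cnn-search__results-list'></div>"
        return "<div class='cnn-search__result-contents'>...</div>"


fake = FakeBrowser()
wait_until(lambda: results_present(fake.page_source), timeout=5, interval=0.01)
print("results found after %d reads" % fake.reads)
```

In the scraper, the same call would sit between browser.get(url) and the BeautifulSoup parse, with the real browser in place of fake. Selenium ships the same idea as WebDriverWait: WebDriverWait(browser, 30).until(EC.presence_of_element_located((By.CLASS_NAME, 'cnn-search__result-contents'))) uses only names already imported in the code above.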
