Beautiful Soup - Unable to create csv and text files after scraping


Problem description

I am trying to extract the URLs of articles from all the pages of a website. Only the URLs on the first page are repeatedly scraped and stored in the CSV file. The information from those links is then scraped in the same way and stored in a text file.

I need some help with this issue.

import requests
from bs4 import BeautifulSoup
import csv
import lxml
import urllib2

base_url = 'https://www.marketingweek.com/?s=big+data'
response = requests.get(base_url)
soup = BeautifulSoup(response.content, "lxml")

res = []

while 1:
    search_results = soup.find('div', class_='archive-constraint') #localizing search window with article links
    article_link_tags = search_results.findAll('a') #ordinary scheme goes further 
    res.append([url['href'] for url in article_link_tags])
    #Automatically clicks next button to load other articles
    next_button = soup.find('a', text='>>')
    #Searches for articles till Next button is not found
    if not next_button:
        break
    res.append([url['href'] for url in article_link_tags])
    soup = BeautifulSoup(response.text, "lxml")
    for i in res:
        for j in i:
            print(j)
####Storing scraped links in csv file###

with open('StoreUrl1.csv', 'w+') as f:
    f.seek(0)
    for i in res:
        for j in i:
            f.write('\n'.join(i))


#######Extracting info from URLs########

with open('StoreUrl1.csv', 'rb') as f1:
    f1.seek(0)
    reader = csv.reader(f1)

    for line in reader:
        url = line[0]       
        soup = BeautifulSoup(urllib2.urlopen(url), "lxml")

        with open('InfoOutput1.txt', 'a+') as f2:
            for tag in soup.find_all('p'):
                f2.write(tag.text.encode('utf-8') + '\n')

Recommended answer

A solution using the lxml html parser.

There are 361 pages and each page has 12 links. We can iterate over every page and extract the links using XPath.

XPath helps in getting:

  • the text under a particular tag
  • the value of a particular tag's attribute (here: the value of the 'href' attribute of an 'a' tag); see the short sketch after this list
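
For illustration, here is a minimal sketch of both cases against a single results page. The URL pattern and class names are taken from the full script below; the exact markup of the site is an assumption.

from lxml import html
import requests

#fetch one results page (page 1 used purely for illustration)
response = requests.get('https://www.marketingweek.com/page/1/?s=big+data')
tree = html.fromstring(response.content)

#1) text under a particular tag: the visible title of each article link
titles = tree.xpath('//div[@class = "archive-constraint"]//h2[@class = "hentry-title entry-title"]/a/text()')

#2) value of a particular attribute: the 'href' of the same 'a' tags
links = tree.xpath('//div[@class = "archive-constraint"]//h2[@class = "hentry-title entry-title"]/a/@href')

print(titles[:3])
print(links[:3])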

import csv
from lxml import html
from time import sleep
import requests
from random import randint

outputFile = open("All_links.csv", 'wb')
fileWriter = csv.writer(outputFile)

fileWriter.writerow(["Sl. No.", "Page Number", "Link"])

url1 = 'https://www.marketingweek.com/page/'
url2 = '/?s=big+data'

sl_no = 1

#iterating from 1st page through 361th page
for i in xrange(1, 362):

    #generating final url to be scraped using page number
    url = url1 + str(i) + url2

    #Fetching page
    response = requests.get(url)
    sleep(randint(10, 20))
    #using html parser
    htmlContent = html.fromstring(response.content)

    #Capturing all 'a' tags under h2 tag with class 'hentry-title entry-title'
    page_links = htmlContent.xpath('//div[@class = "archive-constraint"]//h2[@class = "hentry-title entry-title"]/a/@href')
    for page_link in page_links:
        fileWriter.writerow([sl_no, i, page_link])
        sl_no += 1

#close the output file so all rows are flushed to disk
outputFile.close()
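
The script above only builds All_links.csv. For the second half of the question (saving the article paragraphs to a text file), a minimal sketch in the same spirit as the original code could look like this; it assumes the CSV has been written and closed, and it swaps urllib2 for requests:

import csv
import requests
from bs4 import BeautifulSoup

with open('All_links.csv', 'rb') as f1, open('InfoOutput1.txt', 'a+') as f2:
    reader = csv.reader(f1)
    next(reader)  #skip the "Sl. No.", "Page Number", "Link" header row
    for row in reader:
        url = row[2]  #the article link is in the third column
        soup = BeautifulSoup(requests.get(url).content, "lxml")
        #write every paragraph of the article to the text file
        for tag in soup.find_all('p'):
            f2.write(tag.text.encode('utf-8') + '\n')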
