Beautiful Soup - Unable to create CSV and text files after scraping
Problem Description
I am trying to extract the URLs of articles from all the pages of a website. Only the URLs on the first page are repeatedly scraped and stored in the CSV file. The information from these links is then scraped the same way and stored in the text file.
Need some help with this issue.
import requests
from bs4 import BeautifulSoup
import csv
import lxml
import urllib2

base_url = 'https://www.marketingweek.com/?s=big+data'
response = requests.get(base_url)
soup = BeautifulSoup(response.content, "lxml")

res = []

while 1:
    search_results = soup.find('div', class_='archive-constraint')  # localizing search window with article links
    article_link_tags = search_results.findAll('a')  # ordinary scheme goes further
    res.append([url['href'] for url in article_link_tags])
    # Automatically clicks next button to load other articles
    next_button = soup.find('a', text='>>')
    # Searches for articles till Next button is not found
    if not next_button:
        break
    res.append([url['href'] for url in article_link_tags])
    soup = BeautifulSoup(response.text, "lxml")

for i in res:
    for j in i:
        print(j)

#### Storing scraped links in csv file ###
with open('StoreUrl1.csv', 'w+') as f:
    f.seek(0)
    for i in res:
        for j in i:
            f.write('\n'.join(i))

####### Extracting info from URLs ########
with open('StoreUrl1.csv', 'rb') as f1:
    f1.seek(0)
    reader = csv.reader(f1)
    for line in reader:
        url = line[0]
        soup = BeautifulSoup(urllib2.urlopen(url), "lxml")
        with open('InfoOutput1.txt', 'a+') as f2:
            for tag in soup.find_all('p'):
                f2.write(tag.text.encode('utf-8') + '\n')
Recommended Answer
A solution using the lxml HTML parser.
There are 361 pages, and each page has 12 links. We can iterate over each page and extract the links using XPath.
XPath helps in getting:

- the text under a particular tag
- the value of a particular attribute (here: the value of the 'href' attribute of the 'a' tag)
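Both kinds of query can be tried on a small standalone snippet first. The markup below is a made-up illustration mimicking the structure the XPath expression targets, not the live site's actual HTML:

```python
from lxml import html

# Made-up snippet mirroring the div/h2/a structure the answer's XPath targets
snippet = """
<div class="archive-constraint">
  <h2 class="hentry-title entry-title"><a href="https://example.com/article-1">Article 1</a></h2>
  <h2 class="hentry-title entry-title"><a href="https://example.com/article-2">Article 2</a></h2>
</div>
"""
tree = html.fromstring(snippet)

# Text under a particular tag
titles = tree.xpath('//h2[@class = "hentry-title entry-title"]/a/text()')

# Value of the 'href' attribute of the 'a' tag
links = tree.xpath('//div[@class = "archive-constraint"]//h2[@class = "hentry-title entry-title"]/a/@href')

print(titles)  # ['Article 1', 'Article 2']
print(links)   # ['https://example.com/article-1', 'https://example.com/article-2']
```

The `/text()` step selects text nodes, while the `/@href` step selects the attribute value; both return plain Python strings in a list.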
import csv
from lxml import html
from time import sleep
import requests
from random import randint

outputFile = open("All_links.csv", 'wb')
fileWriter = csv.writer(outputFile)
fileWriter.writerow(["Sl. No.", "Page Number", "Link"])

url1 = 'https://www.marketingweek.com/page/'
url2 = '/?s=big+data'

sl_no = 1

# iterating from the 1st page through the 361st page
for i in xrange(1, 362):
    # generating the final url to be scraped using the page number
    url = url1 + str(i) + url2
    # Fetching the page
    response = requests.get(url)
    sleep(randint(10, 20))
    # using the html parser
    htmlContent = html.fromstring(response.content)
    # Capturing all 'a' tags under h2 tags with class 'hentry-title entry-title'
    page_links = htmlContent.xpath('//div[@class = "archive-constraint"]//h2[@class = "hentry-title entry-title"]/a/@href')
    for page_link in page_links:
        fileWriter.writerow([sl_no, i, page_link])
        sl_no += 1

outputFile.close()
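The script above covers only the first stage (collecting the links). For the second stage of the original task, reading the CSV back and saving each article's paragraph text, a minimal Python 3 sketch might look like the following; `extract_paragraphs` and `save_article_text` are hypothetical helper names, and the loop assumes the three-column `All_links.csv` layout written above:

```python
import csv
import requests
from bs4 import BeautifulSoup

def extract_paragraphs(page_html):
    """Return the text of every <p> tag in the given HTML string."""
    soup = BeautifulSoup(page_html, "html.parser")
    return [tag.get_text() for tag in soup.find_all('p')]

def save_article_text(csv_path="All_links.csv", out_path="InfoOutput1.txt"):
    """Fetch every link listed in the CSV and append its paragraph text to a file."""
    with open(csv_path, newline='', encoding='utf-8') as f_in, \
         open(out_path, 'a', encoding='utf-8') as f_out:
        reader = csv.reader(f_in)
        next(reader)  # skip the "Sl. No., Page Number, Link" header row
        for _sl_no, _page, link in reader:
            response = requests.get(link)
            for paragraph in extract_paragraphs(response.text):
                f_out.write(paragraph + '\n')
```

Opening the output file once, outside the loop, avoids the repeated open/close of the original code; text mode with an explicit UTF-8 encoding also removes the need for the manual `.encode('utf-8')` calls.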