Why does this code generate multiple files? I want 1 file with all entries in it
Problem description
I'm trying to work with both BeautifulSoup and XPath, using the following code, but now I'm getting one file per URL instead of one file for all the URLs, as before.
I just moved the CSV reading over to get the list of URLs, and also added the parsing of the URL and response. But when I run this now I get a lot of individual files, and in some cases one file may actually contain data from two scraped pages. Do I need to move my file saving out (change the indentation)?
import scrapy
import requests
from DSG2.items import Dsg2Item
from bs4 import BeautifulSoup
import time
import datetime
import csv

class DsgSpider(scrapy.Spider):
    name = "dsg"

    def start_requests(self):
        urlLinks = []
        with open('dsgLinks.csv', 'r') as csvf:
            urls = csv.reader(csvf)
            for urlLink in urls:
                urlLinks.append(urlLink)
        for url in urlLinks:
            yield scrapy.Request(url=url[0], callback=self.parse)

    def parse(self, response):
        dets = Dsg2Item()
        now = time.mktime(datetime.datetime.now().timetuple())
        r = requests.get(response.url, timeout=5)
        html = r.text
        soup = BeautifulSoup(html, "html.parser")
        dets['style'] = " STYLE GOES HERE "
        dets['brand'] = " BRAND GOES HERE "
        dets['description'] = " DESCRIPTION GOES HERE "
        dets['price'] = " PRICE GOES HERE "
        dets['compurl'] = response.url  # response.url is already a string; indexing [0] would keep only its first character
        dets['reviewcount'] = " REVIEW COUNT GOES HERE "
        dets['reviewrating'] = " RATING COUNT GOES HERE "
        dets['model'] = " MODEL GOES HERE "
        dets['spechandle'] = " HANDLE GOES HERE "
        dets['specbladelength'] = " BLADE LENGTH GOES HERE "
        dets['specoveralllength'] = " OVERALL LENGTH GOES HERE "
        dets['specweight'] = " WEIGHT GOES HERE "
        dets['packsize'] = " PACKSIZE GOES HERE "
        for h1items in soup.find_all('h1', class_="product-title"):
            strh1item = str(h1items.get_text())
            dets['description'] = strh1item.lstrip()
        for divitems in soup.find_all('div', class_="product-component"):
            for ulitems in divitems.find_all('ul'):
                for litem in ulitems.find_all('li'):
                    strlitem = str(litem.get_text())
                    if 'Model:' in strlitem:
                        bidx = strlitem.index(':') + 1
                        lidx = len(strlitem)
                        dets['model'] = strlitem[bidx:lidx].lstrip()
                    elif 'Handle:' in strlitem:
                        bidx = strlitem.index(':') + 1
                        lidx = len(strlitem)
                        dets['spechandle'] = strlitem[bidx:lidx].lstrip()
                    elif 'Blade Length:' in strlitem:
                        bidx = strlitem.index(':') + 1
                        lidx = len(strlitem)
                        dets['specbladelength'] = strlitem[bidx:lidx].lstrip()
                    elif 'Overall Length:' in strlitem:
                        bidx = strlitem.index(':') + 1
                        lidx = len(strlitem)
                        dets['specoveralllength'] = strlitem[bidx:lidx].lstrip()
                    elif 'Weight:' in strlitem:
                        bidx = strlitem.index(':') + 1
                        lidx = len(strlitem)
                        dets['specweight'] = strlitem[bidx:lidx].lstrip()
                    elif 'Pack Qty:' in strlitem:
                        bidx = strlitem.index(':') + 1
                        lidx = len(strlitem)
                        dets['packsize'] = strlitem[bidx:lidx].lstrip()
        for litems in soup.find_all('ul', class_="prod-attr-list"):
            for litem in litems.find_all('li'):
                strlitem = str(litem.get_text())
                if 'Style:' in strlitem:
                    bidx = strlitem.index(':') + 1
                    lidx = len(strlitem)
                    dets['style'] = strlitem[bidx:lidx].lstrip()
                elif 'Brand:' in strlitem:
                    bidx = strlitem.index(':') + 1
                    lidx = len(strlitem)
                    dets['brand'] = strlitem[bidx:lidx].lstrip()
        for divitems in soup.find_all('div', class_="outofstock-label"):
            dets['price'] = divitems.text
        for spanitems in soup.find_all('span', class_="final-price"):
            for spanitem in spanitems.find_all('span', itemprop="price"):
                strspanitem = str(spanitem.get_text())
                dets['price'] = '${:,.2f}'.format(float(strspanitem.lstrip()))
        for divitems in soup.find_all('div', id="BVRRSummaryContainer"):
            for spanitem in divitems.find_all('span', class_="bvseo-reviewCount"):
                strspanitem = str(spanitem.get_text())
                dets['reviewcount'] = strspanitem.lstrip()
            for spanitem in divitems.find_all('span', class_="bvseo-ratingValue"):
                strspanitem = str(spanitem.get_text())
                dets['reviewrating'] = strspanitem.lstrip()
        filename = 'dsg-%s.csv' % str(int(now))
        locallog = open(filename, 'a+')
        locallog.write(','.join(map(str, dets.values())) + "\n")
        locallog.close()
I'd like to fix this code so that it saves all the scraped data into one file, as it did originally.
Recommended answer
You create a new timestamped filename on every call to parse(), because `now` is computed inside that method:
filename = 'dsg-%s.csv' % str(int(now))
Simply replace it with:
filename = 'dsg.csv'
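Since the file is opened in append mode ('a+'), a fixed filename means every parse() call adds its row to the same file rather than creating a fresh timestamped one. A minimal sketch of that append pattern (the helper name and sample values are hypothetical, standing in for your scraped items):

```python
import csv

def append_item(filename, values):
    # A fixed filename in append mode: each call adds one row to the
    # same file instead of creating a new per-timestamp file.
    with open(filename, 'a', newline='') as f:
        csv.writer(f).writerow(values)

# Two hypothetical items, as if two pages were scraped.
append_item('dsg.csv', ['styleA', 'brandA', '$9.99'])
append_item('dsg.csv', ['styleB', 'brandB', '$4.50'])
```

Using csv.writer here (rather than a hand-rolled ','.join) also quotes any field that itself contains a comma, which a product description easily might.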