My job scrapes only the last page instead of all of them
Question
My scraping job only seems to write the last page of the website to CSV. I assume this is because it loops through all the pages and only then writes to the CSV. It does scrape the elements and print them in the console. Do you have to loop through and write to the CSV for each page straight away, since it cannot store the data? I have tried adjusting my code to accommodate this, but I can't seem to get it to work.
Thanks.
I have also tried a different method, but the same thing appears to be happening: https://www.pastebin.ca/3863340
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from random import shuffle
import csv
import time

driver = webdriver.Chrome()
driver.set_window_size(1024, 600)
driver.maximize_window()
driver.get('https://www.bookmaker.com.au/sports/soccer/')

SCROLL_PAUSE_TIME = 0.5
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(SCROLL_PAUSE_TIME)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

time.sleep(1)
elements = driver.find_elements_by_css_selector(".market-match:nth-child(2) .market-group a , .market-match:nth-child(1) .market-group a")
elem_href1 = [element.get_attribute("href") for element in elements]
print(elem_href1)
print(len(elem_href1))
shuffle(elem_href1)

for link in elem_href1:
    driver.get(link)
    ...
    time.sleep(2)

    # link
    elems = driver.find_elements_by_css_selector("h3 a[Href*='/sports/soccer']")
    elem_href = []
    for elem in elems:
        print(elem.get_attribute("href"))
        elem_href.append(elem.get_attribute("href"))

    # TEAM
    langs = driver.find_elements_by_css_selector(".row:nth-child(1) td:nth-child(1)")
    langs_text = []
    for lang in langs:
        print(lang.text)
        langs_text.append(lang.text)

    # odds
    langs1 = driver.find_elements_by_css_selector("a.odds.quickbet")
    langs1_text = []
    for lang in langs1:
        print(lang.text)
        langs1_text.append(lang.text)

    with open('vtg12.csv', 'a', newline='') as outfile:
        writer = csv.writer(outfile)
        for row in zip(langs1_text, langs_text, elem_href):
            writer.writerow(row)
Answer
The problem is that you are overwriting the CSV on every single iteration, so only the last record remains when the script ends.
Change
with open('vtg12.csv', 'a', newline='') as outfile:
    writer = csv.writer(outfile)
    for row in zip(langs1_text, langs_text, elem_href):
        writer.writerow(row)
to
with open('vtg12.csv', 'a+', newline='') as outfile:
    writer = csv.writer(outfile)
    for row in zip(langs1_text, langs_text, elem_href):
        writer.writerow(row)
'a+' will open the file in append mode.
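To see why the mode matters, here is a minimal self-contained sketch (the file name 'demo.csv' is illustrative) contrasting 'w', which truncates the file on every open, with 'a', which appends; 'a+' behaves like 'a' for writing but additionally allows reading:

```python
import csv
import os

path = 'demo.csv'
if os.path.exists(path):
    os.remove(path)

# Mode 'w' truncates on each open: only the last row survives.
for row in (['first'], ['second']):
    with open(path, 'w', newline='') as f:
        csv.writer(f).writerow(row)
with open(path, newline='') as f:
    print(list(csv.reader(f)))  # [['second']]

os.remove(path)

# Mode 'a' appends on each open: every row survives.
for row in (['first'], ['second']):
    with open(path, 'a', newline='') as f:
        csv.writer(f).writerow(row)
with open(path, newline='') as f:
    print(list(csv.reader(f)))  # [['first'], ['second']]
```

This mirrors the scraping loop: opening in a truncating mode inside the per-page loop keeps only the last page's rows, while an append mode accumulates every page.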