My job scrapes only the last page instead of all of them


Problem Description


My scraping job only seems to write the last page of the website to the CSV. I assume this is because it loops through all the pages and only then writes to the CSV. It does scrape the elements and print them in the console. Do you have to loop through and write to the CSV for each page straight away, since it cannot store the data? I have tried adjusting my code to accommodate this, but I can't seem to get it to work.

Thanks.


I have also tried a different method, but the same thing appears to happen: https://www.pastebin.ca/3863340

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from random import shuffle
import csv
import requests
import time

driver = webdriver.Chrome()
driver.set_window_size(1024, 600)
driver.maximize_window()

driver.get('https://www.bookmaker.com.au/sports/soccer/')

SCROLL_PAUSE_TIME = 0.5

# Scroll to the bottom repeatedly until the page height stops changing,
# so that all lazily loaded matches are rendered before scraping.
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(SCROLL_PAUSE_TIME)

    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

time.sleep(1)

# Collect the event links from the first two market groups of each match.
elements = driver.find_elements_by_css_selector(".market-match:nth-child(2) .market-group a , .market-match:nth-child(1) .market-group a")
elem_href1 = [element.get_attribute("href") for element in elements]
print(elem_href1)
print(len(elem_href1))
shuffle(elem_href1)
for link in elem_href1:
    driver.get(link)
    ...
    time.sleep(2)

    # link: collect each event's URL on this page
    elems = driver.find_elements_by_css_selector("h3 a[href*='/sports/soccer']")
    elem_href = []
    for elem in elems:
        print(elem.get_attribute("href"))
        elem_href.append(elem.get_attribute("href"))

    # TEAM
    langs = driver.find_elements_by_css_selector(".row:nth-child(1) td:nth-child(1)")
    langs_text = []

    for lang in langs:
        print(lang.text)
        langs_text.append(lang.text)

    time.sleep(0)

    # odds
    langs1 = driver.find_elements_by_css_selector("a.odds.quickbet")
    langs1_text = []

    for lang in langs1:
        print(lang.text)
        langs1_text.append(lang.text)

    time.sleep(0)

    # The file is re-opened for every page; zip() stops at the shortest list.
    with open('vtg12.csv', 'w', newline='') as outfile:
        writer = csv.writer(outfile)
        for row in zip(langs1_text, langs_text, elem_href):
            writer.writerow(row)

Recommended Answer


The problem is that you are overwriting the CSV on every iteration, and hence only the last page's records remain when the script ends.

Change

with open('vtg12.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    for row in zip(langs1_text, langs_text, elem_href):
        writer.writerow(row)

to

with open('vtg12.csv', 'a+', newline='') as outfile:
    writer = csv.writer(outfile)
    for row in zip(langs1_text, langs_text, elem_href):
        writer.writerow(row)


a+ will open the file in append mode, so each page's rows are added to the end of the file instead of replacing what is already there.
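To see the difference in isolation, here is a minimal sketch (demo.csv and the page names are made up for illustration): 'w' truncates the file every time it is opened, while 'a' or 'a+' keeps the existing rows and appends.

import csv

pages = ['page1', 'page2', 'page3']

# 'w' truncates on every open, so after this loop demo.csv contains
# only the row for page3.
for page in pages:
    with open('demo.csv', 'w', newline='') as f:
        csv.writer(f).writerow([page])

# 'a' (or 'a+') appends, so after this loop demo.csv ends with one row
# for each of page1, page2 and page3.
for page in pages:
    with open('demo.csv', 'a', newline='') as f:
        csv.writer(f).writerow([page])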

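An alternative, not part of the answer above but a common pattern, is to open the CSV once before the page loop and reuse a single writer; each run then starts from a fresh file and nothing is appended to leftovers from a previous run. A rough sketch, where links and scrape_page are hypothetical stand-ins for elem_href1 and the per-page scraping in the question:

import csv

# Hypothetical stand-ins for the question's link list and per-page scraper.
links = ['https://example.com/event1', 'https://example.com/event2']

def scrape_page(link):
    # placeholder: would return (odds, team, href) tuples for one page
    return [('2.50', 'Team A', link)]

# Open once with 'w' so each run starts clean, then append rows page by page.
with open('vtg12.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(['odds', 'team', 'link'])  # header written once
    for link in links:
        for row in scrape_page(link):
            writer.writerow(row)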