How do I make my web scraping script more robust?

Problem description

I wrote some code to scrape the Santander website.

Scraping seems to work, except that I get incorrect results, and when I run the code twice in a row the results change.

How could I make the scraping more robust? The strange thing is that when I run the code and check the results one by one, it seems to work well.

def hw_santander_scrap(Amount, Duration):
  from datetime import datetime as DT              # was missing
  import pandas as pd                              # was missing
  from selenium import webdriver
  from selenium.webdriver.common.by import By      # was missing
  from selenium.webdriver.common.keys import Keys  # was missing
  from selenium.webdriver.support.ui import WebDriverWait
  from selenium.webdriver.support import expected_conditions as EC
  chrome_options = webdriver.ChromeOptions()
  chrome_options.add_argument('--headless')
  chrome_options.add_argument('--no-sandbox')
  chrome_options.add_argument('--disable-dev-shm-usage')
  chrome_options.add_argument('--start-maximized')
  chrome_options.add_argument('window-size=10000x5000')
  # keep the driver under its own name; the original rebound `webdriver`,
  # shadowing the module it had just imported
  driver = webdriver.Chrome('chromedriver', chrome_options = chrome_options)

  maintenant = DT.now()
  period = str(maintenant.day) + '_' + str(maintenant.month) + '_' + str(maintenant.year)
  print('Start Scraping')

  ################################################ Santander ###############################################

  Santander = pd.DataFrame({
      'Project': "reforma vivienda",
      'Period': period,
      'Monthly repayment': [0],
      'TIN': [0],
      'TAE': [0],
      'Total repayment': [0],
      'Initial amount': [0],
      'Duration': [0]
  })

  # `project` had the exact same columns and initial values as `Santander`
  project = Santander.copy()
  url = 'https://simuladores.bancosantander.es/SantanderES/loansimulatorweb.aspx?por=webpublica&prv=publico&m=300&cta=1&ls=0#/t0'
  driver.get(url)

  # amounts use a dot as a thousands separator: 90.000 stands for 90 000 EUR
  Max_amount = 90.000
  Min_amount = 3.000
  Max_duration = 96
  Min_duration = 12
  for i in range(len(Amount)):
    Simulated_amount = Amount[i]
    if Simulated_amount > Max_amount or Simulated_amount < Min_amount:
      continue
    amount = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "#amount")))
    amount.clear()
    amount.send_keys("{:.3f}".format(Simulated_amount))
    WebDriverWait(driver, 30).until(lambda d: d.execute_script('return jQuery.active') == 0)
    for j in range(len(Duration)):
      Simulated_duration = int(Duration[j])
      if Simulated_duration > Max_duration or Simulated_duration < Min_duration:
        continue
      # everything below belongs inside the in-range branch; in the original it
      # was dedented one level too far, so it also ran for out-of-range durations
      term = WebDriverWait(driver, 10).until(
          EC.presence_of_element_located((By.CSS_SELECTOR, "#term")))
      term.clear()
      term.send_keys("{}".format(Simulated_duration))
      term.send_keys(Keys.TAB)
      # wait for the simulator's AJAX call to finish before reading the results,
      # otherwise the previous simulation's figures get scraped
      WebDriverWait(driver, 30).until(lambda d: d.execute_script('return jQuery.active') == 0)
      driver.save_screenshot('screenshot_santander.png')
      project.loc[j, 'Project'] = "reforma vivienda"
      project.loc[j, 'Initial amount'] = float("{:.3f}".format(Amount[i]).replace('.', ''))
      project.loc[j, 'Duration'] = Simulated_duration
      project.loc[j, 'Period'] = str(maintenant.day) + '/' + str(maintenant.month) + '/' + str(maintenant.year)
      project.loc[j, 'Monthly repayment'] = driver.find_element_by_css_selector('.r1 span').text.replace(' €', '').replace(',', '.')
      project.loc[j, 'TIN'] = float(driver.find_element_by_css_selector('.r3 span').text[6: 10].replace(',', '.'))
      project.loc[j, 'TAE'] = float(driver.find_element_by_css_selector('.r3 span').text[13: 17].replace(',', '.'))
      project.loc[j, 'Total repayment'] = float(driver.find_element_by_css_selector('.r7 span').text.replace(' €', '').replace('.', '').replace(',', '.'))
    Santander = pd.concat([Santander, project], ignore_index = True)  # DataFrame.append is deprecated
  Santander = Santander.loc[Santander.TIN != 0, :]
  Santander.to_csv('Santander_{}.csv'.format(period), index = False)
  print('End Scraping')
  driver.quit()

Running the code:

Amount = [13.000, 14.000, 15.000, 30.000, 45.000, 60.000]
Duration = [12, 15, 24, 36, 48, 60, 72, 84, 96]
hw_santander_scrap(Amount, Duration)
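The changing results are a classic symptom of reading the result elements before the simulator has recomputed them. Besides waiting on `jQuery.active`, a complementary guard is to capture the displayed value before submitting new inputs and wait until it actually changes. The following is a minimal sketch of such a predicate; `driver`, the `.r1 span` selector, and the `find_element_by_css_selector` call are taken from the question's code, and the commented usage is an assumption about where it would slot into the loop:

```python
def text_changed(driver, css_selector, old_text):
    """Predicate for WebDriverWait: True once the element's text is non-empty
    and differs from the value captured before submitting the new inputs."""
    current = driver.find_element_by_css_selector(css_selector).text
    return current != '' and current != old_text

# Intended use inside the scraping loop (sketch):
#   old = driver.find_element_by_css_selector('.r1 span').text
#   term.send_keys(Keys.TAB)
#   WebDriverWait(driver, 30).until(lambda d: text_changed(d, '.r1 span', old))
```

Note this only helps when the new inputs are guaranteed to produce a different figure; for repeated identical simulations the `jQuery.active` wait is still needed.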

Answer

That data comes from an XHR request, so just use requests to post your values and parse the response with json.loads.

Use your browser's network tab to see what the request looks like.
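To make the idea concrete: once the XHR is identified in the network tab, the Selenium loop collapses into a plain HTTP call plus JSON parsing. The endpoint URL, form-field names, and response keys below are all hypothetical placeholders; the real ones must be copied from the network tab:

```python
import json

def parse_simulation(raw_json):
    """Extract the figures of interest from a (hypothetical) XHR response body.
    The keys 'monthly', 'tin', and 'tae' are placeholder assumptions."""
    data = json.loads(raw_json)
    return {
        'Monthly repayment': data['monthly'],
        'TIN': data['tin'],
        'TAE': data['tae'],
    }

# With the real URL and form fields from the network tab, the call would be
# something like (sketch, not verified against the actual endpoint):
#   import requests
#   r = requests.post(XHR_URL, data={'amount': 13000, 'term': 24})
#   result = parse_simulation(r.text)
```

This is typically both faster and more robust than driving a headless browser, since there is no render/AJAX timing to race against.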
