Python web scraping on large HTML webpages

Question

I am trying to get all the historical information for a particular stock from Yahoo Finance. I am new to Python and web scraping.

I want to download all the historical data into a CSV file. The problem is that the code downloads only the first 100 entries for any stock on the website. When a stock is viewed in the browser, you have to scroll to the bottom of the page for more table rows to load.

I think the same thing happens when I download using the library: some kind of optimization seems to prevent the web page from loading entirely. Try it out here (https://in.finance.yahoo.com/quote/TVSMOTOR.NS/history?period1=-19800&period2=1524236374&interval=1d&filter=history&frequency=1d). Is there a way to overcome this?

Here is my code:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://in.finance.yahoo.com/quote/TVSMOTOR.NS/history?period1=-19800&period2=1524236374&interval=1d&filter=history&frequency=1d'

# Download and parse the page
page = uReq(my_url)
page_html = page.read()
page_data = soup(page_html, "html.parser")

# The historical prices live in a single table marked data-test="historical-prices"
container = page_data.find_all("table", {"data-test": "historical-prices"})[0].tbody
rows = container.find_all("tr")

filename = "tvs.csv"
f = open(filename, "w")
headers = "date,open,high,low,close,adjusted_close_price,vol\n"
f.write(headers)

for row in rows:
    # Price rows have exactly 7 plain cells; dividend/split rows use colspan
    if len(row.find_all("td", {"colspan": ""})) == 7:
        col = row.find_all("td")
        date = col[0].span.text.strip()
        opend = col[1].span.text.strip().replace(",", "")
        if opend != 'null':
            high = col[2].span.text.strip().replace(",", "")
            low = col[3].span.text.strip().replace(",", "")
            close = col[4].span.text.strip().replace(",", "")
            adjclose = col[5].span.text.strip().replace(",", "")
            vol = col[6].span.text.strip().replace(",", "")
            f.write(date + "," + opend + "," + high + "," + low + "," + close + "," + adjclose + "," + vol + "\n")

f.close()

Thanks in advance!

Okay, I found another piece of code that works well. But I have no idea how it works. Any help would be appreciated.

#!/usr/bin/env python

"""
get-yahoo-quotes.py:  Script to download Yahoo historical quotes using the new cookie authenticated site.
 Usage: get-yahoo-quotes SYMBOL
 History
 06-03-2017 : Created script
"""

__author__ = "Brad Luicas"
__copyright__ = "Copyright 2017, Brad Lucas"
__license__ = "MIT"
__version__ = "1.0.0"
__maintainer__ = "Brad Lucas"
__email__ = "brad@beaconhill.com"
__status__ = "Production"


import re
import sys
import time
import datetime
import requests


def split_crumb_store(v):
    # e.g. ',"CrumbStore":{"crumb":"9q.A4D1c.b9"' -> '9q.A4D1c.b9'
    return v.split(':')[2].strip('"')


def find_crumb_store(lines):
    # Looking for
    # ,"CrumbStore":{"crumb":"9q.A4D1c.b9
    for l in lines:
        if re.findall(r'CrumbStore', l):
            return l
    print("Did not find CrumbStore")


def get_cookie_value(r):
    # Yahoo's 'B' cookie identifies the session; the crumb is only valid alongside it
    return {'B': r.cookies['B']}


def get_page_data(symbol):
    # Fetch the regular quote page; Yahoo sets the 'B' cookie on this response
    url = "https://finance.yahoo.com/quote/%s/?p=%s" % (symbol, symbol)
    r = requests.get(url)
    cookie = get_cookie_value(r)

    # The crumb is embedded in the page's JavaScript and may contain
    # escapes such as \u002F, e.g.
    # ,"CrumbStore":{"crumb":"FWP\u002F5EFll3U"
    # Decode the escapes, then split on '}' so each JSON fragment
    # lands on its own line for find_crumb_store()
    lines = r.content.decode('unicode-escape').strip().replace('}', '\n')
    return cookie, lines.split('\n')


def get_cookie_crumb(symbol):
    cookie, lines = get_page_data(symbol)
    crumb = split_crumb_store(find_crumb_store(lines))
    return cookie, crumb


def get_data(symbol, start_date, end_date, cookie, crumb):
    # The v7 download endpoint returns the full history as CSV in one
    # response, provided the crumb matches the 'B' cookie sent with it
    filename = '%s.csv' % (symbol)
    url = "https://query1.finance.yahoo.com/v7/finance/download/%s?period1=%s&period2=%s&interval=1d&events=history&crumb=%s" % (symbol, start_date, end_date, crumb)
    response = requests.get(url, cookies=cookie)
    with open(filename, 'wb') as handle:
        for block in response.iter_content(1024):
            handle.write(block)


def get_now_epoch():
    # @see https://www.linuxquestions.org/questions/programming-9/python-datetime-to-epoch-4175520007/#post5244109
    return int(time.time())


def download_quotes(symbol):
    start_date = 0  # epoch 0, i.e. request the entire available history
    end_date = get_now_epoch()
    cookie, crumb = get_cookie_crumb(symbol)
    get_data(symbol, start_date, end_date, cookie, crumb)


if __name__ == '__main__':
    # If we have at least one argument, loop over all of them, assuming they are symbols
    if len(sys.argv) == 1:
        print("\nUsage: get-yahoo-quotes.py SYMBOL\n\n")
    else:
        for i in range(1, len(sys.argv)):
            symbol = sys.argv[i]
            print("--------------------------------------------------")
            print("Downloading %s to %s.csv" % (symbol, symbol))
            download_quotes(symbol)
            print("--------------------------------------------------")

Answer

Initially, only 100 results are loaded in the browser. When you scroll to the bottom of the page, a JS event fires that triggers an AJAX call to fetch the next 50-100 entries in the background, which are then rendered in the browser. There is no way to create that JS event or the AJAX request from your Python code, because urllib only downloads the raw HTML and does not execute JavaScript. So it is better to use https://intrinio.com/ or https://www.alphavantage.co
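
If you do want to scrape the page itself, one way around that limitation (not part of the answer above) is to drive a real browser from Python with Selenium, which executes the JavaScript and can fire the scroll events. A minimal sketch, assuming chromedriver is installed and that Yahoo still loads extra rows on scroll; the iteration count and sleep are illustrative, not tuned values:

from selenium import webdriver
import time

url = ('https://in.finance.yahoo.com/quote/TVSMOTOR.NS/history'
       '?period1=-19800&period2=1524236374&interval=1d'
       '&filter=history&frequency=1d')

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get(url)

# Scroll to the bottom repeatedly so the page's JS fires the AJAX
# calls that append more table rows; 20 iterations is arbitrary
for _ in range(20):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)  # give each AJAX response time to arrive

page_html = driver.page_source  # now includes the lazily loaded rows
driver.quit()

# page_html can then be parsed with BeautifulSoup exactly as before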

You may also try the yahoo-finance Python package: https://pypi.org/project/yahoo-finance/
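
A minimal sketch based on the examples on that PyPI page (Share and get_historical are the package's documented API; the symbol and date range below are placeholders, and note the package targets Yahoo's old API, so it may no longer work):

from yahoo_finance import Share

tvs = Share('TVSMOTOR.NS')  # placeholder symbol
# get_historical returns a list of dicts, one per trading day
history = tvs.get_historical('2017-01-01', '2017-12-31')
for day in history:
    # keys per the package docs: Date, Open, High, Low, Close, Adj_Close, Volume
    print(day['Date'], day['Open'], day['High'], day['Low'], day['Close'], day['Volume'])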
