Web scraping NSE options prices using Python BeautifulSoup, regarding encoding correction

Question

I am a bit new to web scraping and not used to the 'tr' and 'td' stuff, hence this doubt. I am trying to replicate in my Python 3 the Python 2.7 code from this post: https://www.quantinsti.com/blog/option-chain-extraction-for-nse-stocks-using-python.

This old code uses .ix for indexing, which I can correct easily with .iloc. However, one line raises the error 'a bytes-like object is required, not 'str'' even if I write it beforehand.
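
In Python 3 that message typically comes from calling a bytes method with a str argument. A minimal illustration, separate from the code below:

>>> b'1,234'.replace(',', '')
Traceback (most recent call last):
  ...
TypeError: a bytes-like object is required, not 'str'
>>> b'1,234'.replace(b',', b'')
b'1234'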

I have checked this other link from Stack Overflow and couldn't solve my problem.

I think I have spotted why this is happening. It's because of the for loop used earlier to define the variable tr. If I omit this line, then I get a DataFrame of numbers with some text attached to them. I can filter this with a loop over the entire DataFrame, but a better way must be to use the replace() function properly. I can't figure this bit out.
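
For reference, the whole-DataFrame clean-up mentioned above could look something like this (a minimal sketch, assuming the table already holds plain strings; pandas' applymap applies a function to every cell):

# strip thousands separators from every string cell; leave other values untouched
new_table = new_table.applymap(lambda s: s.replace(',', '') if isinstance(s, str) else s)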

Here is my full code. I have marked the critical sections of the code I am referring to with ######################### on its own line, so that each one can be found quickly (even with Ctrl + F):

import requests
import pandas as pd
from bs4 import BeautifulSoup

Base_url = ("https://nseindia.com/live_market/dynaContent/"+
        "live_watch/option_chain/optionKeys.jsp?symbolCode=2772&symbol=UBL&"+
        "symbol=UBL&instrument=OPTSTK&date=-&segmentLink=17&segmentLink=17")

page = requests.get(Base_url)
#page.status_code
#page.content

soup = BeautifulSoup(page.content, 'html.parser')
#print(soup.prettify())

table_it = soup.find_all(class_="opttbldata")
table_cls_1 = soup.find_all(id = "octable")

col_list = []

# Pulling heading out of the Option Chain Table

#########################
for mytable in table_cls_1:
    table_head = mytable.find('thead')

    try:
        rows = table_head.find_all('tr')
        for tr in rows:
            cols = tr.find_all('th')
            for th in cols:
                er = th.text
                #########################
                ee = er.encode('utf8')
                col_list.append(ee)
    except:
        print('no thread')

col_list_fnl = [e for e in col_list if e not in ('CALLS', 'PUTS', 'Chart', '\xc2\xa0')]
#print(col_list_fnl)

table_cls_2 = soup.find(id = "octable")
all_trs = table_cls_2.find_all('tr')
req_row = table_cls_2.find_all('tr')

new_table = pd.DataFrame(index=range(0,len(req_row)-3),columns = col_list_fnl)

row_marker = 0

for row_number, tr_nos in enumerate(req_row):
    if row_number <= 1 or row_number == len(req_row)-1:
        continue # To ensure we only choose non-empty rows

    td_columns = tr_nos.find_all('td')

    # Removing the graph column
    select_cols = td_columns[1:22]
    cols_horizontal = range(0,len(select_cols))

    for nu, column in enumerate(select_cols):

        utf_string = column.get_text()
        utf_string = utf_string.strip('\n\r\t": ')
        #########################
        tr = tr.replace(',' , '') # Commenting this out makes code partially work, getting numbers + text attached to the numbers in the table

        # That is obtained by commenting out the above line with tr variable & running the entire code.
        tr = utf_string.encode('utf8')

        new_table.iloc[row_marker,[nu]] = tr

    row_marker += 1

print(new_table)

Answer

For the first part:

er = th.text should be er = th.get_text()

Link to the get_text documentation
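
A minimal sketch of the heading loop with that change applied, keeping the question's variable names (get_text() already returns a str in Python 3, so no .encode() is needed):

col_list = []
for mytable in table_cls_1:
    table_head = mytable.find('thead')
    if table_head is None:   # skip tables without a <thead>
        continue
    for tr in table_head.find_all('tr'):
        for th in tr.find_all('th'):
            col_list.append(th.get_text().strip())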

For the latter part:

Looking at it, your "tr" variable at this point is the last tr tag found in the soup by the earlier for tr in rows loop. This means the tr you are trying to call replace on is a BeautifulSoup tag, not a string.

tr = tr.get_text().replace(',' , '') should work for the first iteration; however, because you overwrite tr later in that same iteration (with the bytes from utf_string.encode('utf8')), it will break on the next iteration.
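
One way to avoid the clash entirely is to do the replacement on the cell text and stop reusing the tr name; a minimal sketch of the inner loop under that assumption (plain str values, so no .encode() is needed in Python 3):

for nu, column in enumerate(select_cols):
    cell_text = column.get_text().strip('\n\r\t": ')
    cell_text = cell_text.replace(',', '')   # drop thousands separators
    new_table.iloc[row_marker, nu] = cell_text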

Additionally, thank you for the depth of your question. While you did not pose it as a question, the length you went to in describing the trouble you are having, as well as the code you have tried, is greatly appreciated.
