Webscraping NSE Options prices using Python BeautifulSoup, regarding encoding correction
Problem description
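I have:
- Implemented fully automated minute-level data collection for the entire FnO universe.
- Automatic adaptation to the changing FnO universe, including exits and new entries.
- Shutdown outside market hours.
- Shutdown on holidays, including newly announced ones.
- Automatic start-up for the annual Muhurat trading session.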
I am fairly new to web scraping and not used to working with 'tr' and 'td' elements, hence this question. I am trying to port the Python 2.7 code from this post, 'https://www.quantinsti.com/blog/option-chain-extraction-for-nse-stocks-using-python', to Python 3.
The old code uses .ix for indexing, which I can easily replace with .iloc. However, the line <tr = tr.replace(',' , '')> raises the error 'a bytes-like object is required, not 'str'', even if I write it before <tr = utf_string.encode('utf8')>.
I have also checked a related Stack Overflow thread, but could not solve my problem with it.
I think I have spotted why this is happening: the variable tr was already defined by the earlier for loop. If I omit this line, I get a DataFrame with the numbers plus some text attached to them. I could filter that out with a loop over the entire DataFrame, but the better way must be to use the replace() function properly. I can't figure this bit out.
Here is my full code. I have marked the critical sections with a line consisting only of ######################### so that they can be found quickly (even with Ctrl + F):
import requests
import pandas as pd
from bs4 import BeautifulSoup

Base_url = ("https://nseindia.com/live_market/dynaContent/"+
            "live_watch/option_chain/optionKeys.jsp?symbolCode=2772&symbol=UBL&"+
            "symbol=UBL&instrument=OPTSTK&date=-&segmentLink=17&segmentLink=17")

page = requests.get(Base_url)
#page.status_code
#page.content

soup = BeautifulSoup(page.content, 'html.parser')
#print(soup.prettify())

table_it = soup.find_all(class_="opttbldata")
table_cls_1 = soup.find_all(id = "octable")

col_list = []

# Pulling heading out of the Option Chain Table
#########################
for mytable in table_cls_1:
    table_head = mytable.find('thead')

    try:
        rows = table_head.find_all('tr')
        for tr in rows:
            cols = tr.find_all('th')
            for th in cols:
                er = th.text
                #########################
                ee = er.encode('utf8')
                col_list.append(ee)
    except:
        print('no thread')

col_list_fnl = [e for e in col_list if e not in ('CALLS', 'PUTS', 'Chart', '\xc2\xa0')]
#print(col_list_fnl)

table_cls_2 = soup.find(id = "octable")
all_trs = table_cls_2.find_all('tr')
req_row = table_cls_2.find_all('tr')

new_table = pd.DataFrame(index=range(0,len(req_row)-3),columns = col_list_fnl)

row_marker = 0

for row_number, tr_nos in enumerate(req_row):
    if row_number <= 1 or row_number == len(req_row)-1:
        continue # To insure we only choose non empty rows
    td_columns = tr_nos.find_all('td')

    # Removing the graph column
    select_cols = td_columns[1:22]
    cols_horizontal = range(0,len(select_cols))

    for nu, column in enumerate(select_cols):
        utf_string = column.get_text()
        utf_string = utf_string.strip('\n\r\t": ')
        #########################
        tr = tr.replace(',' , '') # Commenting this out makes code partially work, getting numbers + text attached to the numbers in the table
        # That is obtained by commenting out the above line with tr variable & running the entire code.
        tr = utf_string.encode('utf8')
        new_table.iloc[row_marker,[nu]] = tr

    row_marker += 1

print(new_table)
Recommended answer
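For the first part:
er = th.text
should be er = th.get_text() (in BeautifulSoup, .text is a property that calls get_text(), so both return the same string).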
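One further point on the header loop, offered as a sketch rather than a confirmed diagnosis: in Python 3 the header text is already a str, so the .encode('utf8') step turns every header into bytes, and the filter against 'CALLS', 'PUTS' and so on can then never match. The header list below is made up for illustration and only stands in for what th.get_text() would return on the real page.

```python
# Hypothetical header texts, standing in for the values th.get_text() returns.
raw_headers = ['CALLS', '\xa0', 'PUTS', 'Chart', 'OI', 'Chng in OI', 'LTP', 'Strike Price']

# Python 2 style: encoding produces bytes, and bytes never compare equal to str in Python 3,
# so nothing is removed by this filter.
encoded = [h.encode('utf8') for h in raw_headers]
unfiltered = [e for e in encoded if e not in ('CALLS', 'PUTS', 'Chart', '\xc2\xa0')]

# Python 3 style: keep the str values; a non-breaking space is the single character '\xa0'.
col_list_fnl = [h for h in raw_headers if h not in ('CALLS', 'PUTS', 'Chart', '\xa0')]

print(len(unfiltered) == len(raw_headers))
print(col_list_fnl)
```

Under that assumption, dropping the encode call and comparing against '\xa0' instead of '\xc2\xa0' would be the Python 3 equivalent of the original filter.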
For the latter part:
At this point, your "tr" variable is the last tr tag found in the soup by the loop for tr in rows. This means the tr you are trying to call replace on is a BeautifulSoup Tag object, not a string.
tr = tr.get_text().replace(',' , '')
should work for the first iteration. However, because tr is then overwritten within that iteration, it will break on the next one.
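A possible way around this, sketched under the assumption that the goal is simply to strip thousands separators from each cell: apply replace() to the cell text itself (utf_string in the question) and skip the encode step, since Python 3 strings are already Unicode. The helper name clean_cell and the sample cell values are hypothetical and not part of the original code.

```python
def clean_cell(text):
    """Strip layout characters and thousands separators from one table cell."""
    return text.strip('\n\r\t": ').replace(',', '')

# Made-up cell texts, shaped like what column.get_text() might return.
samples = ['\r\n\t1,23,500\t', ' 45.60 ', '-']
cleaned = [clean_cell(s) for s in samples]
print(cleaned)

# For comparison: calling replace() with str arguments on a bytes object
# raises the TypeError reported in the question.
try:
    '1,234'.encode('utf8').replace(',', '')
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

In the question's inner loop this would amount to something like new_table.iloc[row_marker, nu] = clean_cell(column.get_text()), leaving the tr variable from the header loop untouched.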
Additionally, thank you for the depth of your question. Although you did not phrase it as a question, the effort you put into describing the trouble you are having and the code you have tried is greatly appreciated.