Python webscraping - NoneObject failure - broken HTML?


Problem description


I've got a problem with my parsing script in Python. I've already tried it on another page (Yahoo Finance) and there it worked fine. On Morningstar, however, it's not working: I get a NoneType error in the terminal for the table variable. I guess it has to do with the structure of the Morningstar site, but I'm not sure. Maybe someone can tell me what went wrong. Or is it impossible to use my simple script because of the site structure of the Morningstar site?

A simple CSV export directly from Morningstar is not a solution, because I would like to use the script for other sites which don't have this functionality.

import requests
import csv
from bs4 import BeautifulSoup

url = 'http://financials.morningstar.com/ratios/r.html?t=SBUX&region=USA&culture=en_US'
response = requests.get(url)
html = response.content

soup = BeautifulSoup(html, "html.parser")
table = soup.find('table', attrs={'class': 'r_table1 text2'})

print(table.prettify())  # debugging - fails here: find() returned None

list_of_rows = []
for row in table.findAll('tr'):
    list_of_cells = []

    for cell in row.findAll(['th', 'td']):
        text = cell.text.replace(u'\xa0', '')  # strip non-breaking spaces
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)
print(list_of_rows)  # debugging

outfile = open("./test.csv", "wb")
writer = csv.writer(outfile)
writer.writerows(list_of_rows)
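
A quick way to confirm the diagnosis in the solution below is to check whether the table's class name appears anywhere in the raw response body (a minimal sketch; the r_table1 class comes from the find() call above):

import requests

url = 'http://financials.morningstar.com/ratios/r.html?t=SBUX&region=USA&culture=en_US'
response = requests.get(url)

# should print False: the table markup is injected client-side by JavaScript,
# so it never appears in the HTML that requests receives
print('r_table1' in response.text)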

Solution

The table is dynamically loaded by a separate XHR call to an endpoint which returns a JSONP response. Simulate that request, extract the JSON string from the JSONP response, load it with json, extract the HTML from the componentData key, and parse it with BeautifulSoup:

import json
import re

import requests
from bs4 import BeautifulSoup

# make a request to the endpoint the page itself calls via XHR
url = 'http://financials.morningstar.com/financials/getFinancePart.html?&callback=jsonp1450279445504&t=XNAS:SBUX&region=usa&culture=en-US&cur=&order=asc&_=1450279445578'
response = requests.get(url)

# strip the JSONP padding (the "jsonp...(" prefix and the trailing ");"),
# then extract the HTML stored under the "componentData" key
data = json.loads(re.sub(r'([a-zA-Z_0-9\.]*\()|(\);?$)', '', response.text))["componentData"]

# parse the HTML fragment
soup = BeautifulSoup(data, "html.parser")
table = soup.find('table', attrs={'class': 'r_table1 text2'})
print(table.prettify())
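
From here the rows can be collected and written to CSV just as the original script intended (a minimal sketch assuming the solution code above has run and table is not None; the u'\xa0' replacement mirrors the non-breaking-space cleanup from the question):

import csv

list_of_rows = []
for row in table.findAll('tr'):
    # gather header and data cells, stripping non-breaking spaces
    cells = [cell.text.replace(u'\xa0', '') for cell in row.findAll(['th', 'td'])]
    list_of_rows.append(cells)

with open("./test.csv", "w") as outfile:
    writer = csv.writer(outfile)
    writer.writerows(list_of_rows)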
