Python Parsing HTML Table Generated by JavaScript

Views: 176 Published: 2018/6/22 20:32:56 Tags: javascript python html pandas beautifulsoup


Problem description

I'm trying to scrape a table from the NYSE website (http://www1.nyse.com/about/listed/IPO_Index.html) into a pandas dataframe. In order to do so, I have a setup like this:

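import pandas
import requests
from bs4 import BeautifulSoup

def htmltodf(url):
    # Download the static page source; no JavaScript is executed here.
    page = requests.get(url)
    soup = BeautifulSoup(page.text)

    # Collect every <table> element present in the static source.
    tables = soup.findAll('table')
    test = pandas.io.html.read_html(str(tables))

    return test            # read_html returns a list of DataFrames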

However, when I run this on the page, all of the tables returned in the list are essentially empty. When I investigated further, I found that the table is generated by JavaScript. In my web browser's developer tools, the table looks like any other HTML table, with the usual tags. However, viewing the page source revealed something like this instead:

<script language="JavaScript">

.
.
.

<script>
var year = [["ICC","21st Century Oncology Holdings, Inc.","22 May  2014","/about/listed/icc.html" ],
... more entries here ...
,["ZOES","Zoe's Kitchen, Inc.","11 Apr 2014","/about/listed/zoes.html" ]] ;

    if(year.length != 0) 
    {   

    document.write ("<table width='619' border='0' cellspacing='0' cellpadding='0'><tr><td><span class='fontbold'>");
    document.write ('2014' + " IPO Showcase"); 
    document.write ("</span></td></tr></table>"); 
    }  
</script>

Therefore, when my HTML parser goes to look for the table tag, all it can find is the if condition, and no proper tags below that would indicate content. How can I scrape this table? Is there a tag that I can search for instead of table that will reveal the content? Because the code is not in traditional html table form, how do I read it in with pandas--do I have to manually parse the data?

Solution

In this case, you need something to run that javascript code for you.
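The reason is that a static parser never executes the script, so the table markup exists only as text inside a JavaScript string. The following is a minimal sketch of that behaviour, assuming the standard library's `html.parser` as a stand-in for BeautifulSoup. The sample strings are abbreviated from the snippet in the question, and `TableCounter` and `count_table_tags` are illustrative names, not part of any library.

```python
from html.parser import HTMLParser

# Static source as served: the table markup only exists inside JS strings.
STATIC_SOURCE = """
<script>
var year = [["ICC","21st Century Oncology Holdings, Inc.","22 May  2014","/about/listed/icc.html" ]];
if (year.length != 0) {
    document.write("<table width='619'><tr><td><span class='fontbold'>");
    document.write('2014' + " IPO Showcase");
    document.write("</span></td></tr></table>");
}
</script>
"""

# What the document would contain after a browser runs the script above.
RENDERED_DOM = (
    "<table width='619'><tr><td><span class='fontbold'>"
    "2014 IPO Showcase</span></td></tr></table>"
)


class TableCounter(HTMLParser):
    """Counts the <table> start tags the parser actually recognises."""

    def __init__(self):
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables += 1


def count_table_tags(html):
    parser = TableCounter()
    parser.feed(html)
    parser.close()
    return parser.tables


print(count_table_tags(STATIC_SOURCE))  # script body is handled as plain text
print(count_table_tags(RENDERED_DOM))   # real markup is parsed as tags
```

The first call reports no table tags and the second reports one, which is why the markup has to be produced by something that runs the script before it can be parsed.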

One option here would be to use selenium:
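from io import StringIO

from pandas import read_html
from selenium import webdriver
from selenium.webdriver.common.by import By

# Start a real browser so the page's JavaScript actually runs.
driver = webdriver.Firefox()
driver.get('http://www1.nyse.com/about/listed/IPO_Index.html')

# Selenium 4 removed find_element_by_xpath; find_element(By.XPATH, ...) replaces it.
table = driver.find_element(By.XPATH, '//div[@class="sp5"]/table//table/..')
table_html = table.get_attribute('innerHTML')

# Recent pandas versions expect a file-like object rather than a raw HTML string.
df = read_html(StringIO(table_html))[0]
print(df)

driver.close()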
prints:
                                                    0       1           2    3
0                                                Name  Symbol         NaT  NaN
1                       Performance Sports Group Ltd.     PSG  2014-06-20  NaN
2                           Century Communities, Inc.     CCS  2014-06-18  NaN
3                        Foresight Energy Partners LP    FELP  2014-06-18  NaN
...
79  EGShares TCW EM Long Term Investment Grade Bon...    LEMF  2014-01-08  NaN
80  EGShares TCW EM Short Term Investment Grade Bo...    SEMF  2014-01-08  NaN

[81 rows x 4 columns]
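As a possible alternative to driving a browser, the data could also be parsed by hand, since the rows are already present in the page source as a JavaScript array. The sketch below is not part of the answer above. It assumes the array literal is valid JSON (double-quoted strings, as in the snippet in the question) and that no field contains `]]`. The names `PAGE_SOURCE`, `ARRAY_RE`, and `extract_rows` are illustrative, and the sample holds only the two rows quoted in the question.

```python
import json
import re

# Abbreviated page source: only the two rows quoted in the question.
PAGE_SOURCE = '''
<script>
var year = [["ICC","21st Century Oncology Holdings, Inc.","22 May  2014","/about/listed/icc.html" ],
["ZOES","Zoe's Kitchen, Inc.","11 Apr 2014","/about/listed/zoes.html" ]] ;
if(year.length != 0) { document.write("<table>"); }
</script>
'''

# Capture everything from the opening "[[" to the first "]]" after "var year =".
ARRAY_RE = re.compile(r'var\s+year\s*=\s*(\[\[.*?\]\])\s*;', re.DOTALL)


def extract_rows(page_source):
    """Return the rows of the `year` array, or [] if it is not found."""
    match = ARRAY_RE.search(page_source)
    if match is None:
        return []
    # Assumption: the array literal happens to be valid JSON.
    return json.loads(match.group(1))


for row in extract_rows(PAGE_SOURCE):
    print(row)
```

The resulting list of rows can then be passed to `pandas.DataFrame(rows, columns=[...])` with column labels of your choosing, since the page source does not name them. If the real page uses single-quoted strings or trailing commas, `json.loads` will fail and a more tolerant parser would be needed.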
