Not able to extract nested table body with pandas from webpage

Question

I am trying to extract a nested table from the URL 'http://gsa.nic.in/report/janDhan.html' using pandas, with the following code:

import pandas as pd

url = "http://gsa.nic.in/report/janDhan.html"
table = pd.read_html(url)[3]  # fourth table found on the page
print(table)
table.to_excel("GSA.xlsx")

However, it prints only the header of the table. Please advise. I am a newbie and don't want to use BeautifulSoup. If pandas can't do the intended task, why not?

Recommended answer

The table is populated by JavaScript, so it is not in the HTML that pandas fetches: pandas only downloads and parses the raw HTML and does not execute scripts. You can confirm this by viewing the page source in your browser and searching for a value that appears in the table, such as "PRADESH".
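
The same check can be done from Python instead of the browser. The sketch below is only an illustration: the helper names fetch_static_html and appears_in_static_html are made up for this example, and it uses the standard library's urllib to download the raw HTML without running any JavaScript.

```python
from urllib.request import urlopen


def fetch_static_html(url, timeout=30):
    """Download the raw HTML exactly as the server sends it (no JavaScript is run)."""
    with urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def appears_in_static_html(html, value):
    """Return True if `value` occurs in the raw HTML, ignoring case."""
    return value.lower() in html.lower()
```

Calling appears_in_static_html(fetch_static_html(url), "PRADESH") with the URL from the question should return False if the rows really are injected by JavaScript; a True result would mean the data is in the static HTML and the problem lies elsewhere.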

The solution is to use a library such as requests-html or Selenium to load the JavaScript-rendered page, and then parse the rendered HTML with pandas.

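from io import StringIO

import pandas as pd
from requests_html import HTMLSession

url = "http://gsa.nic.in/report/janDhan.html"

s = HTMLSession()
r = s.get(url)
r.html.render()  # runs the page's JavaScript in headless Chromium (downloaded on first use)

# read_html needs a string or file-like object, so pass the rendered markup, not the HTML object
table = pd.read_html(StringIO(r.html.html))[3]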
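
For the Selenium route mentioned above, a rough sketch might look like the following. It rests on several assumptions: Chrome and a matching driver are available, the data cells are ordinary <td> elements (so waiting for the first one is a reasonable signal that the script has run), and the function names render_with_selenium, count_data_cells and tables_from_html are hypothetical helpers written for this example, not part of Selenium or pandas.

```python
from html.parser import HTMLParser
from io import StringIO


class _CellCounter(HTMLParser):
    """Counts <td> start tags in a document."""

    def __init__(self):
        super().__init__()
        self.cells = 0

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.cells += 1


def count_data_cells(html):
    """Return the number of <td> cells; 0 suggests only header (<th>) cells are present."""
    counter = _CellCounter()
    counter.feed(html)
    counter.close()
    return counter.cells


def render_with_selenium(url, timeout=30):
    """Load `url` in headless Chrome and return the HTML after JavaScript has run."""
    # Imported lazily so count_data_cells works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")  # assumes a recent Chrome version
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Assumption: the script has filled the table once a <td> exists.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, "td"))
        )
        return driver.page_source
    finally:
        driver.quit()


def tables_from_html(html):
    """Parse every <table> in the rendered HTML into a list of DataFrames."""
    import pandas as pd

    return pd.read_html(StringIO(html))
```

A typical use would be html = render_with_selenium(url), then checking that count_data_cells(html) is greater than zero before taking tables_from_html(html)[3], where the index 3 is simply carried over from the question.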
