Using pandas to read a downloaded HTML file
Problem description
As the title says, I tried using read_html, but it gives me the following error:
In [17]: temp = pd.read_html('C:/age0.html', flavor='lxml')
File "<string>", line unknown
XMLSyntaxError: htmlParseStartTag: misplaced <html> tag, line 65, column 6
What am I doing wrong?
The HTML file contains some JavaScript at the top, followed by an HTML table. In R, I processed it by parsing the HTML with the XML package, which gave me a data frame. I want to do the same in Python. Should I run it through something like BeautifulSoup before giving it to pandas?
Recommended answer
I think you are on the right track by using an HTML parser like Beautiful Soup. pandas.read_html() is built to extract tables, and the strict lxml flavor fails on a page with malformed markup like this one (the misplaced <html> tag after the JavaScript), so it helps to isolate the table first.
You would want to do something like this:
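from io import StringIO
import pandas as pd
from bs4 import BeautifulSoup
# Parse the whole page leniently, then pull out the first <table> element.
with open('C:/age0.html', 'r') as f:
    table = BeautifulSoup(f.read(), 'html.parser').find('table')
# read_html takes HTML text (not a BeautifulSoup object) and returns a list of DataFrames.
df = pd.read_html(StringIO(str(table)))[0]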
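To check the approach without the original file, here is a minimal self-contained sketch. The sample markup, with its column names and values, is invented for illustration and only imitates the layout described in the question (a script on top, then a table). It assumes bs4 is installed along with a parser backend that read_html can use (lxml, or html5lib).

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: JavaScript first, then a table.
html_text = """
<script>var x = 1;</script>
<html><body>
<table>
  <tr><th>age</th><th>count</th></tr>
  <tr><td>0</td><td>12</td></tr>
  <tr><td>1</td><td>15</td></tr>
</table>
</body></html>
"""

# The lenient built-in parser copes with the misplaced <html> tag.
table = BeautifulSoup(html_text, "html.parser").find("table")

# Hand only the table markup to pandas; read_html returns a list of DataFrames.
df = pd.read_html(StringIO(str(table)))[0]
print(df)
```

If the page holds several tables, find_all('table') returns all of them, and each can be passed to read_html in the same way.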