Parse HTML and Read HTML Table with Selenium Python


Question

I am converting some of my web-scraping code from R to Python (I can't get geckodriver to work with R, but it's working with Python). Anyways, I am trying to understand how to parse and read HTML tables with Python. Quick background, here is my code for R:

doc <- htmlParse(remDr$getPageSource()[[1]],ignoreBlanks=TRUE, replaceEntities = FALSE, trim=TRUE, encoding="UTF-8")

WebElem <- readHTMLTable(doc, stringsAsFactors = FALSE)[[7]]

I would parse the HTML page to the doc object. Then I would start with doc[[1]], and move through higher numbers until I saw the data I wanted. In this case I got to doc[[7]] and saw the data I wanted. I then would read that HTML table and assign it to the WebElem object. Eventually I would turn this into a dataframe and play with it.

So what I am doing in Python is this:

from bs4 import BeautifulSoup

html = driver.page_source
doc = BeautifulSoup(html, "html.parser")

Then I started to play with doc.get_text but I don't really know how to get just the data I want to see. The data I want to see is like a 10x10 matrix. When I used R, I would just use doc[[7]] and that matrix would almost be in a perfect structure for me to convert it to a dataframe. However, I just can't seem to do that with Python. Any advice would be much appreciated.
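The closest BeautifulSoup analogue to R's `doc[[7]]` indexing is `find_all("table")`, which returns every table on the page as a list you can index into. A minimal sketch with a throwaway two-table HTML snippet (the real page would come from `driver.page_source`):

```python
from bs4 import BeautifulSoup

html = """
<html><body>
<table><tr><td>first table, not the one we want</td></tr></table>
<table>
  <tr><th>BREED</th><th>2015</th></tr>
  <tr><td>Beagles</td><td>5</td></tr>
</table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
tables = soup.find_all("table")          # like doc[[1]], doc[[2]], ... in R

# Pull the second table out into a list of rows, one list of cell texts per <tr>
rows = [[cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in tables[1].find_all("tr")]
```

Indexing `tables[6]` on the real page would correspond to the `doc[[7]]` step in R (Python lists are zero-based).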

UPDATE:

I have been able to get the data I want using Python--I followed this blog for creating a dataframe with Python: Python Web-Scraping. Here is the website that we are scraping in that blog: Most Popular Dog Breeds. In that blog post, you have to work your way through the elements, create a dict, loop through each row of the table and store the data in each column, and then you are able to create a dataframe.
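The dict-and-loop approach the blog describes can be sketched roughly like this, again on a small inline HTML table standing in for the real page:

```python
from bs4 import BeautifulSoup

html = """<table>
<tr><th>BREED</th><th>2015</th></tr>
<tr><td>Retrievers (Labrador)</td><td>1</td></tr>
<tr><td>German Shepherd Dogs</td><td>2</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table").find_all("tr")

# First row gives the column names; build one list per column
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
data = {h: [] for h in headers}

# Walk the remaining rows and append each cell to its column's list
for tr in rows[1:]:
    for h, td in zip(headers, tr.find_all("td")):
        data[h].append(td.get_text(strip=True))

# `data` is now ready to hand to pandas: pd.DataFrame(data)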

With R, the only code I had to write was:

doc <- htmlParse(remDr$getPageSource()[[1]],ignoreBlanks=TRUE, replaceEntities = FALSE, trim=TRUE, encoding="UTF-8")

df <- as.data.frame(readHTMLTable(doc, stringsAsFactors = FALSE))

With just that, I have a pretty nice dataframe that I only need to adjust the column names and data types--it looks like this with just that code:

                  NULL.V1 NULL.V2 NULL.V3 NULL.V4
1                   BREED    2015    2014    2013
2   Retrievers (Labrador)       1       1       1
3    German Shepherd Dogs       2       2       2
4     Retrievers (Golden)       3       3       3
5                Bulldogs       4       4       5
6                 Beagles       5       5       4
7         French Bulldogs       6       9      11
8      Yorkshire Terriers       7       6       6
9                 Poodles       8       7       8
10            Rottweilers       9      10       9

Is there not something available in Python to make this a bit simpler, or is this just simpler in R because R is more built for dataframes (at least that's how it seems to me, but I could be wrong)?

Answer

Ok, after some hefty digging around, I feel like I came to a good solution--one matching that of R. If you are looking at the HTML provided in the link above, Dog Breeds, and you have the web driver running for that link, you can run the following code:

import pandas as pd

tbl = driver.find_element_by_xpath("//html/body/main/article/section[2]/div/article/table").get_attribute('outerHTML')

df = pd.read_html(tbl)

Then you are looking at a pretty nice dataframe after only a couple of lines of code:

In [145]: df
Out[145]:
[                        0     1     2       3
 0                   BREED  2015  2014  2013.0
 1   Retrievers (Labrador)     1     1     1.0
 2    German Shepherd Dogs     2     2     2.0
 3     Retrievers (Golden)     3     3     3.0
 4                Bulldogs     4     4     5.0
 5                 Beagles     5     5     4.0
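Note the surrounding `[ ... ]` in that output: `pd.read_html` always returns a *list* of DataFrames, one per `<table>` it finds, so you usually want to index in with `[0]`. A small self-contained sketch (the HTML string stands in for the `outerHTML` pulled from Selenium):

```python
from io import StringIO

import pandas as pd

html = """<table>
<tr><th>BREED</th><th>2015</th></tr>
<tr><td>Beagles</td><td>5</td></tr>
</table>"""

# read_html returns a list of DataFrames, one per <table> found
dfs = pd.read_html(StringIO(html))
df = dfs[0]          # the first (and here only) table
```

Wrapping the string in `StringIO` avoids the deprecation warning newer pandas versions emit when a literal HTML string is passed directly.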

I feel like this is much easier than working through the tags, creating a dict, and looping through each row of data as the blog suggests. It might not be the most correct way of doing things (I'm new to Python), but it gets the job done quickly. I hope this helps out some fellow web-scrapers.
