Parse HTML and Read HTML Table with Selenium Python


Question

I am converting some of my web-scraping code from R to Python (I can't get geckodriver to work with R, but it's working with Python). Anyways, I am trying to understand how to parse and read HTML tables with Python. Quick background, here is my code for R:

doc <- htmlParse(remDr$getPageSource()[[1]],ignoreBlanks=TRUE, replaceEntities = FALSE, trim=TRUE, encoding="UTF-8")

WebElem <- readHTMLTable(doc, stringsAsFactors = FALSE)[[7]]

I would parse the HTML page to the doc object. Then I would start with doc[[1]], and move through higher numbers until I saw the data I wanted. In this case I got to doc[[7]] and saw the data I wanted. I then would read that HTML table and assign it to the WebElem object. Eventually I would turn this into a dataframe and play with it.

So what I am doing in Python is this:

from bs4 import BeautifulSoup

html = driver.page_source
doc = BeautifulSoup(html, "html.parser")

Then I started to play with doc.get_text but I don't really know how to get just the data I want to see. The data I want to see is like a 10x10 matrix. When I used R, I would just use doc[[7]] and that matrix would almost be in a perfect structure for me to convert it to a dataframe. However, I just can't seem to do that with Python. Any advice would be much appreciated.
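The closest BeautifulSoup analogue to R's `doc[[7]]` indexing is `find_all("table")`, which returns every table on the page as a list you can index into. A minimal sketch with a throwaway two-table HTML snippet (the real page would come from `driver.page_source`):

```python
from bs4 import BeautifulSoup

html = """
<html><body>
<table><tr><td>first table, not the one we want</td></tr></table>
<table>
  <tr><th>BREED</th><th>2015</th></tr>
  <tr><td>Beagles</td><td>5</td></tr>
</table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
tables = soup.find_all("table")          # like doc[[1]], doc[[2]], ... in R

# Pull the second table out into a list of rows, one list of cell texts per <tr>
rows = [[cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in tables[1].find_all("tr")]
```

Indexing `tables[6]` on the real page would correspond to the `doc[[7]]` step in R (Python lists are zero-based).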

UPDATE:

I have been able to get the data I want using Python--I followed this blog for creating a dataframe with Python: Python Web-Scraping. Here is the website that we are scraping in that blog: Most Popular Dog Breeds. In that blog post, you have to work your way through the elements, create a dict, loop through each row of the table and store the data in each column, and then you are able to create a dataframe.
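The dict-and-loop approach the blog describes can be sketched roughly like this, again on a small inline HTML table standing in for the real page:

```python
from bs4 import BeautifulSoup

html = """<table>
<tr><th>BREED</th><th>2015</th></tr>
<tr><td>Retrievers (Labrador)</td><td>1</td></tr>
<tr><td>German Shepherd Dogs</td><td>2</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table").find_all("tr")

# First row gives the column names; build one list per column
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
data = {h: [] for h in headers}

# Walk the remaining rows and append each cell to its column's list
for tr in rows[1:]:
    for h, td in zip(headers, tr.find_all("td")):
        data[h].append(td.get_text(strip=True))

# `data` is now ready to hand to pandas: pd.DataFrame(data)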

With R, the only code I had to write was:

doc <- htmlParse(remDr$getPageSource()[[1]],ignoreBlanks=TRUE, replaceEntities = FALSE, trim=TRUE, encoding="UTF-8")

df <- as.data.frame(readHTMLTable(doc, stringsAsFactors = FALSE))

With just that, I have a pretty nice dataframe that I only need to adjust the column names and data types--it looks like this with just that code:

                  NULL.V1 NULL.V2 NULL.V3 NULL.V4
1                   BREED    2015    2014    2013
2   Retrievers (Labrador)       1       1       1
3    German Shepherd Dogs       2       2       2
4     Retrievers (Golden)       3       3       3
5                Bulldogs       4       4       5
6                 Beagles       5       5       4
7         French Bulldogs       6       9      11
8      Yorkshire Terriers       7       6       6
9                 Poodles       8       7       8
10            Rottweilers       9      10       9

Is there not something available in Python to make this a bit simpler, or is this just simpler in R because R is more built for dataframes (at least that's how it seems to me, but I could be wrong)?

Answer

Ok, after some hefty digging around, I feel like I came to a good solution--one matching that of R. If you are looking at the HTML provided in the link above, Dog Breeds, and you have the web driver running for that link, you can run the following code:

import pandas as pd

tbl = driver.find_element_by_xpath("//html/body/main/article/section[2]/div/article/table").get_attribute('outerHTML')

df = pd.read_html(tbl)

Then you are looking at a pretty nice dataframe after only a couple of lines of code:

In [145]: df
Out[145]:
[                        0     1     2       3
 0                   BREED  2015  2014  2013.0
 1   Retrievers (Labrador)     1     1     1.0
 2    German Shepherd Dogs     2     2     2.0
 3     Retrievers (Golden)     3     3     3.0
 4                Bulldogs     4     4     5.0
 5                 Beagles     5     5     4.0
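Note the surrounding `[ ... ]` in that output: `pd.read_html` always returns a *list* of DataFrames, one per `<table>` it finds, so you usually want to index in with `[0]`. A small self-contained sketch (the HTML string stands in for the `outerHTML` pulled from Selenium):

```python
from io import StringIO

import pandas as pd

html = """<table>
<tr><th>BREED</th><th>2015</th></tr>
<tr><td>Beagles</td><td>5</td></tr>
</table>"""

# read_html returns a list of DataFrames, one per <table> found
dfs = pd.read_html(StringIO(html))
df = dfs[0]          # the first (and here only) table
```

Wrapping the string in `StringIO` avoids the deprecation warning newer pandas versions emit when a literal HTML string is passed directly.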

I feel like this is much easier than working through the tags, creating a dict, and looping through each row of data as the blog suggests. It might not be the most correct way of doing things (I'm new to Python), but it gets the job done quickly. I hope this helps out some fellow web-scrapers.
