Extracting tables from a webpage using BeautifulSoup 4


Question

Be forgiving, I only started using BeautifulSoup today to deal with this problem.

I've managed to get it working by dragging in the URLs on the website. Each of the product pages on this website has a table that looks like the following:

<table width="100%" class="product-feature-table">
  <tbody>
    <tr>
      <td align="center"><table cellspacing="0" class="stats2">
        <tbody>
          <tr>
          <td class="hed" colspan="2">YYF Shutter Stats:</td>
          </tr>
          <tr>
          <td>Diameter:</td>
          <td>56 mm / 2.20 inches</td>
          </tr>
          <tr>
            <td>Width:</td>
            <td>44.40 mm / 1.74 inches</td>
          </tr>
          <tr>
            <td>Gap Width:</td>
            <td>4.75 mm / .18 inches</td>
          </tr>
          <tr>
            <td>Weight:</td>
            <td>67.8 grams</td>
          </tr>
          <tr>
            <td>Bearing Size:</td>
            <td>Size C (.250 x .500 x .187)<br>CBC SPEC Bearing</td>
          </tr>
          <tr>
            <td>Response:</td>
            <td>CBC Silicone Slim Pad (19mm)</td>
          </tr>
        </tbody>
        </table>
      <br>
      <br>
      </td>
    </tr>
  </tbody>
</table>

I'm trying to pull this table into some form of data that I could work with within a web app.
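To make the goal concrete, here is a minimal sketch of what pulling the rows into a dict might look like (Python 3, using bs4's built-in `html.parser`; the HTML is abbreviated from the table above):

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the stats table shown above.
html = """
<table cellspacing="0" class="stats2">
  <tr><td class="hed" colspan="2">YYF Shutter Stats:</td></tr>
  <tr><td>Diameter:</td><td>56 mm / 2.20 inches</td></tr>
  <tr><td>Weight:</td><td>67.8 grams</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
stats = {}
for tr in soup.find_all("tr"):
    tds = tr.find_all("td")
    if len(tds) == 2:  # skip the single-cell header row
        stats[tds[0].get_text(strip=True)] = tds[1].get_text(strip=True)

print(stats["Weight:"])  # -> 67.8 grams
```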

How would I go about extracting this from each webpage? The website has around 400 product pages that include this table, and I'd preferably like to get each of the tables from the page and put it into a database entry or text file with the name of the product.

As you can see, the table isn't exactly formatted well, but it is the only table on the page labeled with

class="product-feature-table"

I have just been trying to edit a URL-scraping script, but I'm starting to get the feeling I'm going about it all wrong trying to do that.

My URL script is as follows:

import urllib2
from bs4 import BeautifulSoup

url = raw_input('Web-Address: ')

html = urllib2.urlopen('http://' +url).read()
soup = BeautifulSoup(html)
soup.prettify()
for anchor in soup.findAll('a', href=True):
    print anchor['href']

I can get all these URLs into a text file, but I would much prefer to use SQLite or PostgreSQL. Are there any articles online that would help me understand these concepts better, without drowning the newbie?
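For the SQLite side, a minimal Python 3 sketch using the standard-library `sqlite3` module; the `product_stats` schema and product name are just assumptions for illustration:

```python
import sqlite3

# Hypothetical schema: one row per (product, stat label, stat value).
conn = sqlite3.connect(":memory:")  # pass a file path for a persistent DB
conn.execute(
    "CREATE TABLE product_stats (product TEXT, label TEXT, value TEXT)"
)

# Stats as scraped from one product page (values taken from the table above).
stats = {"Diameter:": "56 mm / 2.20 inches", "Weight:": "67.8 grams"}
conn.executemany(
    "INSERT INTO product_stats VALUES (?, ?, ?)",
    [("YYF Shutter", label, value) for label, value in stats.items()],
)
conn.commit()

row = conn.execute(
    "SELECT value FROM product_stats WHERE product = ? AND label = ?",
    ("YYF Shutter", "Weight:"),
).fetchone()
print(row[0])  # -> 67.8 grams
```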

Answer

First of all, if you want to extract all the tables inside a site using BeautifulSoup, you could do it in the following way:

import urllib2
from bs4 import BeautifulSoup

url = raw_input('Web-Address: ')

html = urllib2.urlopen('http://' +url).read()
soup = BeautifulSoup(html, 'html.parser')

# extract all the tables in the HTML 
tables = soup.find_all('table')

#get the class name for each
for table in tables:
  class_name = table['class']
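Since the question notes that the target table is the only one labeled `class="product-feature-table"`, the search can be narrowed rather than looping over every table. A small sketch (the sample HTML is made up for illustration):

```python
from bs4 import BeautifulSoup

html = (
    '<table class="product-feature-table"><tr><td>stats</td></tr></table>'
    '<table><tr><td>unrelated</td></tr></table>'
)
soup = BeautifulSoup(html, "html.parser")

# find() with class_ returns the first table carrying that class, or None.
table = soup.find("table", class_="product-feature-table")
print(table["class"])  # -> ['product-feature-table']
```

Note that bs4 treats `class` as a multi-valued attribute, so `table["class"]` is a list.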

Once you have all the tables in the page, you can do whatever you want with their data, moving through the tr and td tags in the following way:

for table in tables:
  tr_tags = table.find_all('tr')

Remember that the tr tags are rows inside the table. Then, to obtain the data inside the td tags, you could use something like this:

for table in tables:
  tr_tags = table.find_all('tr')

  for tr in tr_tags:
    td_tags = tr.find_all('td')

    for td in td_tags:
      text = td.string  # use td.get_text() if the cell contains nested tags

If you want to follow all the links inside the site and then find the tables, the code explained above would work for you: first retrieve all the URLs, then move between them. For example:

initial_url = 'URL'
list_of_urls = []

list_of_urls.append(initial_url)

while len(list_of_urls) > 0:

  html = urllib2.urlopen('http://' + list_of_urls.pop()).read()
  soup = BeautifulSoup(html, 'html.parser')

  for anchor in soup.find_all('a', href=True):
     list_of_urls.append(anchor['href'])

  #here put the code explained above, for example

  tables = soup.find_all('table')
  for table in tables:
    class_name = table['class']

    # continue with the above code..
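One caveat with the loop above: every link on every page is appended again, so on a site whose pages link back to each other the crawl never terminates. A minimal Python 3 sketch of the same idea with a seen-set; the `fetch` callable and the tiny in-memory "site" are hypothetical stand-ins so the logic can run without network access:

```python
from bs4 import BeautifulSoup

def crawl(start_url, fetch, limit=400):
    """Crawl from start_url, skipping already-visited URLs.

    `fetch` is any callable returning the HTML for a URL, so a real
    HTTP fetcher (e.g. urllib) can be swapped in for the fake one below.
    """
    seen = set()
    queue = [start_url]
    pages = {}
    while queue and len(pages) < limit:
        url = queue.pop(0)
        if url in seen:
            continue  # this is what the original loop was missing
        seen.add(url)
        html = fetch(url)
        pages[url] = html
        soup = BeautifulSoup(html, "html.parser")
        for anchor in soup.find_all("a", href=True):
            queue.append(anchor["href"])
    return pages

# Two pages that link to each other; the original loop would spin forever.
site = {
    "a": '<a href="b">next</a>',
    "b": '<a href="a">back</a>',
}
pages = crawl("a", site.__getitem__)
print(sorted(pages))  # -> ['a', 'b']
```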

To insert the data into an SQLite database, I recommend you read the following tutorial: Python: A Simple Step-by-Step SQLite Tutorial.
