Python - Web Scraping HTML table and printing to CSV


Problem description

I'm pretty much brand new to Python, but I'm looking to build a web scraping tool that will pull data from an HTML table online and write it to a CSV file in the same format.

Here's a sample of the HTML table (it's enormous, so I'm going to provide only a few rows).

<div class="col-xs-12 tab-content">
        <div id="historical-data" class="tab-pane active">
          <div class="tab-header">
            <h2 class="pull-left bottom-margin-2x">Historical data for Bitcoin</h2>

            <div class="clear"></div>

            <div class="row">
              <div class="col-md-12">
                <div class="pull-left">
                  <small>Currency in USD</small>
                </div>
                <div id="reportrange" class="pull-right">
                    <i class="glyphicon glyphicon-calendar fa fa-calendar"></i>&nbsp;
                    <span>Aug 16, 2017 - Sep 15, 2017</span> <b class="caret"></b>
                </div>
              </div>
            </div>

            <table class="table">
              <thead>
              <tr>
                <th class="text-left">Date</th>
                <th class="text-right">Open</th>
                <th class="text-right">High</th>
                <th class="text-right">Low</th>
                <th class="text-right">Close</th>
                <th class="text-right">Volume</th>
                <th class="text-right">Market Cap</th>
              </tr>
              </thead>
              <tbody>

                <tr class="text-right">
                  <td class="text-left">Sep 14, 2017</td>
                  <td>3875.37</td>     
                  <td>3920.60</td>
                  <td>3153.86</td>
                  <td>3154.95</td>
                  <td>2,716,310,000</td>
                  <td>64,191,600,000</td>
                </tr>

                <tr class="text-right">
                  <td class="text-left">Sep 13, 2017</td>
                  <td>4131.98</td>     
                  <td>4131.98</td>
                  <td>3789.92</td>
                  <td>3882.59</td>
                  <td>2,219,410,000</td>
                  <td>68,432,200,000</td>
                </tr>

                <tr class="text-right">
                  <td class="text-left">Sep 12, 2017</td>
                  <td>4168.88</td>     
                  <td>4344.65</td>
                  <td>4085.22</td>
                  <td>4130.81</td>
                  <td>1,864,530,000</td>
                  <td>69,033,400,000</td>
                </tr>                
              </tbody>
            </table>
          </div>

        </div>
    </div>

I'm particularly interested in recreating the table with the same column headers provided: "Date", "Open", "High", "Low", "Close", "Volume", "Market Cap". Currently, I've been able to write a simple script that goes to the URL, downloads the HTML, parses it with BeautifulSoup, and then uses 'for' statements to get the td elements. Below is a sample of my code (URL omitted) and the result:

from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv

url = "enterURLhere"
page = requests.get(url)
pagetext = page.text

pricetable = {
    "Date" : [],
    "Open" : [],
    "High" : [],
    "Low" : [],
    "Close" : [],
    "Volume" : [],
    "Market Cap" : []
}

soup = BeautifulSoup(pagetext, 'html.parser')

file = open("test.csv", 'w')

for row in soup.find_all('tr'):
    for col in row.find_all('td'):
        print(col.text)

sample output

Does anyone have any pointers on how to at least reformat the pulled data into a table? Thanks.

Solution

Run the code below and you will get the data you want from that table. To try it on the exact element you pasted above, all you need to do is wrap that whole HTML element in a triple-quoted string and assign it to a variable named html, i.e. html=''' '''
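As a rough illustration of that wrapping step, the assignment might look like the sketch below. The markup is a trimmed copy of the table from the question (header row plus one data row, three columns only), shortened here purely for readability; with the real page you would paste the full element instead.

```python
# Trimmed copy of the table from the question, kept short for illustration.
# Triple quotes let the string span several lines and contain double quotes.
html = '''
<table class="table">
  <thead>
    <tr>
      <th class="text-left">Date</th>
      <th class="text-right">Open</th>
      <th class="text-right">High</th>
    </tr>
  </thead>
  <tbody>
    <tr class="text-right">
      <td class="text-left">Sep 14, 2017</td>
      <td>3875.37</td>
      <td>3920.60</td>
    </tr>
  </tbody>
</table>
'''

# Quick sanity check before parsing: one table, two rows.
table_count = html.count("<table")
row_count = html.count("<tr")
print(table_count, row_count)
```

Any code that expects a variable named html holding the markup can then be run against this string unchanged.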

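import csv
from bs4 import BeautifulSoup

# html is the string holding the pasted markup (html = '''...''')
tree = BeautifulSoup(html, "lxml")
table_tag = tree.select("table")[0]

# One list per <tr>, holding the text of each header or data cell
tab_data = [[item.text for item in row_data.select("th,td")]
                for row_data in table_tag.select("tr")]

# newline='' stops the csv module from adding blank lines on Windows;
# the with block closes the file so every row is flushed to disk
with open("table_data.csv", "w", newline='') as outfile:
    writer = csv.writer(outfile)
    for data in tab_data:
        writer.writerow(data)
        print(' '.join(data))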
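Several cells in this table contain commas ("Sep 14, 2017", "2,716,310,000"), so it may help to see what csv.writer actually puts in the file. The sketch below writes to an in-memory buffer instead of table_data.csv so the text can be inspected directly; the rows are copied by hand from the question's table rather than scraped.

```python
import csv
import io

# Rows copied by hand from the question's table (header plus one data row).
tab_data = [
    ["Date", "Open", "High", "Low", "Close", "Volume", "Market Cap"],
    ["Sep 14, 2017", "3875.37", "3920.60", "3153.86", "3154.95",
     "2,716,310,000", "64,191,600,000"],
]

# Write to an in-memory buffer instead of a file so the CSV text can be inspected.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerows(tab_data)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the original cells, commas included.
parsed = list(csv.reader(io.StringIO(csv_text)))
print(parsed == tab_data)
```

With the default settings, csv.writer wraps any field that contains a comma in double quotes, which is why a spreadsheet or csv.reader still sees seven columns per row.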

To make this easier to follow, here is the same logic broken into pieces. The list comprehension above is really a nested for loop; written out step by step, it looks like this:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html,"lxml")
table = soup.find('table')

list_of_rows = []
# Outer loop: one pass per table row
for row in table.find_all('tr'):
    list_of_cells = []
    # Inner loop: collect the text of each header or data cell in this row
    for cell in row.find_all(["th","td"]):
        text = cell.text
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

for item in list_of_rows:
    print(' '.join(item))
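The question sets up a pricetable dictionary keyed by column name but never fills it. One possible way to get from a list of rows to that shape is sketched below; the list_of_rows value is typed in by hand from the question's table so the snippet runs on its own, and the header and body names are introduced only for this illustration.

```python
# Hand-typed stand-in for the scraped rows: header first, then data rows.
list_of_rows = [
    ["Date", "Open", "High", "Low", "Close", "Volume", "Market Cap"],
    ["Sep 14, 2017", "3875.37", "3920.60", "3153.86", "3154.95",
     "2,716,310,000", "64,191,600,000"],
    ["Sep 13, 2017", "4131.98", "4131.98", "3789.92", "3882.59",
     "2,219,410,000", "68,432,200,000"],
]

# Split the header row from the data rows.
header, *body = list_of_rows

# zip(*body) transposes rows into columns; pair each column with its header.
pricetable = {name: list(column) for name, column in zip(header, zip(*body))}

print(pricetable["Date"])
print(pricetable["Open"])
```

This assumes every data row has the same number of cells as the header; zip silently drops trailing cells from longer rows, so a ragged table would need an explicit length check first.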

Result:

Date Open High Low Close Volume Market Cap
Sep 14, 2017 3875.37 3920.60 3153.86 3154.95 2,716,310,000 64,191,600,000
Sep 13, 2017 4131.98 4131.98 3789.92 3882.59 2,219,410,000 68,432,200,000
Sep 12, 2017 4168.88 4344.65 4085.22 4130.81 1,864,530,000 69,033,400,000
