Python - Web Scraping HTML table and printing to CSV

Question

I'm pretty much brand new to Python, but I'm looking to build a web scraping tool that will pull data from an HTML table online and write it to a CSV in the same format.

Here's a sample of the HTML table (it's enormous, so I'm going to provide only a few rows).

<div class="col-xs-12 tab-content">
        <div id="historical-data" class="tab-pane active">
          <div class="tab-header">
            <h2 class="pull-left bottom-margin-2x">Historical data for Bitcoin</h2>

            <div class="clear"></div>

            <div class="row">
              <div class="col-md-12">
                <div class="pull-left">
                  <small>Currency in USD</small>
                </div>
                <div id="reportrange" class="pull-right">
                    <i class="glyphicon glyphicon-calendar fa fa-calendar"></i>&nbsp;
                    <span>Aug 16, 2017 - Sep 15, 2017</span> <b class="caret"></b>
                </div>
              </div>
            </div>

            <table class="table">
              <thead>
              <tr>
                <th class="text-left">Date</th>
                <th class="text-right">Open</th>
                <th class="text-right">High</th>
                <th class="text-right">Low</th>
                <th class="text-right">Close</th>
                <th class="text-right">Volume</th>
                <th class="text-right">Market Cap</th>
              </tr>
              </thead>
              <tbody>

                <tr class="text-right">
                  <td class="text-left">Sep 14, 2017</td>
                  <td>3875.37</td>     
                  <td>3920.60</td>
                  <td>3153.86</td>
                  <td>3154.95</td>
                  <td>2,716,310,000</td>
                  <td>64,191,600,000</td>
                </tr>

                <tr class="text-right">
                  <td class="text-left">Sep 13, 2017</td>
                  <td>4131.98</td>     
                  <td>4131.98</td>
                  <td>3789.92</td>
                  <td>3882.59</td>
                  <td>2,219,410,000</td>
                  <td>68,432,200,000</td>
                </tr>

                <tr class="text-right">
                  <td class="text-left">Sep 12, 2017</td>
                  <td>4168.88</td>     
                  <td>4344.65</td>
                  <td>4085.22</td>
                  <td>4130.81</td>
                  <td>1,864,530,000</td>
                  <td>69,033,400,000</td>
                </tr>                
              </tbody>
            </table>
          </div>

        </div>
    </div>

I'm particularly interested in recreating the table with the same column headers: "Date," "Open," "High," "Low," "Close," "Volume," "Market Cap." So far, I've written a simple script that goes to the URL, downloads the HTML, parses it with BeautifulSoup, and then uses 'for' statements to get the td elements. Below is a sample of my code (URL omitted) and the result:

from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv

url = "enterURLhere"
page = requests.get(url)
pagetext = page.text

pricetable = {
    "Date" : [],
    "Open" : [],
    "High" : [],
    "Low" : [],
    "Close" : [],
    "Volume" : [],
    "Market Cap" : []
}

soup = BeautifulSoup(pagetext, 'html.parser')

file = open("test.csv", 'w')

for row in soup.find_all('tr'):
    for col in row.find_all('td'):
        print(col.text)

Sample output

Does anyone have any pointers on how to at least reformat the pulled data into a table? Thanks.

Recommended answer

Run the code below and you will get the desired data from that table. To try it on this very element, all you need to do is wrap the whole HTML snippet you pasted above in a triple-quoted string: html=''' '''

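import csv
from bs4 import BeautifulSoup

# `html` must hold the table markup pasted above, e.g. html = '''...'''
tree = BeautifulSoup(html, "lxml")  # the "lxml" parser needs the lxml package installed
table_tag = tree.select("table")[0]

# One list of cell texts per <tr>; the header row uses <th>, data rows use <td>
tab_data = [[item.text for item in row_data.select("th,td")]
            for row_data in table_tag.select("tr")]

# newline='' lets the csv module control line endings (avoids blank rows on Windows);
# the with-block closes the file so every row is flushed to disk
with open("table_data.csv", "w", newline='') as outfile:
    writer = csv.writer(outfile)
    for data in tab_data:
        writer.writerow(data)
        print(' '.join(data))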
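
Several cells in this table contain commas themselves ("Sep 14, 2017", "2,716,310,000"), which is why the rows go through csv.writer rather than being joined with commas by hand. The snippet below is a small sketch of that behaviour using only the standard library; the row is hard-coded from the sample table and an in-memory buffer stands in for the output file, both purely for illustration.

```python
import csv
import io

# One data row copied from the sample table; note the embedded commas.
row = ["Sep 14, 2017", "3875.37", "3920.60", "3153.86", "3154.95",
       "2,716,310,000", "64,191,600,000"]

# Write the row to an in-memory buffer instead of a real file.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerow(row)
line = buffer.getvalue()
print(line, end="")

# Reading the line back yields the original seven fields unchanged.
parsed = next(csv.reader(io.StringIO(line, newline="")))
print(parsed == row)
```

With the default settings, the writer wraps only the comma-containing fields in double quotes, so a spreadsheet or csv.reader still sees seven columns.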

To make it easier to follow, here is the same logic broken into pieces. The list comprehension above is really a nested for loop; written out explicitly, it looks like this:

from bs4 import BeautifulSoup

# `html` is the same triple-quoted string holding the table markup
soup = BeautifulSoup(html, "lxml")
table = soup.find('table')

list_of_rows = []
for row in table.find_all('tr'):               # outer loop: one pass per table row
    list_of_cells = []
    for cell in row.find_all(["th", "td"]):    # inner loop: header or data cells
        text = cell.text
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

for item in list_of_rows:
    print(' '.join(item))

Result:

Date Open High Low Close Volume Market Cap
Sep 14, 2017 3875.37 3920.60 3153.86 3154.95 2,716,310,000 64,191,600,000
Sep 13, 2017 4131.98 4131.98 3789.92 3882.59 2,219,410,000 68,432,200,000
Sep 12, 2017 4168.88 4344.65 4085.22 4130.81 1,864,530,000 69,033,400,000
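
If a column-oriented structure like the pricetable dictionary from the question is preferred, the rows printed above can be regrouped under their headers. The following is a minimal sketch, assuming list_of_rows has the shape produced by the nested loop (header row first). It is hard-coded here with two sample rows so the snippet runs on its own, and rows_to_columns is a hypothetical helper name, not part of BeautifulSoup.

```python
# Hard-coded sample in the shape produced by the nested loop above
# (header row first); in real use this comes from the parsed table.
list_of_rows = [
    ["Date", "Open", "High", "Low", "Close", "Volume", "Market Cap"],
    ["Sep 14, 2017", "3875.37", "3920.60", "3153.86", "3154.95",
     "2,716,310,000", "64,191,600,000"],
    ["Sep 13, 2017", "4131.98", "4131.98", "3789.92", "3882.59",
     "2,219,410,000", "68,432,200,000"],
]


def rows_to_columns(rows):
    """Turn [header, row1, row2, ...] into {header_name: [values...]}."""
    header, *body = rows
    # zip(*body) transposes the data rows into columns,
    # which are then paired with their header names.
    return {name: list(column) for name, column in zip(header, zip(*body))}


pricetable = rows_to_columns(list_of_rows)
print(pricetable["Date"])
print(pricetable["Close"])
```

A dictionary of lists in this shape can also be passed to pandas.DataFrame, which the question already imports, if further analysis is planned.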
