HTML file parsing in Python

Views: 197 Published: 2016/8/5 18:58:58 Tags: python html beautifulsoup nltk


Problem description

I have a very long HTML file that looks exactly like this one (html file). I want to parse the file so that I get the information in the form of a tuple.

Example:

<tr>
      <td>Cech</td>
      <td>Chelsea</td>
      <td>30</td>
      <td>£6.4</td>
</tr>

The above information should end up as ("Cech", "Chelsea", 30, 6.4). However, if you look closely at the link I posted, the HTML example above comes under an <h2>Goalkeepers</h2> tag, and I need this tag too. So the result tuple should look like ("Cech", "Chelsea", 30, 6.4, "Goalkeepers"). Further down the file, more groups of players come under <h2> tags for Midfielders, Defenders and Forwards.

I tried using the BeautifulSoup and nltk libraries and got lost. So now I have the following code:

import nltk
from urllib import urlopen

url = "http://fantasy.premierleague.com/player-list/"
html = urlopen(url).read()
raw = nltk.clean_html(html)
print raw

which just strips all the tags from the HTML file and gives something like this:

          Cech
          Chelsea
          30
          £6.4

I could write a clumsy piece of code that reads every line and assigns it to a tuple, but I cannot come up with a solution that also incorporates the player position (the string inside the <h2> tags). Any solution or suggestion will be greatly appreciated.

The reason I am inclined towards tuples is that I can use unpacking, and I plan on populating a MySQL table with the unpacked values.
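As an illustration of the unpacking plan described above, here is a minimal sketch. It assumes the rows are already in the requested shape (name, team, points, price, position). The table name and column names are hypothetical, and the standard-library sqlite3 module stands in for MySQL so the sketch runs without a database server.

```python
import sqlite3

# Hypothetical rows already in the shape the question asks for:
# (name, team, points, price, position)
players = [
    ("Cech", "Chelsea", 30, 6.4, "Goalkeepers"),
    ("Baines", "Everton", 43, 7.7, "Defenders"),
]

# Tuple unpacking gives each field its own name.
name, team, points, price, position = players[0]
print(name, team, points, price, position)

# sqlite3 is a stand-in for MySQL here (an assumption made for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE players ("
    "name TEXT, team TEXT, points INTEGER, price REAL, position TEXT)"
)
# executemany fills the five placeholders from each tuple in turn.
conn.executemany("INSERT INTO players VALUES (?, ?, ?, ?, ?)", players)
conn.commit()

rows = conn.execute(
    "SELECT name, points FROM players ORDER BY points DESC"
).fetchall()
print(rows)
```

With a MySQL driver the overall pattern should be similar, although such drivers typically use %s placeholders instead of ?.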

Solution

from bs4 import BeautifulSoup
from pprint import pprint

# `html` is the page source fetched by the code in the question
soup = BeautifulSoup(html, "html.parser")
h2s = soup.select("h2")        # get all h2 elements (the positions)
tables = soup.select("table")  # get all tables

players = []
for i, table in enumerate(tables):
    # Every h2 element is followed by 2 tables (8 tables, 4 h2 elements),
    # so table i belongs to heading i // 2.
    title = h2s[i // 2].text
    for tr in table.select("tr"):
        player = (title,)  # start a player tuple with the position
        for td in tr.select("td"):
            player = player + (td.text,)  # add each cell's text to the player
        if len(player) > 1:
            # Keep the row only if it has td cells, i.e. it is more than ("Goalkeepers",)
            players.append(player)
pprint(players)

output:

[('Goalkeepers', 'Cech', 'Chelsea', '30', '£6.4'),
 ('Goalkeepers', 'Hart', 'Man City', '28', '£6.4'),
 ('Goalkeepers', 'Krul', 'Newcastle', '21', '£5.0'),
 ('Goalkeepers', 'Ruddy', 'Norwich', '25', '£5.0'),
 ('Goalkeepers', 'Vorm', 'Swansea', '19', '£5.0'),
 ('Goalkeepers', 'Stekelenburg', 'Fulham', '6', '£4.9'),
 ('Goalkeepers', 'Pantilimon', 'Man City', '0', '£4.9'),
 ('Goalkeepers', 'Lindegaard', 'Man Utd', '0', '£4.9'),
 ('Goalkeepers', 'Butland', 'Stoke City', '0', '£4.9'),
 ('Goalkeepers', 'Foster', 'West Brom', '13', '£4.9'),
 ('Goalkeepers', 'Viviano', 'Arsenal', '0', '£4.8'),
 ('Goalkeepers', 'Schwarzer', 'Chelsea', '0', '£4.7'),
 ('Goalkeepers', 'Boruc', 'Southampton', '42', '£4.7'),
 ('Goalkeepers', 'Myhill', 'West Brom', '15', '£4.5'),
 ('Goalkeepers', 'Fabianski', 'Arsenal', '0', '£4.4'),
 ('Goalkeepers', 'Gomes', 'Tottenham', '0', '£4.4'),
 ('Goalkeepers', 'Friedel', 'Tottenham', '0', '£4.4'),
 ('Goalkeepers', 'Henderson', 'West Ham', '0', '£4.0'),
 ('Defenders', 'Baines', 'Everton', '43', '£7.7'),
 ('Defenders', 'Vertonghen', 'Tottenham', '34', '£7.0'),
 ('Defenders', 'Taylor', 'Cardiff City', '14', '£4.5'),
 ('Defenders', 'Zverotic', 'Fulham', '0', '£4.5'),
 ('Defenders', 'Davies', 'Hull City', '28', '£4.5'),
 ('Defenders', 'Flanagan', 'Liverpool', '0', '£4.5'),
 ('Defenders', 'Dawson', 'West Brom', '0', '£3.9'),
 ('Defenders', 'Potts', 'West Ham', '0', '£3.9'),
 ('Defenders', 'Spence', 'West Ham', '0', '£3.9'),
 ('Midfielders', 'Özil', 'Arsenal', '24', '£10.6'),
 ('Midfielders', 'Redmond', 'Norwich', '20', '£5.0'),
 ('Midfielders', 'Mavrias', 'Sunderland', '5', '£5.0'),
 ('Midfielders', 'Gera', 'West Brom', '0', '£5.0'),
 ('Midfielders', 'Essien', 'Chelsea', '0', '£4.9'),
 ('Midfielders', 'Brown', 'West Brom', '0', '£4.3'),
 ('Forwards', 'van Persie', 'Man Utd', '24', '£13.9'),
 ('Forwards', 'Cornelius', 'Cardiff City', '1', '£5.4'),
 ('Forwards', 'Elmander', 'Norwich', '7', '£5.4'),
 ('Forwards', 'Murray', 'Crystal Palace', '0', '£5.3'),
 ('Forwards', 'Vydra', 'West Brom', '2', '£5.3'),
 ('Forwards', 'Proschwitz', 'Hull City', '0', '£4.3')]
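
The tuples above hold strings and start with the position, while the question asks for numeric points and price with the position last. A minimal sketch of that conversion follows. The helper name to_record is hypothetical, and it assumes every price starts with a "£" sign, as in the output shown.

```python
# Two rows copied from the output above.
raw_players = [
    ("Goalkeepers", "Cech", "Chelsea", "30", "£6.4"),
    ("Midfielders", "Özil", "Arsenal", "24", "£10.6"),
]


def to_record(row):
    """Reorder one scraped row and convert its numeric fields.

    Hypothetical helper; assumes the five-field layout shown above.
    """
    position, name, team, points, price = row  # tuple unpacking
    return (name, team, int(points), float(price.lstrip("£")), position)


records = [to_record(row) for row in raw_players]
print(records)
```

Each resulting tuple has the shape (name, team, points, price, position) described in the question.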
