HTML file parsing in Python

Problem description

I have a very long HTML file that looks exactly like this: html file. I want to parse the file so that I get the information in the form of a tuple.

Example:

<tr>
      <td>Cech</td>
      <td>Chelsea</td>
      <td>30</td>
      <td>£6.4</td>
</tr>

The above information will look like ("Cech", "Chelsea", 30, 6.4). However, if you look closely at the link I posted, the HTML example above comes under a <h2>Goalkeepers</h2> tag. I need this tag too. So basically the result tuple will look like ("Cech", "Chelsea", 30, 6.4, "Goalkeepers"). Further down the file, more players come under <h2> tags for Midfielders, Defenders and Forwards.

I tried using the BeautifulSoup and NLTK libraries and got lost. So now I have the following code:

import nltk
from urllib import urlopen

url = "http://fantasy.premierleague.com/player-list/"
html = urlopen(url).read()
raw = nltk.clean_html(html)
print raw

which just strips all the tags from the HTML file and gives something like this:

          Cech
          Chelsea
          30
          £6.4

Although I could write a crude piece of code that reads every line and assigns it to a tuple, I cannot come up with a solution that also incorporates the player position (the string present in the <h2> tags). Any solutions or suggestions will be greatly appreciated.

The reason I am inclined towards using tuples is so that I can use unpacking; I plan on populating a MySQL table with the unpacked values.
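
As a minimal sketch of the tuple shape described above, the raw cell strings can be converted to typed values and then unpacked into a parameterized INSERT. The helper name `to_record`, the `players` table and its column names are hypothetical. The standard-library `sqlite3` module stands in for MySQL here so the example runs without a server; MySQL drivers typically use `%s` placeholders instead of `?`.

```python
import sqlite3


def to_record(cells, position):
    """Turn ['Cech', 'Chelsea', '30', '£6.4'] plus a position into a typed tuple."""
    name, club, points, price = (cell.strip() for cell in cells)
    # Drop the leading pound sign before converting the price to a float.
    return (name, club, int(points), float(price.lstrip("£")), position)


record = to_record(["Cech", "Chelsea", "30", "£6.4"], "Goalkeepers")
name, club, points, price, position = record  # tuple unpacking

# In-memory SQLite database used only as a stand-in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE players (name TEXT, club TEXT, points INTEGER, price REAL, position TEXT)"
)
conn.execute("INSERT INTO players VALUES (?, ?, ?, ?, ?)", record)
stored = conn.execute("SELECT * FROM players").fetchall()
print(stored)
```

Passing the tuple as the parameter sequence lets the driver handle quoting, which avoids building the SQL string by hand.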

Solution

from bs4 import BeautifulSoup
from pprint import pprint

# html is the page source fetched in the question
soup = BeautifulSoup(html, "html.parser")
h2s = soup.select("h2")  # get all h2 elements
tables = soup.select("table")  # get all tables

players = []
for i, table in enumerate(tables):
    # Every h2 element has 2 tables (table count = 8, h2 count = 4),
    # so tables 2k and 2k+1 both belong to h2 number k.
    title = h2s[i // 2].text
    for tr in table.select("tr"):
        player = (title,)  # start a player tuple with the position
        for td in tr.select("td"):
            player = player + (td.text,)  # add each td's text to the player
        if len(player) > 1:
            # Add the row only if it holds player data and is not just ("Goalkeepers",)
            players.append(player)
pprint(players)

Output:

[('Goalkeepers', 'Cech', 'Chelsea', '30', '£6.4'),
 ('Goalkeepers', 'Hart', 'Man City', '28', '£6.4'),
 ('Goalkeepers', 'Krul', 'Newcastle', '21', '£5.0'),
 ('Goalkeepers', 'Ruddy', 'Norwich', '25', '£5.0'),
 ('Goalkeepers', 'Vorm', 'Swansea', '19', '£5.0'),
 ('Goalkeepers', 'Stekelenburg', 'Fulham', '6', '£4.9'),
 ('Goalkeepers', 'Pantilimon', 'Man City', '0', '£4.9'),
 ('Goalkeepers', 'Lindegaard', 'Man Utd', '0', '£4.9'),
 ('Goalkeepers', 'Butland', 'Stoke City', '0', '£4.9'),
 ('Goalkeepers', 'Foster', 'West Brom', '13', '£4.9'),
 ('Goalkeepers', 'Viviano', 'Arsenal', '0', '£4.8'),
 ('Goalkeepers', 'Schwarzer', 'Chelsea', '0', '£4.7'),
 ('Goalkeepers', 'Boruc', 'Southampton', '42', '£4.7'),
 ('Goalkeepers', 'Myhill', 'West Brom', '15', '£4.5'),
 ('Goalkeepers', 'Fabianski', 'Arsenal', '0', '£4.4'),
 ('Goalkeepers', 'Gomes', 'Tottenham', '0', '£4.4'),
 ('Goalkeepers', 'Friedel', 'Tottenham', '0', '£4.4'),
 ('Goalkeepers', 'Henderson', 'West Ham', '0', '£4.0'),
 ('Defenders', 'Baines', 'Everton', '43', '£7.7'),
 ('Defenders', 'Vertonghen', 'Tottenham', '34', '£7.0'),
 ('Defenders', 'Taylor', 'Cardiff City', '14', '£4.5'),
 ('Defenders', 'Zverotic', 'Fulham', '0', '£4.5'),
 ('Defenders', 'Davies', 'Hull City', '28', '£4.5'),
 ('Defenders', 'Flanagan', 'Liverpool', '0', '£4.5'),
 ('Defenders', 'Dawson', 'West Brom', '0', '£3.9'),
 ('Defenders', 'Potts', 'West Ham', '0', '£3.9'),
 ('Defenders', 'Spence', 'West Ham', '0', '£3.9'),
 ('Midfielders', 'Özil', 'Arsenal', '24', '£10.6'),
 ('Midfielders', 'Redmond', 'Norwich', '20', '£5.0'),
 ('Midfielders', 'Mavrias', 'Sunderland', '5', '£5.0'),
 ('Midfielders', 'Gera', 'West Brom', '0', '£5.0'),
 ('Midfielders', 'Essien', 'Chelsea', '0', '£4.9'),
 ('Midfielders', 'Brown', 'West Brom', '0', '£4.3'),
 ('Forwards', 'van Persie', 'Man Utd', '24', '£13.9'),
 ('Forwards', 'Cornelius', 'Cardiff City', '1', '£5.4'),
 ('Forwards', 'Elmander', 'Norwich', '7', '£5.4'),
 ('Forwards', 'Murray', 'Crystal Palace', '0', '£5.3'),
 ('Forwards', 'Vydra', 'West Brom', '2', '£5.3'),
 ('Forwards', 'Proschwitz', 'Hull City', '0', '£4.3')]
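
The answer above relies on each <h2> being followed by exactly two tables. As a hedged alternative sketch, not part of the original answer, the position can instead be taken from whichever <h2> was seen most recently while reading the document. The version below uses the standard-library `html.parser` rather than BeautifulSoup; the class name `PlayerTableParser` and the `sample` markup are hypothetical, and the real page may be structured differently.

```python
from html.parser import HTMLParser


class PlayerTableParser(HTMLParser):
    """Collect (position, cell, ...) tuples, using the most recent <h2> as position."""

    def __init__(self):
        super().__init__()
        self.players = []
        self._position = ""
        self._row = None      # cells of the <tr> being read, or None outside a row
        self._capture = None  # "h2" or "td" while inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "h2":
            self._capture = "h2"
            self._position = ""
        elif tag == "td" and self._row is not None:
            self._capture = "td"
            self._row.append("")

    def handle_data(self, data):
        if self._capture == "h2":
            self._position += data
        elif self._capture == "td":
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag in ("h2", "td"):
            self._capture = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip header rows that contain no <td> cells
                cells = tuple(cell.strip() for cell in self._row)
                self.players.append((self._position.strip(),) + cells)
            self._row = None


# Hypothetical markup shaped like the snippet in the question.
sample = """
<h2>Goalkeepers</h2>
<table><tr><th>Player</th></tr>
<tr><td>Cech</td><td>Chelsea</td><td>30</td><td>£6.4</td></tr></table>
<h2>Defenders</h2>
<table><tr><td>Baines</td><td>Everton</td><td>43</td><td>£7.7</td></tr></table>
"""

parser = PlayerTableParser()
parser.feed(sample)
parser.close()
print(parser.players)
```

Because the heading is tracked as the document is read, this keeps working if a section has one table or three, at the cost of a little more bookkeeping than the index arithmetic in the answer.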
