HTML file parsing in Python
I have a very long HTML file that looks exactly like this: html file. I want to parse the file so that I get the information in the form of a tuple.
Example:
<tr>
<td>Cech</td>
<td>Chelsea</td>
<td>30</td>
<td>£6.4</td>
</tr>
The above information should become ("Cech", "Chelsea", 30, 6.4). However, if you look closely at the link I posted, the HTML example above comes under a <h2>Goalkeepers</h2> tag, and I need this tag too. So the resulting tuple should look like ("Cech", "Chelsea", 30, 6.4, "Goalkeepers"). Further down the file, more players come under <h2> tags for Midfielders, Defenders and Forwards.
I tried using the BeautifulSoup and NLTK libraries and got lost. So far I have the following code:
import nltk
from urllib import urlopen
url = "http://fantasy.premierleague.com/player-list/"
html = urlopen(url).read()
raw = nltk.clean_html(html)
print raw
This just strips all the tags from the HTML file and gives something like this:
Cech
Chelsea
30
£6.4
Although I could write a crude piece of code that reads every line and assigns it to a tuple, I cannot come up with a solution that also incorporates the player position (the string in the <h2> tags). Any solutions or suggestions will be greatly appreciated.
The reason I am inclined towards tuples is that I can use unpacking: I plan on populating a MySQL table with the unpacked values.
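To illustrate the unpacking plan, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL, so it runs without a database server. The table name players and its column names are hypothetical. MySQL drivers commonly use %s placeholders rather than ?, so check the driver's documentation before adapting this.

```python
import sqlite3

# Sample rows in the tuple shape described above.
rows = [
    ("Cech", "Chelsea", 30, 6.4, "Goalkeepers"),
    ("Hart", "Man City", 28, 6.4, "Goalkeepers"),
]

conn = sqlite3.connect(":memory:")  # in-memory database, stand-in for MySQL
conn.execute(
    "CREATE TABLE players "
    "(name TEXT, club TEXT, points INTEGER, price REAL, position TEXT)"
)

# Each tuple supplies one value per placeholder, in order.
conn.executemany("INSERT INTO players VALUES (?, ?, ?, ?, ?)", rows)
conn.commit()

# Tuple unpacking also works when reading the rows back.
for name, club, points, price, position in conn.execute(
    "SELECT * FROM players ORDER BY name"
):
    print(name, club, points, price, position)

stored = conn.execute("SELECT COUNT(*) FROM players").fetchone()[0]
```

Passing the tuples as parameters, rather than formatting them into the SQL string, leaves quoting and escaping to the driver.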
from bs4 import BeautifulSoup
from pprint import pprint

# html is the page source fetched by the code in the question
soup = BeautifulSoup(html, "html.parser")
h2s = soup.select("h2")        # all h2 elements
tables = soup.select("table")  # all tables

first = True
title = ""
players = []
for i, table in enumerate(tables):
    if first:
        # every h2 element is followed by 2 tables (8 tables, 4 h2 elements),
        # so each pair of tables shares one h2
        title = h2s[i // 2].text
    for tr in table.select("tr"):
        player = (title,)  # start a player tuple with the position
        for td in tr.select("td"):
            player = player + (td.text,)  # append each td value
        if len(player) > 1:
            # keep the row only if it has td cells, i.e. it is more
            # than just ("Goalkeepers",)
            players.append(player)
    first = not first
pprint(players)
Output:
[('Goalkeepers', 'Cech', 'Chelsea', '30', '£6.4'),
('Goalkeepers', 'Hart', 'Man City', '28', '£6.4'),
('Goalkeepers', 'Krul', 'Newcastle', '21', '£5.0'),
('Goalkeepers', 'Ruddy', 'Norwich', '25', '£5.0'),
('Goalkeepers', 'Vorm', 'Swansea', '19', '£5.0'),
('Goalkeepers', 'Stekelenburg', 'Fulham', '6', '£4.9'),
('Goalkeepers', 'Pantilimon', 'Man City', '0', '£4.9'),
('Goalkeepers', 'Lindegaard', 'Man Utd', '0', '£4.9'),
('Goalkeepers', 'Butland', 'Stoke City', '0', '£4.9'),
('Goalkeepers', 'Foster', 'West Brom', '13', '£4.9'),
('Goalkeepers', 'Viviano', 'Arsenal', '0', '£4.8'),
('Goalkeepers', 'Schwarzer', 'Chelsea', '0', '£4.7'),
('Goalkeepers', 'Boruc', 'Southampton', '42', '£4.7'),
('Goalkeepers', 'Myhill', 'West Brom', '15', '£4.5'),
('Goalkeepers', 'Fabianski', 'Arsenal', '0', '£4.4'),
('Goalkeepers', 'Gomes', 'Tottenham', '0', '£4.4'),
('Goalkeepers', 'Friedel', 'Tottenham', '0', '£4.4'),
('Goalkeepers', 'Henderson', 'West Ham', '0', '£4.0'),
('Defenders', 'Baines', 'Everton', '43', '£7.7'),
('Defenders', 'Vertonghen', 'Tottenham', '34', '£7.0'),
('Defenders', 'Taylor', 'Cardiff City', '14', '£4.5'),
('Defenders', 'Zverotic', 'Fulham', '0', '£4.5'),
('Defenders', 'Davies', 'Hull City', '28', '£4.5'),
('Defenders', 'Flanagan', 'Liverpool', '0', '£4.5'),
('Defenders', 'Dawson', 'West Brom', '0', '£3.9'),
('Defenders', 'Potts', 'West Ham', '0', '£3.9'),
('Defenders', 'Spence', 'West Ham', '0', '£3.9'),
('Midfielders', 'Özil', 'Arsenal', '24', '£10.6'),
('Midfielders', 'Redmond', 'Norwich', '20', '£5.0'),
('Midfielders', 'Mavrias', 'Sunderland', '5', '£5.0'),
('Midfielders', 'Gera', 'West Brom', '0', '£5.0'),
('Midfielders', 'Essien', 'Chelsea', '0', '£4.9'),
('Midfielders', 'Brown', 'West Brom', '0', '£4.3'),
('Forwards', 'van Persie', 'Man Utd', '24', '£13.9'),
('Forwards', 'Cornelius', 'Cardiff City', '1', '£5.4'),
('Forwards', 'Elmander', 'Norwich', '7', '£5.4'),
('Forwards', 'Murray', 'Crystal Palace', '0', '£5.3'),
('Forwards', 'Vydra', 'West Brom', '2', '£5.3'),
('Forwards', 'Proschwitz', 'Hull City', '0', '£4.3')]
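The loop above relies on each <h2> being followed by exactly two tables, and it returns every value as a string with the position first. As a hedged alternative, the sketch below swaps BeautifulSoup for the standard library's html.parser, walks the document in order, remembers the most recent <h2> text, and emits typed tuples in the shape the question asked for. The class name PlayerTableParser, the sample markup, and the assumption that every player row has exactly four <td> cells (name, club, points, price) are illustrative and not taken from the real page.

```python
from html.parser import HTMLParser


class PlayerTableParser(HTMLParser):
    """Collect (name, club, points, price, position) tuples in document order."""

    def __init__(self):
        super().__init__()
        self.players = []
        self.position = ""  # text of the most recent <h2>
        self._row = None    # cells of the <tr> being read, or None
        self._text = None   # text pieces of the current <h2>/<td>, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("h2", "td"):
            self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._text is not None:
            self.position = "".join(self._text).strip()
            self._text = None
        elif tag == "td" and self._text is not None:
            if self._row is not None:
                self._row.append("".join(self._text).strip())
            self._text = None
        elif tag == "tr" and self._row is not None:
            # Assumption: a player row has exactly four <td> cells;
            # header rows built from <th> collect no cells and are skipped.
            if len(self._row) == 4:
                name, club, points, price = self._row
                self.players.append(
                    (name, club, int(points), float(price.lstrip("£")),
                     self.position)
                )
            self._row = None


# Made-up sample with the same structure as the snippet in the question.
sample = """
<h2>Goalkeepers</h2>
<table>
  <tr><th>Player</th><th>Team</th><th>Points</th><th>Cost</th></tr>
  <tr><td>Cech</td><td>Chelsea</td><td>30</td><td>£6.4</td></tr>
</table>
<h2>Defenders</h2>
<table>
  <tr><td>Baines</td><td>Everton</td><td>43</td><td>£7.7</td></tr>
</table>
"""

parser = PlayerTableParser()
parser.feed(sample)
parser.close()
print(parser.players)
```

Because the position is taken from whichever <h2> was seen last, this version does not depend on how many tables follow each heading. It does assume the points and price cells always hold numbers, so real data may need extra error handling.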