Python, parsing HTML


Question

Thanks to the kind users of this site, I have some idea of how to use re as an alternative to a non-standard Python module, so that my script works with minimal overhead. Today I've been experimenting with parsing modules. I've come across BeautifulSoup, which looks great, but I don't understand it.

For educational purposes, I'd like to strip the following information from http://yify-torrents.com/browse-movie (please don't tell me to use a web-crawler, I'm not trying to crawl the whole site - just extract the information from this page to learn how parsing modules work!)

Movie Title
Quality
Torrent Link

There are 22 of these items, and I'd like them stored in lists in order, i.e. item_1, item_2. Each list needs to contain these three values. For instance:

item_1 = ["James Bond: Casino Royale (2006)", "720p", "http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent"]
item_2 = ["Pitch Perfect (2012)", "720p", "http://yify-torrents.com/download/start/Pitch_Perfect_2012.torrent"]

Then, to keep matters simple, I just want to print every item to the console. To make things more difficult, however, these items don't have identifiers on the page, so the information needs to be kept strictly in order. That's fine in principle, but all I'm getting is either the entire source contained in each list item, or empty items! An example item div is as follows:

<div class="browse-info">
    <span class="info">
        <h3><a href="http://yify-torrents.com/movie/James_Bond_Casino_Royale_2006">James Bond: Casino Royale (2006)</a></h3>
        <p><b>Size:</b> 1018.26 MB</p>
        <p><b>Quality:</b> 720p</p>
        <p><b>Genre:</b> Action | Crime</p>
        <p><b>IMDB Rating:</b> 7.9/10</p>
            <span>
                <p class="peers"><b>Peers:</b> 698</p>
                <p class="peers"><b>Seeds:</b> 356</p>
            </span>
    </span>
    <span class="links">
        <a href="http://yify-torrents.com/movie/James_Bond_Casino_Royale_2006" class="std-btn-small mright">View Info<span></span></a>
        <a href="http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent" class="std-btn-small mleft torrentDwl" data-movieID="2620" data-torrentID="2812">Download<span></span></a>
    </span> 
</div>

Any ideas? Would someone please do me the honours of giving me an example of how to do this? I'm not sure beautiful soup accommodates all of my requirements! PS. Sorry for the poor English, it's not my first language.

Solution

from bs4 import BeautifulSoup
import urllib2

f=urllib2.urlopen('http://yify-torrents.com/browse-movie')
html=f.read()
soup=BeautifulSoup(html)


In [25]: for i in soup.findAll("div",{"class":"browse-info"}):
    ...:     name=i.find('a').text
    ...:     for x in i.findAll('b'):
    ...:         if x.text=="Quality:":
    ...:             quality=x.parent.text
    ...:     link=i.find('a',{"class":"std-btn-small mleft torrentDwl"})['href']
    ...:     print [name,quality,link]
    ...:     
[u'James Bond: Casino Royale (2006)', u'Quality: 720p', 'http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent']
[u'Pitch Perfect (2012)', u'Quality: 720p', 'http://yify-torrents.com/download/start/Pitch_Perfect_2012.torrent']
...

Or, to get exactly the output you wanted, take only the text node that sits directly inside the <p>, which skips the <b> label:

In [26]: for i in soup.findAll("div",{"class":"browse-info"}):
    ...:     name=i.find('a').text
    ...:     for x in i.findAll('b'):
    ...:         if x.text=="Quality:":
    ...:             quality=x.parent.find(text=True, recursive=False).strip()
    ...:     link=i.find('a',{"class":"std-btn-small mleft torrentDwl"})['href']
    ...:     print [name,quality,link]
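
Since the question asks about keeping dependencies to a minimum, here is a rough sketch of the same extraction using only the standard library's html.parser (Python 3) instead of BeautifulSoup. It is not part of the answer above. The names SAMPLE, MovieParser and extract_items are made up for illustration, SAMPLE is a trimmed copy of the div from the question, and the sketch assumes the markup looks like that sample (no nested div inside browse-info, no void tags such as <br> inside the <p>).

```python
from html.parser import HTMLParser

# Trimmed copy of the example div from the question (assumed page structure).
SAMPLE = """
<div class="browse-info">
    <span class="info">
        <h3><a href="http://yify-torrents.com/movie/James_Bond_Casino_Royale_2006">James Bond: Casino Royale (2006)</a></h3>
        <p><b>Size:</b> 1018.26 MB</p>
        <p><b>Quality:</b> 720p</p>
    </span>
    <span class="links">
        <a href="http://yify-torrents.com/movie/James_Bond_Casino_Royale_2006" class="std-btn-small mright">View Info<span></span></a>
        <a href="http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent" class="std-btn-small mleft torrentDwl">Download<span></span></a>
    </span>
</div>
"""


class MovieParser(HTMLParser):
    """Collects [title, quality, torrent link] for each browse-info div."""

    def __init__(self):
        super().__init__()
        self.items = []             # finished [title, quality, link] lists, in page order
        self.current = None         # the item currently being filled in
        self.tag_stack = []         # currently open tags, innermost last
        self.after_quality = False  # True right after a <b>Quality:</b> label

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        self.tag_stack.append(tag)
        if tag == "div" and "browse-info" in classes:
            # A new movie block starts here.
            self.current = [None, None, None]
            self.items.append(self.current)
        elif tag == "a" and self.current is not None and "torrentDwl" in classes:
            # The download button carries the torrent link.
            self.current[2] = attrs.get("href")

    def handle_endtag(self, tag):
        if tag in self.tag_stack:
            # Pop back to (and including) the innermost matching open tag.
            while self.tag_stack.pop() != tag:
                pass
        if tag == "p":
            self.after_quality = False
        elif tag == "div":
            # Assumes browse-info contains no nested div.
            self.current = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self.current is None:
            return
        if self.tag_stack[-2:] == ["h3", "a"]:
            self.current[0] = text          # title is the link text inside <h3>
        elif self.tag_stack[-1:] == ["b"] and text == "Quality:":
            self.after_quality = True       # the value follows the label
        elif self.tag_stack[-1:] == ["p"] and self.after_quality:
            self.current[1] = text          # text directly inside <p>, after the label
            self.after_quality = False


def extract_items(html):
    """Return a list of [title, quality, link] lists found in html."""
    parser = MovieParser()
    parser.feed(html)
    parser.close()
    return parser.items


for item in extract_items(SAMPLE):
    print(item)
```

For the live page you would pass the downloaded and decoded HTML to extract_items instead of SAMPLE. The trade-off compared with BeautifulSoup is that you track the open tags yourself, so the code is more sensitive to changes in the page's markup.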
