BeautifulSoup: How do I extract all the <li>s from a list of <ul>s that contains some nested <ul>s?

Views: 143 Posted: 2016/8/5 18:58:00 Tags: python screen-scraping beautifulsoup

Question

My source code looks like:

<h3>Header3 (Start here)</h3>
<ul>
    <li>List items</li>
    <li>Etc...</li>
</ul>
<h3>Header 3</h3>
<ul>
    <li>List items</li>
    <ul>
        <li>Nested list items</li>
        <li>Nested list items</li></ul>
    <li>List items</li>
</ul>
<h2>Header 2 (end here)</h2>

I'd like all the "li" tags following the first "h3" tag and stopping at the next "h2" tag, including all nested li tags.

firstH3 = soup.find('h3')

correctly finds the place I'd like to start.

firstH3 = soup.find('h3') # Start here
uls = []
for nextSibling in firstH3.findNextSiblings():
    if nextSibling.name == 'h2':
        break
    if nextSibling.name == 'ul':
        uls.append(nextSibling)

gives me a list of ULs, each with LI contents that I need.
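The sibling walk above can be tried in isolation. The following is a minimal sketch, assuming the bs4 package (BeautifulSoup 4) with Python's built-in "html.parser"; SAMPLE_HTML and collect_uls are names made up for this illustration, and find_next_siblings is the bs4 spelling of findNextSiblings. An extra list is placed after the h2 only to show that the walk stops there.

```python
from bs4 import BeautifulSoup

# Hypothetical sample mirroring the structure from the question.
SAMPLE_HTML = """
<h3>Header3 (Start here)</h3>
<ul>
    <li>List items</li>
    <li>Etc...</li>
</ul>
<h3>Header 3</h3>
<ul>
    <li>List items</li>
    <ul>
        <li>Nested list items</li>
        <li>Nested list items</li></ul>
    <li>List items</li>
</ul>
<h2>Header 2 (end here)</h2>
<ul><li>After the h2, not wanted</li></ul>
"""


def collect_uls(soup):
    """Return the <ul> siblings between the first <h3> and the next <h2>."""
    first_h3 = soup.find('h3')
    uls = []
    for sibling in first_h3.find_next_siblings():
        if sibling.name == 'h2':
            break  # stop at the first <h2> after the starting <h3>
        if sibling.name == 'ul':
            uls.append(sibling)
    return uls


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
uls = collect_uls(soup)
```

Only top-level lists end up in uls here; a nested <ul> is a descendant of one of them rather than a sibling of the h3.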

Excerpt from the uls list:

<ul>
...
    <li><i><a href="/wiki/Agent_Cody_Banks" title="Agent Cody Banks">Agent Cody Banks</a></i> (2003)</li>
    <li><i><a href="/wiki/Agent_Cody_Banks_2:_Destination_London" title="Agent Cody Banks 2: Destination London">Agent Cody Banks 2: Destination London</a></i> (2004)</li>
    <li>Air Bud series:
        <ul>
            <li><i><a href="/wiki/Air_Bud:_World_Pup" title="Air Bud: World Pup">Air Bud: World Pup</a></i> (2000)</li>
            <li><i><a href="/wiki/Air_Bud:_Seventh_Inning_Fetch" title="Air Bud: Seventh Inning Fetch">Air Bud: Seventh Inning Fetch</a></i> (2002)</li>
            <li><i><a href="/wiki/Air_Bud:_Spikes_Back" title="Air Bud: Spikes Back">Air Bud: Spikes Back</a></i> (2003)</li>
            <li><i><a href="/wiki/Air_Buddies" title="Air Buddies">Air Buddies</a></i> (2006)</li>
        </ul>
    </li>
    <li><i><a href="/wiki/Akeelah_and_the_Bee" title="Akeelah and the Bee">Akeelah and the Bee</a></i> (2006)</li>
...
</ul>

But I'm unsure of where to go from here. I'm a newbie programmer trying to jump into Python by building a script that scrapes http://en.wikipedia.org/wiki/2000s_in_film and extracts a list of "Movie Title (Year)".

Update:

Final code:

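lis = []
for ul in uls:
    for li in ul.findAll('li'):
        # An LI that wraps a nested UL repeats the nested LIs' text.
        # Note: break also ends the loop over this ul; continue would
        # skip only this one li.
        if li.find('ul'):
            break
        lis.append(li)

for li in lis:
    # Python 3 print; the original Python 2 line encoded to UTF-8 first.
    print(li.text)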

The if/break throws out the LIs that contain ULs, since the nested LIs would otherwise be duplicated.

The printed output is now:

102 Dalmatians (2000)
10th & Wolf (2006)
11:14 (2006)
12:08 East of Bucharest (2006)
13 Going on 30 (2004)
1408 (2007)

Thanks
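For reference, here is a hedged sketch of the duplicate-avoiding step, assuming bs4 (BeautifulSoup 4) and "html.parser". EXCERPT is a trimmed, hypothetical copy of the list excerpt above and leaf_li_texts is an illustrative name, not part of the original post. It uses continue rather than break: break leaves the inner loop at the first li that contains a ul, so any later items of that ul would not be collected, while continue skips only the wrapping li.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical version of the excerpt shown in the question.
EXCERPT = """
<ul>
    <li><i><a href="/wiki/Agent_Cody_Banks">Agent Cody Banks</a></i> (2003)</li>
    <li>Air Bud series:
        <ul>
            <li><i><a href="/wiki/Air_Bud:_World_Pup">Air Bud: World Pup</a></i> (2000)</li>
            <li><i><a href="/wiki/Air_Buddies">Air Buddies</a></i> (2006)</li>
        </ul>
    </li>
    <li><i><a href="/wiki/Akeelah_and_the_Bee">Akeelah and the Bee</a></i> (2006)</li>
</ul>
"""


def leaf_li_texts(uls):
    """Collect the text of every <li> that does not itself contain a <ul>."""
    texts = []
    for ul in uls:
        # find_all('li') is recursive, so nested <li>s are visited as well.
        for li in ul.find_all('li'):
            if li.find('ul'):
                continue  # skip the wrapper; its nested <li>s come up on their own
            # Collapse internal whitespace so each entry is one clean line.
            texts.append(" ".join(li.get_text().split()))
    return texts


soup = BeautifulSoup(EXCERPT, 'html.parser')
titles = leaf_li_texts([soup.find('ul')])  # pass only the top-level list
for title in titles:
    print(title)
```

Passing only top-level lists matters: if a nested ul were also in the input, its items would be collected twice.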
