Attempting a Nested Scrape Using BeautifulSoup


Question

My code is as follows:

<h1><a name="hello">Hello</a></h1>
<div class="colmask">
<div class="box box_1">
<h4><a>My Favorite Number is</a></h4>
<ul><li><a>1</a></li></ul>
<ul><li><a>2</a></li></ul>
<ul><li><a>3</a></li></ul>
<ul><li><a>4</a></li></ul>
</div>
<div class="box box_2">
<h4><a>Your Favorite Number is</a></h4>
<ul><li><a>1</a></li></ul>
<ul><li><a>2</a></li></ul>
<ul><li><a>3</a></li></ul>
<ul><li><a>4</a></li></ul>
</div>
</div>
<h1 name="goodbye"><a>Goodbye</a></h1>
<div class="colmask">
<div class="box box_1">
<h4><a>Their Favorite Number is</a></h4>
<ul><li><a>1</a></li></ul>
<ul><li><a>2</a></li></ul>
<ul><li><a>3</a></li></ul>
<ul><li><a>4</a></li></ul>
</div>
<div class="box box_2">
<h4><a>Our Favorite Number is</a></h4>
<ul><li><a>1</a></li></ul>
<ul><li><a>2</a></li></ul>
<ul><li><a>3</a></li></ul>
<ul><li><a>4</a></li></ul>
</div>
</div>

I am not looping over this properly, and I don't know how to iterate correctly, because I keep grouping all the values together. Can someone put me on the right track? I tried using findNext(), nextSibling, and findAll(), but I am failing.

The output I am hoping for is:

Hello : My Favorite Number is : 1
Hello : My Favorite Number is : 2
Hello : My Favorite Number is : 3
Hello : My Favorite Number is : 4
Hello : Your Favorite Number is : 1
Hello : Your Favorite Number is : 2
Hello : Your Favorite Number is : 3
Hello : Your Favorite Number is : 4
Goodbye : Their Favorite Number is : 1
Goodbye : Their Favorite Number is : 2
Goodbye : Their Favorite Number is : 3
Goodbye : Their Favorite Number is : 4
Goodbye : Our Favorite Number is : 1
Goodbye : Our Favorite Number is : 2
Goodbye : Our Favorite Number is : 3
Goodbye : Our Favorite Number is : 4

Solution

If you were having problems with nextSibling, it's because your HTML actually looks like this:

<h1><a name="hello">Hello</a></h1>\n #<---newline
<div class="colmask">

See the newline after the </h1>? Even though a newline is invisible, it is still text, so it becomes a BeautifulSoup element (a NavigableString), and it is the nextSibling of the <h1> tag.

Newlines can also present problems when trying to get, say, the third child of the following <div>:

<div>
  <div>hello</div>
  <div>world</div>
  <div>goodbye</div>
</div>

Here is the numbering of the children:

<div>\n #<---newline plus spaces at start of next line = child 0
  <div>hello</div>\n #<--newline plus spaces at start of next line = child 2
  <div>world</div>\n #<--newline plus spaces at start of next line = child 4
  <div>goodbye</div>\n #<--newline = child 6
</div>

The divs are actually children 1, 3, and 5; the even-numbered children are whitespace text. If you are having trouble navigating parsed HTML, the newlines at the end of each line are very often what is tripping you up. They always have to be accounted for when you reason about where things are located.
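
The numbering above can be checked without BeautifulSoup. The sketch below swaps in the standard library's xml.dom.minidom, which is a different parser but also keeps whitespace between tags as text nodes. The markup is the three-div example from the text, written with a closing </div> so that it is well-formed XML; describe_children is a hypothetical helper introduced only for this illustration.

```python
from xml.dom import minidom

# The three-div example, with a closing </div> so that it parses as XML.
markup = (
    "<div>\n"
    "  <div>hello</div>\n"
    "  <div>world</div>\n"
    "  <div>goodbye</div>\n"
    "</div>"
)

outer = minidom.parseString(markup).documentElement


def describe_children(parent):
    """Return (index, kind, value) for every direct child of parent."""
    rows = []
    for index, child in enumerate(parent.childNodes):
        if child.nodeType == child.TEXT_NODE:
            # Whitespace between tags shows up here as a text node.
            rows.append((index, "text", child.data))
        else:
            # For an element, report the text it contains.
            rows.append((index, "element", child.firstChild.data))
    return rows


children = describe_children(outer)
for index, kind, value in children:
    print(index, kind, repr(value))
```

The printed rows alternate between text and element, with the three inner divs at indices 1, 3, and 5.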

To get the <div> tag here:

<h1><a name="hello">Hello</a></h1>\n #<---newline
<div class="colmask">

...you could write:

h1.nextSibling.nextSibling

But to skip ALL the whitespace between tags, it's easier to use findNextSibling(), which allows you to specify the tag name of the next sibling you want to locate:

findNextSibling('div')
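
As a rough illustration of the difference, here is a minimal sketch that assumes Python 3 and the newer beautifulsoup4 package (imported as bs4), where the same operations are spelled next_sibling and find_next_sibling(). The two-line HTML string is a cut-down version of the markup in the question.

```python
from bs4 import BeautifulSoup  # assumes the beautifulsoup4 package is installed

# Cut-down markup: an <h1>, a newline, then the <div> we want.
html = '<h1><a name="hello">Hello</a></h1>\n<div class="colmask"></div>'
soup = BeautifulSoup(html, "html.parser")
h1 = soup.find("h1")

first = h1.next_sibling                 # the newline text node, not the <div>
second = h1.next_sibling.next_sibling   # two hops are needed to reach the <div>
div = h1.find_next_sibling("div")       # skips the text node and matches by tag name

print(repr(first))
print(second.name)
print(div is second)
```

Both routes end at the same tag, but find_next_sibling("div") keeps working if the amount of whitespace between the tags changes.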

Here is an example:

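# Written for Python 2 with BeautifulSoup 3 (the old "BeautifulSoup" package).
from BeautifulSoup import BeautifulSoup

with open('data2.txt') as f:
    html = f.read()

soup = BeautifulSoup(html)

# Each <h1> is followed, after a whitespace text node, by its colmask <div>.
for h1 in soup.findAll('h1'):
    colmask_div = h1.findNextSibling('div')

    # Every <div> inside the colmask <div> is one box with an <h4> title.
    for box_div in colmask_div.findAll('div'):
        h4 = box_div.find('h4')

        # One output line per <ul> in the box.
        for ul in box_div.findAll('ul'):
            print('{} : {} : {}'.format(h1.text, h4.text, ul.li.a.text))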



--output:--
Hello : My Favorite Number is : 1
Hello : My Favorite Number is : 2
Hello : My Favorite Number is : 3
Hello : My Favorite Number is : 4
Hello : Your Favorite Number is : 1
Hello : Your Favorite Number is : 2
Hello : Your Favorite Number is : 3
Hello : Your Favorite Number is : 4
Goodbye : Their Favorite Number is : 1
Goodbye : Their Favorite Number is : 2
Goodbye : Their Favorite Number is : 3
Goodbye : Their Favorite Number is : 4
Goodbye : Our Favorite Number is : 1
Goodbye : Our Favorite Number is : 2
Goodbye : Our Favorite Number is : 3
Goodbye : Our Favorite Number is : 4
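
For readers on Python 3, the same nested loop might look like the sketch below. It assumes the beautifulsoup4 package (imported as bs4) rather than the BeautifulSoup 3 module used above, and it runs on a shortened copy of the question's markup instead of reading data2.txt. The function name scrape is hypothetical.

```python
from bs4 import BeautifulSoup  # assumes the beautifulsoup4 package is installed

# Shortened copy of the markup from the question.
HTML = """\
<h1><a name="hello">Hello</a></h1>
<div class="colmask">
<div class="box box_1">
<h4><a>My Favorite Number is</a></h4>
<ul><li><a>1</a></li></ul>
<ul><li><a>2</a></li></ul>
</div>
<div class="box box_2">
<h4><a>Your Favorite Number is</a></h4>
<ul><li><a>1</a></li></ul>
</div>
</div>
<h1 name="goodbye"><a>Goodbye</a></h1>
<div class="colmask">
<div class="box box_1">
<h4><a>Their Favorite Number is</a></h4>
<ul><li><a>1</a></li></ul>
</div>
</div>
"""


def scrape(html):
    """Return one 'heading : box title : number' line per <ul>."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for h1 in soup.find_all("h1"):
        # Skip the whitespace text node and go straight to the next <div>.
        colmask_div = h1.find_next_sibling("div")
        for box_div in colmask_div.find_all("div"):
            h4 = box_div.find("h4")
            for ul in box_div.find_all("ul"):
                lines.append("{} : {} : {}".format(
                    h1.get_text(), h4.get_text(), ul.li.a.get_text()))
    return lines


lines = scrape(HTML)
for line in lines:
    print(line)
```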
