BeautifulSoup grab only once within a given tag


Problem description

I would like to grab a parent tag if it contains within it a marker, let's say MARKER. So for example, I have:

<a>
 <b>
  <c>
  MARKER
  </c>
 </b>
 <b>
  <c>
  MARKER
  MARKER
  </c>
 </b>
 <b>
  <c>
  stuff
  </c>
 </b>
</a>

I would like to grab:

 <b>
  <c>
  MARKER
  </c>
 </b>

 <b>
  <c>
  MARKER
  MARKER
  </c>
 </b>

My current code is:

for stuff in soup.find_all(text=re.compile("MARKER")):
        post = stuff.find_parent("b")

This works, sort of, however, it gives me:

 <b>
  <c>
  MARKER
  </c>
 </b>

 <b>
  <c>
  MARKER
  MARKER
  </c>
 </b>

 <b>
  <c>
  MARKER
  MARKER
  </c>
 </b>

The reason that this is happening is obvious, it's printing the entire containing tag once for every MARKER it finds, so the tag containing two MARKERs gets printed twice. However, I'm not sure how to how to tell BeautifulSoup to not search within given tag after it's done (I suspect that, specifically, cannot be done?) or otherwise prevent this, other than perhaps indexing everything to a dictionary and rejecting duplicates?
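
One way to avoid the duplicates, along the lines of the dictionary idea mentioned above, is to remember which parent tags have already been collected and skip repeats. A minimal sketch of that approach, assuming a small self-contained document (with the two markers placed in separate child tags so the duplicate-parent problem actually occurs):

import re
from bs4 import BeautifulSoup

# The two markers sit in separate child tags here so that the
# duplicate-parent problem actually occurs.
doc = """<a>
<b><c>MARKER</c></b>
<b><c>MARKER</c><c>MARKER</c></b>
<b><c>stuff</c></b>
</a>"""
soup = BeautifulSoup(doc, "html.parser")

seen = set()
posts = []
for stuff in soup.find_all(text=re.compile("MARKER")):
    post = stuff.find_parent("b")
    # Key on the tag's identity rather than its contents, so two
    # different <b> tags with identical text are still kept apart.
    if post is not None and id(post) not in seen:
        seen.add(id(post))
        posts.append(post)

for post in posts:
    print(post)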

EDIT: This is the specific case I'm working on that's giving me trouble, since for some reason the stripped-down version above doesn't actually produce the error. (The particular forum thread I'm fetching is a play-by-post, if anyone's curious.)

from bs4 import BeautifulSoup
import urllib.request
import re

url = 'http://forums.spacebattles.com/threads/asukaquest-3-starfish-eater.258271/page-179'
soup = urllib.request.urlopen(url).read()
sbsoup = BeautifulSoup(soup)

for stuff in sbsoup.find_all(text=re.compile("\[[Xx]\]")):
        post = stuff.find_parent("li")
        print(post.find("a", class_="username").string)
        print(post.find("blockquote", class_="messageText ugc baseHtml").get_text())

Solution

I wrote this with bs3; it might work with bs4, but the concepts are the same. Basically, the li tags have the user name in them under the "data-author" attribute, so you don't need to find a lower tag and then hunt for the parent li.

It seems that you are only interested in blockquote tags containing the "marker", so why not specify that?

Lambda functions are generally the most versatile way of querying BeautifulSoup.

# Import System libraries
import os
import sys
import re
import urllib2

# Import Custom libraries
#from BeautifulSoup import BeautifulSoup
from bs4 import BeautifulSoup

# The url variable to be searched
url = 'http://forums.spacebattles.com/threads/asukaquest-3-starfish-eater.258271/page-179'
# Create a request object
request = urllib2.Request(url)

# Attempt to open the request and read the response
try:
    response = urllib2.urlopen(request)
    the_page = response.read()
except Exception:
    the_page = ""

# If the response exists, create a BeautifulSoup from it
if(the_page):
    soup = BeautifulSoup(the_page)

    # Define the search location for the desired tags
    li_location = lambda x: x.name == u"li" and set([("class", "message   ")]) <= set(x.attrs)
    x_location = lambda x: x.name == u"blockquote" and bool(re.search("\[[Xx]\]", x.text))

    # Iterate through all the found lis
    for li in soup.findAll(li_location):
        # Print the author name
        print dict(li.attrs)["data-author"]
        # Iterate through all the found blockquotes containing the marker
        for xs in li.findAll(x_location):
            # Print the text of the found blockquote
            print xs.text
        print ""
