Python + BeautifulSoup: How to get wrapper out of HTML based on text?


Problem description

I would like to get the element that wraps a given piece of text. For example, given this HTML:

…
<div class="target">chicken</div>
<div class="not-target">apple</div>
…

Based on the text "chicken", I would like to get back <div class="target">chicken</div>.

Currently, I have the following to fetch the HTML:

import requests
from bs4 import BeautifulSoup

# url is the address of the page to fetch (defined elsewhere)
req = requests.get(url).text
soup = BeautifulSoup(req, 'html.parser')

Right now I just call soup.find_all('div', …) and loop through every div to find the wrapper I am looking for.

Without looping through every div, what would be the proper and most efficient way to fetch the wrapper in the HTML based on a given text?

Thank you in advance; I will be sure to accept/upvote the answer!

Solution

# coding: utf-8

html_doc = """
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <title> Last chicken leg on stock! Only 500$ !!! </title>
  </head>
  <body>
    <div id="layer1" class="class1">
        <div id="layer2" class="class2">
            <div id="layer3" class="class3">
                <div id="layer4" class="class4">
                    <div id="layer5" class="class5">
                      <p>My chicken has <span style="color:blue">ONE</span> leg :P</p>
                        <div id="layer6" class="class6">
                            <div id="layer7" class="class7">
                              <div id="chicken_surname" class="chicken">eat me</div>
                                <div id="layer8" class="class8">
                                </div>
                            </div>
                        </div>
                    </div>
                </div>
            </div>
        </div>
    </div>
  </body>
</html>"""

from bs4 import BeautifulSoup as BS
import re
soup = BS(html_doc, "lxml")  # the "lxml" parser requires the lxml package to be installed


# The (tag -> text) direction is straightforward: look the tag up by class or by id
tag = soup.find('div', class_="chicken")
tag2 = soup.find('div', {'id':"chicken_surname"})
print('\n###### by_cls:')
print(tag)
print('\n###### by_id:')
print(tag2)

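# The (text -> tag) direction can be tricky when you need to match a substring:
# a plain string matches only a text node that is exactly equal to it,
# so string="eat" finds nothing here, while a compiled regex matches substrings.
# Note that these calls return the matching text node, not the tag around it.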
tag_by_str = soup.find(string="eat me")
tag_by_sub = soup.find(string="eat")
tag_by_resub = soup.find(string=re.compile("eat"))
print('\n###### tag_by_str:')
print(tag_by_str)
print('\n###### tag_by_sub:')
print(tag_by_sub)
print('\n###### tag_by_resub:')
print(tag_by_resub)

# There is more than one way to access the underlying strings,
# and they behave differently - compare the results
tag = soup.find('p')

print('\n###### .text attr:')
print(tag.text, type(tag.text))

print('\n###### .strings generator:')
for s in tag.strings:   # .strings is a generator object
    print(s, type(s))

# Note that the .strings generator yields bs4.element.NavigableString objects,
# so we can use them to navigate, for example to access their parents:
print('\n###### NavigableString parents:')
for s in tag.strings:
    print(s.parent)

# or even their grandparents :)
print('\n###### grandparents:')
for s in tag.strings:
    print(s.parent.parent)
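
Applied to the HTML from the question, the same idea (find the text node, then step up to its parent) might look like the sketch below. It is only an illustration, not the answerer's exact code: the two-line snippet is copied from the question, the variable names are made up for this example, and "html.parser" is used instead of "lxml" so that no extra package is needed.

```python
import re
from bs4 import BeautifulSoup

# Snippet modeled on the HTML from the question (assumed, for illustration only).
html = """
<div class="target">chicken</div>
<div class="not-target">apple</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Option 1: find the text node, then step up to the tag that wraps it.
text_node = soup.find(string="chicken")
wrapper = text_node.parent if text_node is not None else None
print(wrapper)

# Option 2: match on tag name and text in a single find() call.
wrapper_direct = soup.find("div", string="chicken")
print(wrapper_direct)

# Option 3: a compiled regex matches substrings of the text.
wrapper_regex = soup.find("div", string=re.compile(r"chick"))
print(wrapper_regex)
```

A plain string passed as string= has to equal the whole text node, so text with surrounding whitespace or extra words usually needs the regex form (or a custom matching function).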
