Python + BeautifulSoup: How to get wrapper out of HTML based on text?

Views: 169 Published: 2018/6/19 15:53:35 Tags: python, html, css, python-2.7, beautifulsoup

Problem description

I would like to get the wrapper element of a given piece of text. For example, in this HTML:

    … <div class="target">chicken</div> <div class="not-target">apple</div> …

given the text "chicken", I would like to get back <div class="target">chicken</div>.

Currently, I fetch the HTML with the following:

    import requests
    from bs4 import BeautifulSoup

    req = requests.get(url).text
    soup = BeautifulSoup(req, 'html.parser')

and then just call soup.find_all('div',…) and loop through all the available div elements to find the wrapper I am looking for.

But without having to loop through every div, what would be the proper and most efficient way of fetching the wrapper in the HTML based on a defined text?

Thank you in advance; I will be sure to accept/upvote the answer!

Solution

    # coding: utf-8
    # print_function keeps the print() calls consistent on Python 2.7 and 3
    from __future__ import print_function

    import re

    from bs4 import BeautifulSoup as BS

    html_doc = """
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <title> Last chicken leg on stock! Only 500$ !!! </title>
    </head>
    <body>
    <div id="layer1" class="class1">
      <div id="layer2" class="class2">
        <div id="layer3" class="class3">
          <div id="layer4" class="class4">
            <div id="layer5" class="class5">
              <p>My chicken has <span style="color:blue">ONE</span> leg :P</p>
              <div id="layer6" class="class6">
                <div id="layer7" class="class7">
                  <div id="chicken_surname" class="chicken">eat me</div>
                  <div id="layer8" class="class8">
                  </div>
                </div>
              </div>
            </div>
          </div>
        </div>
      </div>
    </div>
    </body>
    </html>"""

    soup = BS(html_doc, "lxml")

    # the (tag -> text) direction is pretty obvious
    tag = soup.find('div', class_="chicken")
    tag2 = soup.find('div', {'id': "chicken_surname"})
    print('\n###### by_cls:')
    print(tag)
    print('\n###### by_id:')
    print(tag2)

    # but it can be tricky when you need to find a tag by substring
    tag_by_str = soup.find(string="eat me")               # exact match
    tag_by_sub = soup.find(string="eat")                  # None: a plain string must match exactly
    tag_by_resub = soup.find(string=re.compile("eat"))    # a regex matches substrings
    print('\n###### tag_by_str:')
    print(tag_by_str)
    print('\n###### tag_by_sub:')
    print(tag_by_sub)
    print('\n###### tag_by_resub:')
    print(tag_by_resub)

    # there is more than one way to access the underlying strings
    # the two are different - see the results
    tag = soup.find('p')
    print('\n###### .text attr:')
    print(tag.text, type(tag.text))

    print('\n###### .strings generator:')
    for s in tag.strings:  # .strings is a generator object
        print(s, type(s))

    # note that the .strings generator yields bs4.element.NavigableString elements,
    # so we can use them to navigate, for example accessing their parents:
    print('\n###### NavigableString parents:')
    for s in tag.strings:
        print(s.parent)

    # or even grandparents :)
    print('\n###### grandparents:')
    for s in tag.strings:
        print(s.parent.parent)
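
Applied to the markup in the question, the text-to-parent navigation shown above might look like the following sketch. The helper name `find_wrapper` and its `exact` flag are hypothetical, introduced only for illustration, and the sample HTML is the two-div snippet from the question rather than a fetched page. BeautifulSoup still walks the tree internally, but no hand-written loop over every div is needed.

```python
import re

from bs4 import BeautifulSoup

# Sample markup taken from the question (no network request needed).
html = '<div class="target">chicken</div><div class="not-target">apple</div>'
soup = BeautifulSoup(html, "html.parser")


def find_wrapper(soup, text, exact=True):
    """Hypothetical helper: return the tag enclosing the first matching text node."""
    # A plain string must equal the whole text node; a regex also matches substrings.
    pattern = text if exact else re.compile(re.escape(text))
    node = soup.find(string=pattern)  # a NavigableString, or None if nothing matches
    return node.parent if node is not None else None


wrapper = find_wrapper(soup, "chicken")
print(wrapper)

partial = find_wrapper(soup, "chick", exact=False)
print(partial)

# Passing a tag name together with string= returns the matching tag itself,
# provided the text is the tag's only child.
same_div = soup.find("div", string="chicken")
print(same_div)
```

If several elements can contain the same text, `soup.find_all(string=...)` returns every matching text node, and each node's `.parent` gives its wrapper.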
