How to ignore empty lines while using .next_sibling in BeautifulSoup4 in Python

Views: 3097 Published: 2016/8/5 19:18:58 Tags: python html-parsing beautifulsoup


Problem description

I want to remove duplicated placeholders from an HTML page, so I use BeautifulSoup's .next_sibling attribute. As long as the duplicates are on the same line, this works fine (see data). But sometimes there is an empty line between them, and in that case I want .next_sibling to ignore it (see data2).

This is the code:

from bs4 import BeautifulSoup, Tag
data = "<p>method-removed-here</p><p>method-removed-here</p><p>method-removed-here</p>"
data2 = """<p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>
"""
soup = BeautifulSoup(data)  # swap in data2 to reproduce the problem
string = 'method-removed-here'
for p in soup.find_all("p"):
    # remove every <p> that directly follows a placeholder <p>
    while isinstance(p.next_sibling, Tag) and p.next_sibling.name == 'p' and p.text == string:
        p.next_sibling.decompose()
print(soup)

Output for data is as expected:

<html><head></head><body><p>method-removed-here</p></body></html>

Output for data2 (this needs to be fixed):

<html><head></head><body><p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>
</body></html>

I couldn't find anything useful about this in the BeautifulSoup4 documentation, and .next_element is not what I am looking for either.

Solution

I was able to solve this with a workaround. The problem is described in the BeautifulSoup Google group, where they suggest running the HTML through a preprocessor first:

import re

def bs_preprocess(html):
    """Remove distracting whitespace and newline characters."""
    pat = re.compile(r'(^\s+)|(\s+$)', re.MULTILINE)
    html = re.sub(pat, '', html)        # remove leading and trailing whitespace on each line
    html = re.sub('\n', ' ', html)      # convert newlines to spaces
                                        # (this preserves newline delimiters)
    html = re.sub(r'\s+<', '<', html)   # remove whitespace before a tag
    html = re.sub(r'>\s+', '>', html)   # remove whitespace after a tag
    return html

It's not the best solution, but it works.
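As an alternative to preprocessing, the whitespace can be skipped while walking the tree. The following is a sketch under stated assumptions, not the approach suggested in the Google group: the helper names next_tag_sibling and remove_duplicate_placeholders are hypothetical, and "html.parser" is chosen only to keep the example self-contained.

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def next_tag_sibling(node):
    """Return the next sibling Tag, skipping whitespace-only text nodes.

    Returns None if the next meaningful sibling is not a Tag.
    (Hypothetical helper, not part of the bs4 API.)
    """
    sib = node.next_sibling
    while isinstance(sib, NavigableString) and not sib.strip():
        sib = sib.next_sibling
    return sib if isinstance(sib, Tag) else None


def remove_duplicate_placeholders(soup, placeholder):
    """Collapse runs of adjacent <p> tags whose text equals placeholder."""
    p = soup.find("p")
    while p is not None:
        if p.get_text() == placeholder:
            dup = next_tag_sibling(p)
            # Unlike the question's loop, this also checks the sibling's text.
            while dup is not None and dup.name == "p" and dup.get_text() == placeholder:
                dup.decompose()
                dup = next_tag_sibling(p)
        p = p.find_next("p")


data2 = """<p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>
"""
soup = BeautifulSoup(data2, "html.parser")
remove_duplicate_placeholders(soup, "method-removed-here")
print(soup)
```

The leftover newlines between the removed tags stay in the output; if that matters, the whitespace text nodes can be removed with extract() as they are skipped.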
