获取BeautifulSoup以正确解析php标记或忽略它们 [英] Get BeautifulSoup to correctly parse php tags or ignore them

查看:85
本文介绍了获取BeautifulSoup以正确解析php标记或忽略它们的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我目前需要解析许多.phtml文件,获取特定的html标记并向其添加自定义数据属性. 我正在使用python beautifulsoup解析整个文档并添加标签,这部分工作正常.

I currently need to parse a lot of .phtml files, get specific html tags and add a custom data attribute to them. I'm using python beautifulsoup to parse the entire document and add the tags, and this part works just fine.

问题在于,在视图文件(phtml)上也有被解析的标签.以下是输入输出的示例

The problem is that on the view files (phtml) there are tags that get parsed too. Below is an example of input-output

输入

<?php

$stars = $this->getData('sideBarCoStars', []);

if (!$stars) return;

$sideBarCoStarsCount = $this->getData('sideBarCoStarsCount');
$title = $this->getData('sideBarCoStarsTitle');
$viewAllUrl = $this->getData('sideBarCoStarsViewAllUrl');
$isDomain = $this->getData('isDomain');
$lazy_load = $lazy_load ?? 0;
$imageSrc = $this->getData('emptyImageData');
?>
<header>
    <h3>
        <a href="<?php echo $viewAllUrl; ?>" class="noContentLink white">
        <?php echo "{$title} ({$sideBarCoStarsCount})"; ?>
        </a>
    </h3>

输出

<?php
$stars = $this->
getData('sideBarCoStars', []);

if (!$stars) return;

$sideBarCoStarsCount = $this-&gt;getData('sideBarCoStarsCount');
$title = $this-&gt;getData('sideBarCoStarsTitle');
$viewAllUrl = $this-&gt;getData('sideBarCoStarsViewAllUrl');
$isDomain = $this-&gt;getData('isDomain');
$lazy_load = $lazy_load ?? 0;
$imageSrc = $this-&gt;getData('emptyImageData');
?&gt;
<header>
 <h3>
  <a class="noContentLink white" href="&lt;?php echo $viewAllUrl; ?&gt;">
   <?php echo "{$title} ({$sideBarCoStarsCount})"; ?>
  </a>
 </h3>

我尝试了不同的方法,但是在使beautifulsoup忽略PHP标记方面并没有成功. 是否有可能使html.parser自定义规则忽略或beautifulsoup? 谢谢!

I tried different ways, but didn't succeed on making beautifulsoup to ignore the PHP tags. Is it possible to get html.parser custom rules to ignore , or to beautifulsoup? Thanks!

推荐答案

您最好的选择是删除所有PHP元素,然后再将其提供给BeautifulSoup进行解析.这可以通过使用正则表达式来发现所有PHP部分并将其替换为安全的占位符文本来完成.

Your best bet is to remove all of the PHP elements before giving it to BeautifulSoup to parse. This can be done using a regular expression to spot all PHP sections and replace them with safe placeholder text.

使用BeautifulSoup完成所有修改后,即可替换PHP表达式.

After carrying out all of your modifications using BeautifulSoup, the PHP expressions can then be replaced.

由于PHP可以在任何地方,即也可以在带引号的字符串中,所以最好使用简单的唯一字符串占位符,而不是尝试将其包装在HTML注释中(请参见php_sig).

As the PHP can be anywhere, i.e. also within a quoted string, it is best to use a simple unique string placeholder rather than trying to wrap it in an HTML comment (see php_sig).

re.sub()可以被赋予功能.每次进行替换时,原始PHP代码都存储在数组(php_elements)中.然后进行相反的操作,即搜索php_sig的所有实例,并将其替换为php_elements中的下一个元素.如果一切顺利,php_elements最后应为空,否则,您的修改将导致占位符被删除.

re.sub() can be given a function. Each time the a substitution is made, the original PHP code is stored in an array (php_elements). Then the reverse is done afterwards, i.e. search for all instances of php_sig and replace them with the next element from php_elements. If all goes well, php_elements should be empty at the end, if it is not then your modifications have resulted in a place holder being removed.

from bs4 import BeautifulSoup
import re

html = """<html>
<body>

<?php 
$stars = $this->getData('sideBarCoStars', []);

if (!$stars) return;

$sideBarCoStarsCount = $this->getData('sideBarCoStarsCount');
$title = $this->getData('sideBarCoStarsTitle');
$viewAllUrl = $this->getData('sideBarCoStarsViewAllUrl');
$isDomain = $this->getData('isDomain');
$lazy_load = $lazy_load ?? 0;
$imageSrc = $this->getData('emptyImageData');
?>

<header>
    <h3>
        <a href="<?php echo $viewAllUrl; ?>" class="noContentLink white">
        <?php echo "{$title} ({$sideBarCoStarsCount})"; ?>
        </a>
    </h3>

</body>"""

php_sig = '!!!PHP!!!'
php_elements = []

def php_remove(m):
    php_elements.append(m.group())
    return php_sig

def php_add(m):
    return php_elements.pop(0)

# Pre-parse HTML to remove all PHP elements
html = re.sub(r'<\?php.*?\?>', php_remove, html, flags=re.S+re.M)

soup = BeautifulSoup(html, "html.parser")

# Make modifications to the soup
# Do not remove any elements containing PHP elements

# Post-parse HTML to replace the PHP elements
html = re.sub(php_sig, php_add, soup.prettify())

print(html)

这篇关于获取BeautifulSoup以正确解析php标记或忽略它们的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆