Prevent python lxml from adding a <p> tag to plain text

Views: 109 · Published: 2020/5/4 8:36:35 · Tags: python, lxml


Problem description

I don't want lxml to add anything to plain text; I left those values as they are on purpose. However, lxml wraps plain text in a <p> tag. Here a value might be HTML or plain text. I need lxml to process the HTML and leave the plain text alone.

import lxml.html
mixed = ['plaintext', '<a>HTML</a>', '<a>HTML</a>']
for text in mixed:
    html = lxml.html.fromstring(text)
    print(lxml.html.tostring(html))

Output: b'<p>plaintext</p>' b'<a>HTML</a>' b'<a>HTML</a>'

What I need: b'plaintext' b'<a>HTML</a>' b'<a>HTML</a>'

So I have a couple of questions.

How can I tell that a snippet is pure text, without any HTML tags (so that I don't have to pass it to lxml)? Or,
How can I stop lxml from adding a <p> tag to plain text?

Recommended answer

Try the w3lib library. It saved me from having to use the "re" module when dealing with an XML page where, for some reason, Scrapy selectors were behaving unreliably.

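from w3lib.html import remove_tags

def parse(self, response):
    # Spider callback; response.xpath() replaces the deprecated HtmlXPathSelector
    follow = response.xpath('//loc').re('.*type=videos.*')
    # Strip any remaining markup from each matched string
    follow = [remove_tags(x) for x in follow]
    # Note: remove_tags does not strip escape characters such as \n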
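
For the first question (telling whether a snippet contains any HTML tags before handing it to lxml), one possible approach is sketched below using only the standard library's html.parser. This is not from the original answer: the names _TagDetector, contains_html_tags and process_mixed are hypothetical, and the lambda is only a stand-in for whatever lxml processing you actually apply to HTML snippets.

```python
from html.parser import HTMLParser


class _TagDetector(HTMLParser):
    """Records whether the parser saw at least one start or end tag."""

    def __init__(self):
        super().__init__()
        self.found_tag = False

    def handle_starttag(self, tag, attrs):
        self.found_tag = True

    def handle_endtag(self, tag):
        self.found_tag = True


def contains_html_tags(snippet):
    """Return True if the snippet contains at least one HTML tag."""
    detector = _TagDetector()
    detector.feed(snippet)
    detector.close()
    return detector.found_tag


def process_mixed(snippets, process_html):
    """Apply process_html only to snippets with tags; pass plain text through."""
    return [process_html(s) if contains_html_tags(s) else s for s in snippets]


mixed = ['plaintext', '<a>HTML</a>', '<a>HTML</a>']
# str.upper is a stand-in for the real lxml-based processing step (assumption).
result = process_mixed(mixed, lambda s: s.upper())
print(result)
```

In real use, process_html would be a function that parses the snippet with lxml.html.fromstring and serializes it again; plain-text values never reach lxml, so no wrapper tag is added to them.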
