Prevent Python lxml from adding a <p> tag to plain text
Question
I don't want lxml to add anything to plain text. I left it as it is on purpose, but lxml wraps plain text in a <p> tag. Here the value might be HTML or plain text. I need lxml to process HTML and leave plain text alone.
import lxml.html
mixed = ['plaintext', '<a>HTML</a>', '<a>HTML</a>']
for text in mixed:
    html = lxml.html.fromstring(text)
    print(lxml.html.tostring(html))
Output:
b'<p>plaintext</p>'
b'<a>HTML</a>'
b'<a>HTML</a>'
What I need:
b'plaintext'
b'<a>HTML</a>'
b'<a>HTML</a>'
So I came up with two questions.
- How can I tell that a snippet is pure text, without any HTML tags (so that I don't have to pass it to lxml)? Or
- How can I stop lxml from adding a <p> tag to plain text?
Recommended answer
Try this library. It saved my butt from having to use the "re" module when dealing with an XML page where, for some dumb reason, Scrapy selectors were working wonky...
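from w3lib.html import remove_tags

# parse() is a method of a Scrapy spider; HtmlXPathSelector is Scrapy's older selector class
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    # Keep only the <loc> entries whose text matches the regex
    follow = hxs.xpath('//loc').re('.*type=videos.*')
    # Strip any remaining markup from each match
    follow = [remove_tags(x) for x in follow]
    # remove_tags won't remove characters like \n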