Prevent python lxml from adding a <p> tag to plain text

Question

I don't want lxml to add anything to plain text; I left those values as they are on purpose. lxml wraps plain text in a <p> tag. Here a value might be HTML or plain text. I need lxml to process the HTML and leave the plain text alone.

import lxml.html
mixed = ['plaintext', '<a>HTML</a>', '<a>HTML</a>']
for text in mixed:
    html = lxml.html.fromstring(text)
    print(lxml.html.tostring(html))

Output: b'<p>plaintext</p>' b'<a>HTML</a>' b'<a>HTML</a>'

What I need: b'plaintext' b'<a>HTML</a>' b'<a>HTML</a>'

So I came up with a couple of questions.

  1. How can I tell that a snippet is plain text, without any HTML tags (so that I don't have to pass it to lxml)? Or
  2. How can I stop lxml from adding a <p> tag to plain text?

Recommended answer

Try this library (w3lib)... It saved my butt from having to use the "re" module when dealing with an XML page where, for some dumb reason, Scrapy selectors were acting wonky...

from w3lib.html import remove_tags

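def parse(self, response):
    # Select the <loc> elements and keep only the regex matches (a list of strings).
    follow = response.xpath('//loc').re('.*type=videos.*')
    # Strip any HTML/XML tags from each matched string.
    follow = [remove_tags(x) for x in follow]
    # Note: remove_tags does not strip escape characters such as \n.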
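
For the first question (telling plain text apart from HTML before anything reaches lxml), here is a minimal sketch that uses only the standard library's html.parser. The names contains_markup and process are hypothetical helpers made up for this illustration, not part of lxml or w3lib, and the sketch assumes that "plain text" simply means "the parser saw no tag, comment, or declaration".

```python
from html.parser import HTMLParser


class _TagDetector(HTMLParser):
    """Records whether the parser saw any tag, comment, or declaration."""

    def __init__(self):
        super().__init__()
        self.found_markup = False

    def handle_starttag(self, tag, attrs):
        self.found_markup = True

    def handle_endtag(self, tag):
        self.found_markup = True

    def handle_comment(self, data):
        self.found_markup = True

    def handle_decl(self, decl):
        self.found_markup = True


def contains_markup(text):
    """Hypothetical helper: True if `text` contains at least one HTML tag."""
    detector = _TagDetector()
    detector.feed(text)
    detector.close()
    return detector.found_markup


def process(value, handle_html):
    """Hypothetical helper: call `handle_html` only on values that contain
    markup, and return plain text unchanged."""
    return handle_html(value) if contains_markup(value) else value


# Usage with lxml, as in the question (requires lxml to be installed):
#   import lxml.html
#   to_bytes = lambda v: lxml.html.tostring(lxml.html.fromstring(v))
#   results = [process(v, to_bytes) for v in mixed]
```

With this gate in place, plain-text values never reach lxml.html.fromstring, so nothing gets wrapped in a <p> tag. Another route that may be worth trying is lxml.html.fragments_fromstring, whose documentation says the first item of the returned list may be a plain string when the input starts with text.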
