How to extract elements and text only (filter out attributes, classes, inline CSS)


Question

I run this:

hxs.select('//*[@id="column_one"]/h2/following-sibling::div[1]').extract()

This is example output:

<div class="OneLinkNoTx">
    <strong>Location:</strong> 
    Abu Dhabi, United Arab Emirates
</div>
<div class="OneLinkNoTx">
    <strong>Travel Percentage:</strong> 
    None
</div>
<div align="justify">
    Salary: 100k
</div>

I want the output to look like this:

<div>
    <strong>Location:</strong> 
    Abu Dhabi, United Arab Emirates
</div>
<div>
    <strong>Travel Percentage:</strong> 
    None
</div>
<div>
    Salary: 100k
</div>

I just want the HTML elements without any HTML attributes. Is this possible with Scrapy/XPath?

Solution

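You can use lxml's Cleaner. Passing an empty set as safe_attrs tells it that no attribute is allowed, so every attribute is stripped: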

In [1]: import lxml.html

In [2]: import lxml.html.clean

In [3]: html = """<div class="OneLinkNoTx">
    <strong>Location:</strong> 
    Abu Dhabi, United Arab Emirates
</div>
<div class="OneLinkNoTx">
    <strong>Travel Percentage:</strong> 
    None
</div>
<div align="justify">
    Salary: 100k
</div>"""

In [4]: doc = lxml.html.fromstring(html)

In [5]: clean = lxml.html.clean.Cleaner(safe_attrs=frozenset())

In [6]: clean(doc)

In [7]: print(lxml.html.tostring(doc, encoding='unicode'))
<div><div>
    <strong>Location:</strong> 
    Abu Dhabi, United Arab Emirates
</div>
<div>
    <strong>Travel Percentage:</strong> 
    None
</div>
<div>
    Salary: 100k
</div></div>

The drawback is that lxml.html.fromstring wraps multiple top-level elements in an extra div. To avoid that, parse the markup as fragments and clean each one:

In [28]: elements = lxml.html.fragments_fromstring(html)

In [29]: list(map(clean, elements))
Out[29]: [None, None, None]

In [30]: print(''.join(lxml.html.tostring(el, encoding='unicode') for el in elements))
<div>
    <strong>Location:</strong> 
    Abu Dhabi, United Arab Emirates
</div>
<div>
    <strong>Travel Percentage:</strong> 
    None
</div>
<div>
    Salary: 100k
</div>

Note that clean modifies the elements in place, which is why each call returns None.
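If lxml's cleaner is not available (recent lxml releases ship it as the separate lxml_html_clean package), the same attribute stripping can be approximated with only the standard library. The sketch below is not the answer's method: it swaps in html.parser, and the names AttributeStripper and strip_attributes are hypothetical helpers introduced for illustration. It assumes reasonably well-formed HTML, drops comments and declarations, and lowercases tag names as html.parser does.

```python
from html.parser import HTMLParser


class AttributeStripper(HTMLParser):
    """Re-emit the markup it is fed with every attribute dropped.

    Hypothetical helper for illustration; not part of lxml or Scrapy.
    """

    def __init__(self):
        # convert_charrefs=False passes entities such as &amp; through
        # handle_entityref/handle_charref so they can be re-emitted as-is.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Ignore attrs entirely: this is where class, style, align, ... vanish.
        self.parts.append(f"<{tag}>")

    def handle_startendtag(self, tag, attrs):
        # Assumes HTML (not XHTML) output, so <br/> is written as <br>.
        self.parts.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def strip_attributes(html):
    """Return html with all element attributes removed."""
    parser = AttributeStripper()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


if __name__ == "__main__":
    sample = """<div class="OneLinkNoTx">
    <strong>Location:</strong>
    Abu Dhabi, United Arab Emirates
</div>
<div align="justify">
    Salary: 100k
</div>"""
    print(strip_attributes(sample))
```

Unlike the Cleaner, this does no sanitising beyond removing attributes (script and style elements are kept), and no wrapper div is added because the input is never turned into a tree. In a Scrapy callback it could be applied to each string returned by extract().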
