How to extract elements and text only (filter out attributes, classes, inline CSS)
Question
Running this selector:
hxs.select('//*[@id="column_one"]/h2/following-sibling::div[1]').extract()
produces output like this:
<div class="OneLinkNoTx">
<strong>Location:</strong>
Abu Dhabi, United Arab Emirates
</div>
<div class="OneLinkNoTx">
<strong>Travel Percentage:</strong>
None
</div>
<div align="justify">
Salary: 100k
</div>
I want the output to look like this instead:
<div>
<strong>Location:</strong>
Abu Dhabi, United Arab Emirates
</div>
<div>
<strong>Travel Percentage:</strong>
None
</div>
<div>
Salary: 100k
</div>
I only want the HTML elements, without any HTML attributes. Is this possible with Scrapy/XPath?
Solution
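You can use lxml's Cleaner (lxml.html.clean.Cleaner). Passing an empty safe_attrs set leaves no attribute on the whitelist, so every attribute is stripped.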
In [1]: import lxml.html
In [2]: import lxml.html.clean
In [3]: html = """<div class="OneLinkNoTx">
<strong>Location:</strong>
Abu Dhabi, United Arab Emirates
</div>
<div class="OneLinkNoTx">
<strong>Travel Percentage:</strong>
None
</div>
<div align="justify">
Salary: 100k
</div>"""
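In [4]: doc = lxml.html.fromstring(html)  # several top-level elements get wrapped in one <div>
In [5]: clean = lxml.html.clean.Cleaner(safe_attrs=frozenset())  # empty whitelist: no attribute survives
In [6]: clean(doc)  # cleans doc in place
In [7]: print(lxml.html.tostring(doc, encoding='unicode'))  # Python 3: ask for str rather than bytes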
<div><div>
<strong>Location:</strong>
Abu Dhabi, United Arab Emirates
</div>
<div>
<strong>Travel Percentage:</strong>
None
</div>
<div>
Salary: 100k
</div></div>
The drawback is that lxml.html.fromstring adds a wrapper div around the fragments. To avoid that, parse and clean each fragment separately:
In [28]: elements = lxml.html.fragments_fromstring(html)
In [29]: for element in elements:
    ...:     clean(element)  # explicit loop, because map() is lazy in Python 3
    ...:
In [30]: print(''.join(lxml.html.tostring(element, encoding='unicode') for element in elements))
<div>
<strong>Location:</strong>
Abu Dhabi, United Arab Emirates
</div>
<div>
<strong>Travel Percentage:</strong>
None
</div>
<div>
Salary: 100k
</div>
Note that clean modifies the elements in place.
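If lxml.html.clean is not available in your environment (recent lxml releases moved the cleaner into the separate lxml_html_clean package), the same attribute stripping can be sketched with only the standard library. This is a different technique from the Cleaner answer above, not a drop-in replacement: it re-emits the markup through html.parser instead of building a tree, it does none of the Cleaner's other sanitising, and it silently drops comments and doctype declarations. The names AttributeStripper and strip_attributes are made up for this illustration.

```python
from html.parser import HTMLParser


class AttributeStripper(HTMLParser):
    """Re-emit HTML with the attributes removed from every tag.

    Hypothetical helper for illustration; comments, doctypes and
    processing instructions are not re-emitted.
    """

    def __init__(self):
        # convert_charrefs=False reports entities such as &amp; separately,
        # so they can be written back exactly as they appeared.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # attrs is ignored on purpose: only the bare tag name is kept.
        self.parts.append("<%s>" % tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> stay self-closing.
        self.parts.append("<%s/>" % tag)

    def handle_endtag(self, tag):
        self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)


def strip_attributes(html):
    """Return html with all tag attributes removed."""
    parser = AttributeStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    sample = (
        '<div class="OneLinkNoTx"><strong>Location:</strong> Abu Dhabi</div>'
        '<div align="justify">Salary: 100k</div>'
    )
    print(strip_attributes(sample))
```

Because nothing is parsed into a tree, no wrapper div is added and unbalanced markup is passed through as it was written. Note that html.parser lowercases tag names.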