Strip style attributes with nokogiri
Problem description
I'm scraping an HTML page with Nokogiri and I want to strip out all style attributes.
How can I achieve this? (I'm not using Rails, so I can't use its sanitize method, and I don't want to use the sanitize gem because I want to remove by blacklist, not by whitelist.)
require 'nokogiri'
require 'open-uri'

html = URI.open(url)  # url holds the address of the page being scraped
doc = Nokogiri::HTML(html.read)
doc.css('.post').each do |post|
  puts post.to_s
end
=> <p><span style="font-size: x-large">bla bla <a href="http://torrentfreak.com/netflix-is-killing-bittorrent-in-the-us-110427/">statistica</a> blabla</span></p>
I want it to be
=> <p><span>bla bla <a href="http://torrentfreak.com/netflix-is-killing-bittorrent-in-the-us-110427/">statistica</a> blabla</span></p>
require 'nokogiri'
html = '<p class="post"><span style="font-size: x-large">bla bla</span></p>'
doc = Nokogiri::HTML(html)
doc.xpath('//@style').remove
puts doc.css('.post')
#=> <p class="post"><span>bla bla</span></p>
Edited to show that you can just call NodeSet#remove instead of having to use .each(&:remove).
Note that if you have a DocumentFragment instead of a Document, Nokogiri has a longstanding bug where searching from a fragment does not work as you would expect. The workaround is to use:
doc.xpath('@style|.//@style').remove