scrapy: Remove elements from an xpath selector

Views: 716 | Published: 2020/5/4 8:37:07 | Tags: xpath lxml scrapy


Problem description

I'm using scrapy to crawl a site with some odd formatting conventions. The basic idea is that I want all the text and subelements of a certain div, EXCEPT a few at the beginning and a few at the end.

Here's the gist.

<div id="easy-id">
  <stuff I don't want>
  text I don't want
  <div id="another-easy-id" more stuff I don't want>

  text I want
  <stuff I want>
  ...
  <more stuff I want>
  text I want
  ...

  <div id="one-more-easy-id" more stuff I *don't* want>
  <more stuff I *don't* want>

NB: The indenting implies closing tags, so everything here is a child of the first div, the one with id="easy-id".

Because text and nodes are mixed, I haven't been able to figure out a simple xpath selector to grab the stuff I want. At this point, I'm wondering if it's possible to retrieve the result from xpath as an lxml.etree.ElementTree, and then hack at it using the .remove() method.

Any suggestions?
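The .remove() idea raised in the question can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra dependencies (lxml.etree elements expose the same get(), remove(), text and tail members used here), the sample markup is a made-up, well-formed stand-in for the page, and trim_children is a hypothetical helper name. The one subtlety is that the text following an element is stored as that element's tail, so removing the start marker would also drop the wanted text right after it unless that tail is saved first.

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for the page structure described in the question.
SAMPLE = """<div id="easy-id">
  <span>stuff I don't want</span>
  text I don't want
  <div id="another-easy-id">more stuff I don't want</div>
  text I want
  <p>stuff I want</p>
  <b>more stuff I want</b>
  more text I want
  <div id="one-more-easy-id">more stuff I don't want</div>
  <span>even more stuff I don't want</span>
</div>"""


def trim_children(parent, start_id, end_id):
    """Drop children up to and including start_id, and from end_id onward."""
    children = list(parent)
    ids = [child.get("id") for child in children]
    start = ids.index(start_id)  # raises ValueError if the marker is missing
    end = ids.index(end_id)
    # The text right after the start marker is its tail; keep it by making
    # it the parent's leading text before the marker is removed.
    parent.text = children[start].tail
    for child in children[: start + 1] + children[end:]:
        parent.remove(child)  # the removed element's tail goes with it
    return parent


root = ET.fromstring(SAMPLE)
trim_children(root, "another-easy-id", "one-more-easy-id")
kept = " ".join("".join(root.itertext()).split())
print(kept)  # text I want stuff I want more stuff I want more text I want
```

This mutates the parsed tree in place, so it suits cases where the trimmed div is needed as a whole afterwards; a pure XPath selection leaves the tree untouched.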

Recommended answer

I am guessing you want everything from the div with ID another-easy-id up to but not including the one-more-easy-id div.

Stack Overflow has not preserved the indenting, so I do not know where the end of the first div element is, but I'm going to guess it ends before the text.

In that case you might want:

//div[@id='another-easy-id']/following::node()[not(preceding::div[@id='one-more-easy-id']) and not(@id='one-more-easy-id')]

If this is XHTML you'll need to bind some prefix, h, say, to the XHTML namespace and use h:div in both places.
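A minimal sketch of what binding a prefix looks like with lxml, assuming the document is parsed as namespaced XML. The sample markup is invented for illustration; the prefix name h is arbitrary and only needs to match between the expression and the namespaces mapping.

```python
from lxml import etree

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Invented XHTML sample; every element lives in the XHTML namespace.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div id="easy-id">
  <div id="another-easy-id">skip</div>
  text I want
  <div id="one-more-easy-id">skip</div>
</div>
</body></html>"""

root = etree.fromstring(SAMPLE)

# An unprefixed name test only matches elements in no namespace,
# so it finds nothing in a namespaced document.
unprefixed = root.xpath("//div[@id='another-easy-id']")

# Binding the prefix "h" to the XHTML namespace makes the same test match.
prefixed = root.xpath(
    "//h:div[@id='another-easy-id']",
    namespaces={"h": XHTML_NS},
)

print(len(unprefixed), len(prefixed))  # 0 1
```

When a page is parsed with an HTML parser instead (for example lxml.html), elements normally end up without a namespace, and the unprefixed form is the one that matches.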

Here's the syntax I went with in the end. (See comments for the reasons.)

//div[@id='easy-id']/div[@id='one-more-easy-id']/preceding-sibling::node()[preceding-sibling::div[@id='another-easy-id']]
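To see what this expression selects, here is a small sketch that runs it with lxml against a made-up page shaped like the outline in the question. The sample markup and the helper name wanted_pieces are assumptions for illustration, not part of the original thread. The expression starts from the end marker, walks back over its preceding siblings, and keeps only those that themselves have the start marker as a preceding sibling, which leaves exactly the nodes strictly between the two marker divs.

```python
from lxml import html

# Made-up stand-in for the page structure described in the question.
SAMPLE = """<div id="easy-id">
  <span>stuff I don't want</span>
  text I don't want
  <div id="another-easy-id">more stuff I don't want</div>
  text I want
  <p>stuff I want</p>
  <b>more stuff I want</b>
  more text I want
  <div id="one-more-easy-id">more stuff I don't want</div>
  <span>even more stuff I don't want</span>
</div>"""

XPATH = (
    "//div[@id='easy-id']/div[@id='one-more-easy-id']"
    "/preceding-sibling::node()"
    "[preceding-sibling::div[@id='another-easy-id']]"
)


def wanted_pieces(markup):
    """Return the stripped, non-empty text of each selected node."""
    tree = html.fromstring(markup)
    pieces = []
    for node in tree.xpath(XPATH):
        # lxml hands back text nodes as str subclasses and elements as
        # HtmlElement objects, so the two cases are handled separately.
        text = node if isinstance(node, str) else node.text_content()
        text = text.strip()
        if text:
            pieces.append(text)
    return pieces


pieces = wanted_pieces(SAMPLE)
print(pieces)
```

The same expression string can be passed to response.xpath() in a Scrapy spider; lxml is used directly here only to keep the sketch free of a running crawler.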
