BeautifulSoup: extract the XPath or CSS path of a node

Question

I want to extract some data from HTML and then highlight the extracted elements on the client side without modifying the source HTML. An XPath or CSS path looks like a good fit for this. Is it possible to extract an XPath or CSS path directly from BeautifulSoup?
Right now I mark the target element and then use the lxml lib to extract the XPath, which is very bad for performance. I know about BSXPath.py, but it does not work with BS4. Rewriting everything to use the native lxml lib is not acceptable due to complexity.

import random

import bs4
from lxml import etree


def get_xpath(soup, element):
  # Tag the target with a random marker attribute so lxml can find it again.
  _id = str(random.getrandbits(32))
  for e in soup():
    if e is element:  # identity, not equality: equal-looking tags must not match
      e['data-xpath'] = _id
      break
  else:
    raise LookupError('Cannot find {} in {}'.format(element, soup))
  # Serialize the whole soup and re-parse it with lxml (this is the slow part).
  root = etree.HTML(str(soup))
  matches = root.xpath('//*[@data-xpath="{}"]'.format(_id))
  assert len(matches) == 1
  return root.getroottree().getpath(matches[0])

soup = bs4.BeautifulSoup('<div id=i>hello, <b id=i test=t>world!</b></div>', 'html.parser')
xpath = get_xpath(soup, soup.div.b)
assert '/html/body/div/b' == xpath

Recommended answer

It's actually pretty easy to extract a simple CSS path or XPath yourself. The result is the same kind of path the lxml lib gives you.

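import bs4


def get_element(node):
  # CSS :nth-child counts element siblings of any tag, so skip text nodes here.
  # For XPath, count only preceding siblings with the same tag name instead.
  length = sum(1 for s in node.previous_siblings if isinstance(s, bs4.Tag)) + 1
  if length > 1:
    return '%s:nth-child(%s)' % (node.name, length)
  else:
    return node.name

def get_css_path(node):
  path = [get_element(node)]
  for parent in node.parents:
    # Stop at <body>, or at the document root if the parser added no <body>.
    if parent.name == 'body' or parent.parent is None:
      break
    path.insert(0, get_element(parent))
  return ' > '.join(path)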

soup = bs4.BeautifulSoup('<div></div><div><strong><i>bla</i></strong></div>')
assert get_css_path(soup.i) == 'div:nth-child(2) > strong > i'
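
The comment in get_element notes that XPath positions count only siblings of the same type. A minimal sketch of what the XPath counterpart might look like follows. The helper names xpath_step and get_xpath_from_soup are made up for this illustration and are not part of BeautifulSoup. The sketch assumes the soup was built from a full document (with html and body), and it adds an index only when a same-named sibling exists, which resembles the style of paths lxml's getpath returns.

```python
from bs4 import BeautifulSoup, Tag


def xpath_step(node):
  """Return one XPath step, such as 'div[2]', for a bs4 Tag (hypothetical helper)."""
  # XPath positions count only element siblings with the same tag name.
  before = sum(1 for s in node.previous_siblings
               if isinstance(s, Tag) and s.name == node.name)
  has_later_twin = any(isinstance(s, Tag) and s.name == node.name
                       for s in node.next_siblings)
  if before or has_later_twin:
    return '%s[%d]' % (node.name, before + 1)
  return node.name


def get_xpath_from_soup(node):
  """Build an absolute XPath by walking up from node (hypothetical helper)."""
  steps = [xpath_step(node)]
  for parent in node.parents:
    if parent.parent is None:  # reached the BeautifulSoup document object
      break
    steps.insert(0, xpath_step(parent))
  return '/' + '/'.join(steps)


html_doc = ('<html><body><div></div>'
            '<div><strong><i>bla</i></strong></div></body></html>')
soup = BeautifulSoup(html_doc, 'html.parser')
assert get_xpath_from_soup(soup.i) == '/html/body/div[2]/strong/i'
assert get_xpath_from_soup(soup.div) == '/html/body/div[1]'
```

The path describes the tree as BeautifulSoup's parser built it. A browser may normalize the same markup differently (for example by inserting tbody into tables), so it is worth checking the result against the real DOM before relying on it for highlighting.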
