BeautifulSoup: extract XPath or CSS path of a node

Question

I want to extract some data from HTML and then highlight the extracted elements on the client side without modifying the source HTML. XPath or a CSS path looks like a good fit for this. Is it possible to get the XPath or CSS path of a node directly from BeautifulSoup?
Right now I mark the target element with an attribute and then use the lxml library to extract its XPath, which is very bad for performance. I know about BSXPath.py, but it does not work with BS4. Rewriting everything to use the native lxml library is not acceptable because of the complexity.

import io
import random

import bs4
from lxml import etree


def get_xpath(soup, element):
  # Tag the target element with a random marker attribute so lxml can find it.
  _id = str(random.getrandbits(32))
  for e in soup():
    if e is element:
      e['data-xpath'] = _id
      break
  else:
    raise LookupError('Cannot find {} in {}'.format(element, soup))
  # Serialize the whole tree and re-parse it with lxml (this is the slow part).
  content = str(soup)
  doc = etree.parse(io.StringIO(content), etree.HTMLParser())
  element = doc.xpath('//*[@data-xpath="{}"]'.format(_id))
  assert len(element) == 1
  element = element[0]
  xpath = doc.getpath(element)
  return xpath

soup = bs4.BeautifulSoup('<div id=i>hello, <b id=i test=t>world!</b></div>', 'lxml')
xpath = get_xpath(soup, soup.div.b)
assert '/html/body/div/b' == xpath

Solution

It's actually pretty easy to extract a simple CSS path or XPath yourself. The result is the same as what the lxml library gives you.

import bs4


def get_element(node):
  # CSS :nth-child counts element siblings of any type, so skip text nodes.
  # (For XPath we would have to count only siblings of the same type!)
  length = sum(1 for s in node.previous_siblings if isinstance(s, bs4.Tag)) + 1
  if length > 1:
    return '%s:nth-child(%s)' % (node.name, length)
  else:
    return node.name

def get_css_path(node):
  path = [get_element(node)]
  for parent in node.parents:
    # Stop at <body>, or at the document root if the parser added no <body>.
    if parent.name in ('body', '[document]'):
      break
    path.insert(0, get_element(parent))
  return ' > '.join(path)

soup = bs4.BeautifulSoup('<div></div><div><strong><i>bla</i></strong></div>', 'lxml')
assert get_css_path(soup.i) == 'div:nth-child(2) > strong > i'
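
The same walk up the tree can also produce an XPath. The following is only a sketch, not part of the original answer: the names xpath_step and get_xpath are made up for illustration, and it assumes a positional index should be emitted only when a node has siblings with the same tag name, which is how lxml's getpath appears to format its paths. It uses the built-in html.parser, so the input here already contains html and body tags.

```python
import bs4


def xpath_step(node):
    # XPath positions count only element siblings with the same tag name.
    before = [s for s in node.previous_siblings
              if isinstance(s, bs4.Tag) and s.name == node.name]
    after = [s for s in node.next_siblings
             if isinstance(s, bs4.Tag) and s.name == node.name]
    if before or after:
        return '%s[%d]' % (node.name, len(before) + 1)
    return node.name


def get_xpath(node):
    steps = [xpath_step(node)]
    for parent in node.parents:
        # The BeautifulSoup object itself is the root and is named '[document]'.
        if parent.name == '[document]':
            break
        steps.insert(0, xpath_step(parent))
    return '/' + '/'.join(steps)


soup = bs4.BeautifulSoup(
    '<html><body><div></div><div><strong><i>bla</i></strong></div></body></html>',
    'html.parser')
print(get_xpath(soup.i))  # /html/body/div[2]/strong/i
```

Because this never serializes or re-parses the document, it avoids the marker attribute and the lxml round trip from the question. If the page is parsed with a different parser on the client side than on the server, the paths may not line up, so it is worth checking a few results against the browser.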
