HTML Parse tree using Python 2.7

Question

I am trying to build a parse tree for the HTML below, but I couldn't get it to work. I want to see what the tree structure looks like. Can anyone help me here?

# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link2">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

EDIT

Microsoft Windows [Version 6.1.7600]
Copyright (c) 2009 Microsoft Corporation.  All rights reserved.

C:\Users\matt>easy_install ete2
Searching for ete2
Reading http://pypi.python.org/simple/ete2/
Reading http://ete.cgenomics.org
Reading http://ete.cgenomics.org/releases/ete2/
Reading http://ete.cgenomics.org/releases/ete2
Best match: ete2 2.1rev539
Downloading http://ete.cgenomics.org/releases/ete2/ete2-2.1rev539.tar.gz
Processing ete2-2.1rev539.tar.gz
Running ete2-2.1rev539\setup.py -q bdist_egg --dist-dir c:\users\arupra~1\appdat
a\local\temp\easy_install-sypg3x\ete2-2.1rev539\egg-dist-tmp-zemohm

Installing ETE (A python Environment for Tree Exploration).

Checking dependencies...
numpy cannot be found in your python installation.
Numpy is required for the ArrayTable and ClusterTree classes.
MySQLdb cannot be found in your python installation.
MySQLdb is required for the PhylomeDB access API.
PyQt4 cannot be found in your python installation.
PyQt4 is required for tree visualization and image rendering.
lxml cannot be found in your python installation.
lxml is required from Nexml and Phyloxml support.

However, you can still install ETE without such functionality.
Do you want to continue with the installation anyway? [y,n]y
Your installation ID is: d33ba3b425728e95c47cdd98acda202f
warning: no files found matching '*' under directory '.'
warning: no files found matching '*.*' under directory '.'
warning: manifest_maker: MANIFEST.in, line 4: path 'doc/ete_guide/' cannot end w
ith '/'

warning: manifest_maker: MANIFEST.in, line 5: path 'doc/' cannot end with '/'

warning: no previously-included files matching '*.pyc' found under directory '.'

zip_safe flag not set; analyzing archive contents...
Adding ete2 2.1rev539 to easy-install.pth file
Installing ete2 script to C:\Python27\Scripts

Installed c:\python27\lib\site-packages\ete2-2.1rev539-py2.7.egg
Processing dependencies for ete2
Finished processing dependencies for ete2

Solution

This answer comes a bit late, but I'd still like to share it:

I used networkx and lxml (which I found allows a much more elegant traversal of the DOM tree). However, the tree layout depends on graphviz and pygraphviz being installed; networkx by itself would just distribute the nodes somehow on the canvas. The code is actually longer than required because I draw the labels myself to have them boxed (networkx can draw the labels, but it doesn't pass the bbox keyword on to matplotlib).

import networkx as nx
from lxml import html
import matplotlib.pyplot as plt

raw = "...your raw html"

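def traverse(parent, graph, labels):
    # Remember the tag name of this element, then add one edge per child
    # and recurse into it (depth-first).
    labels[parent] = parent.tag
    for node in parent:  # iterating an lxml element yields its children
        graph.add_edge(parent, node)
        traverse(node, graph, labels)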

G = nx.DiGraph()
labels = {}     # needed to map from node to tag
html_tag = html.document_fromstring(raw)
traverse(html_tag, G, labels)

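# Needs Graphviz and pygraphviz. Older networkx releases exposed this as
# nx.graphviz_layout; newer ones moved it to nx.nx_agraph.graphviz_layout.
pos = nx.nx_agraph.graphviz_layout(G, prog='dot')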

label_props = {'size': 16,
               'color': 'black',
               'weight': 'bold',
               'horizontalalignment': 'center',
               'verticalalignment': 'center',
               'clip_on': True}
bbox_props = {'boxstyle': "round, pad=0.2",
              'fc': "grey",
              'ec': "b",
              'lw': 1.5}

nx.draw_networkx_edges(G, pos, arrows=False)
ax = plt.gca()

for node, label in labels.items():
    x, y = pos[node]
    ax.text(x, y, label,
            bbox=bbox_props,
            **label_props)

ax.xaxis.set_visible(False)
ax.yaxis.set_visible(False)
plt.show()
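
To check what the traversal feeds into the graph before involving Graphviz, the same recursion can be run on a small snippet with the standard library only. This is a rough sketch under assumptions, not part of the original answer: xml.etree.ElementTree stands in for lxml (so the input has to be well-formed XML), and SAMPLE and collect_edges are hypothetical names used for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened version of the page from the question.
# ElementTree is an XML parser, so the markup must be well-formed.
SAMPLE = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p class='title'><b>The Dormouse's story</b></p>"
    "<p class='story'>Once upon a time <a class='sister' id='link1'>Elsie</a></p>"
    "</body></html>"
)


def collect_edges(parent, edges):
    """Append one (parent_tag, child_tag) pair per parent-child link."""
    for node in parent:  # iterating an element yields its child elements
        edges.append((parent.tag, node.tag))
        collect_edges(node, edges)
    return edges


root = ET.fromstring(SAMPLE)
edges = collect_edges(root, [])
for parent_tag, child_tag in edges:
    print("%s -> %s" % (parent_tag, child_tag))
```

Each printed pair corresponds to one graph.add_edge call in the traversal above. The sketch records tag names rather than element objects, so it is only useful for inspection: two different p elements collapse into the same name, which is exactly why the plotting code keys nodes by the element itself.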

Changes to the networkx code above if you prefer (or have) to use BeautifulSoup instead of lxml:

I'm no expert... I just looked at BS4 for the first time... but it works:

#from lxml import html
from bs4 import BeautifulSoup
from bs4.element import NavigableString

...

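def traverse(parent, graph, labels):
    # Nodes are keyed by hash() here. bs4 appears to derive a Tag's hash from
    # its markup, so identical tags would share a key; id(parent) avoids that.
    labels[hash(parent)] = parent.name
    for node in parent.children:
        # Skip text nodes; only tags become graph nodes.
        if isinstance(node, NavigableString):
            continue
        graph.add_edge(hash(parent), hash(node))
        traverse(node, graph, labels)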

...

#html_tag = html.document_fromstring(raw)
# Naming the parser explicitly avoids bs4's "no parser specified" warning.
soup = BeautifulSoup(raw, "html.parser")
html_tag = next(soup.children)

...
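
If Graphviz or pygraphviz cannot be installed, a text-only view of the tree may already be enough to see the structure. The following is a minimal sketch, not the approach used in the answer: it relies only on the standard library's HTMLParser, and TreePrinter, VOID_TAGS and SAMPLE are hypothetical names introduced for illustration. It assumes reasonably well-nested markup, since it tracks depth by counting opening and closing tags.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2.7

# Elements that never have a closing tag and therefore no children.
VOID_TAGS = frozenset([
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "param", "source", "track", "wbr",
])


class TreePrinter(HTMLParser):
    """Collect one indented line per tag, two spaces per nesting level."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: record it, keep the depth.
        self.lines.append("  " * self.depth + tag)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth = max(self.depth - 1, 0)


# Hypothetical, shortened version of the page from the question.
SAMPLE = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time <a href="http://example.com/elsie">Elsie</a></p>
</body></html>"""

printer = TreePrinter()
printer.feed(SAMPLE)
printer.close()
print("\n".join(printer.lines))
```

The output lists html at the left margin, with head, body and their descendants indented beneath it. Text nodes are ignored, matching the way the BeautifulSoup traversal skips NavigableString children.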
