XPATH - html with a lot of children


Problem description

Consider the HTML in the page variable below.

How do I access the td elements?

I want to access them with something like xpath('/table/tr/td/text()').

I don't want to have to spell out the other tr elements.

Unfortunately, the expression xpath('.//table/tr/tr/tr/td/text()') doesn't work either.

Python code:

import __future__
from lxml import html
import requests
from bs4 import BeautifulSoup

page = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>cv</title>
</head>
<body>

    <table>
        <tr>
            <tr>
                <tr>
                    <td>table1 td1</td>
                    <td>table1 td2</td>
                </tr>
            </tr>
        </tr>
    </table>

    <table>
        <tr>
            <tr>
                <tr>
                    <td>table2 td1</td>
                    <td>table2 td2</td>
                </tr>
            </tr>
        </tr>
    </table>

    <table>
        <tr>
            <tr>
                <tr>
                    <td>table3 td1</td>
                    <td>table3 td2</td>
                </tr>
            </tr>
        </tr>
    </table>
</body>
</html>
"""

soup = str(BeautifulSoup(page, 'html.parser'))
tree = html.fromstring(soup)

things = tree.xpath('.//table/tr/tr/tr/td/text()')

print(things)

for thing in things:
    print(thing)

print("That's all")

I want it from the root!

Solution

Use xpath //td/text():

things = tree.xpath('//td/text()')

The //td stands for "find any td element at any depth".

Works for me.
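As a minimal sketch of the above, assuming lxml is installed and using a shortened two-table version of the page from the question (the shortened markup is my own, not the asker's full document), the expression can be tried like this:

```python
from lxml import html

# Shortened stand-in for the question's page: two tables, each with
# the same oddly nested <tr> elements around the <td> cells.
page = """
<html><body>
  <table><tr><tr><tr>
    <td>table1 td1</td><td>table1 td2</td>
  </tr></tr></tr></table>
  <table><tr><tr><tr>
    <td>table2 td1</td><td>table2 td2</td>
  </tr></tr></tr></table>
</body></html>
"""

tree = html.fromstring(page)

# //td matches every td in the document, however deeply it is nested,
# so the intermediate tr elements never have to be spelled out.
things = tree.xpath('//td/text()')

for thing in things:
    print(thing)
```

Because the path does not name any of the tr levels, it keeps working even if the parser rearranges the nested rows while building the tree.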

Printing td elements grouped per table:

doc = html.fromstring(page)
for table_elm in doc.xpath("//table"):
    print("another table")
    things = table_elm.xpath('.//td/text()')
    print(things)

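Note that in this case the leading . in the XPath is significant: it makes the search relative to the current table element. Without it, //td would search the whole document again.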
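To see why the leading dot matters, here is a small sketch, again assuming lxml and a shortened two-table page of my own, that runs both forms of the expression on each table element and compares what they return:

```python
from lxml import html

# Shortened stand-in for the question's page (two tables only).
page = """
<html><body>
  <table><tr><tr><tr>
    <td>table1 td1</td><td>table1 td2</td>
  </tr></tr></tr></table>
  <table><tr><tr><tr>
    <td>table2 td1</td><td>table2 td2</td>
  </tr></tr></tr></table>
</body></html>
"""

doc = html.fromstring(page)

grouped = []          # td texts per table, using the relative path
absolute_counts = []  # number of hits per table, using the absolute path

for table_elm in doc.xpath('//table'):
    # './/td' starts at table_elm, so only this table's cells match.
    relative = table_elm.xpath('.//td/text()')
    # '//td' starts at the document root, even when called on table_elm.
    absolute = table_elm.xpath('//td/text()')

    grouped.append(relative)
    absolute_counts.append(len(absolute))
    print(relative, len(absolute))
```

The relative form yields only the cells of the current table, while the absolute form returns every cell in the document on each iteration.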
