XPath taking text with hyperlinks (Python)


Question

I'm new to XPath (and a relative beginner at Python in general). I'm trying to use it to extract the text of the first paragraph of a Wikipedia page.

Take, for instance, the Python page (https://en.wikipedia.org/wiki/Python_(programming_language)).

If I load it into a variable:

import requests
from lxml import html

page = requests.get("https://en.wikipedia.org/wiki/Python_(programming_language)")
tree = html.fromstring(page.content)

Then I know the desired paragraph is at the XPath /html/body/div[3]/div[3]/div[4]/div/p[1]

So I take that text into a variable:

first = tree.xpath("/html/body/div[3]/div[3]/div[4]/div/p[1]/text()")

This outputs:

[' is an ', ' ', ' for ', '. Created by ', ' and first released in 1991, Python has a design philosophy that emphasizes ', ', notably using ', '. It provides constructs that enable clear programming on both small and large scales.', '\n']

As you can see, I'm missing the words and sentences that are inside web links.

Recommended answer

Your XPath query matches only the text nodes that are direct children of that node. The text of the embedded links lives in other nodes (the link elements nested inside the paragraph) and is therefore excluded.
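
To make the distinction concrete, here is a minimal sketch that runs without network access. The HTML snippet and the variable names are made up for illustration; they stand in for the Wikipedia paragraph and are not taken from the real page.

```python
from lxml import html

# Hypothetical snippet standing in for the Wikipedia paragraph (an assumption).
snippet = '<p><b>Python</b> is an <a href="/wiki/X">interpreted</a> language.</p>'
p = html.fromstring(snippet)

# text() selects only the text nodes that are direct children of <p>.
direct_text = p.xpath("text()")

# The link text is a child of <a>, one level further down.
link_text = p.xpath("a/text()")

print(direct_text)  # [' is an ', ' language.']
print(link_text)    # ['interpreted']
```

The words wrapped in b and a tags are missing from the first list for the same reason the linked words are missing from the output in the question.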

  1. To descend, use //text() as suggested; this retrieves the text of every descendant node, starting from the node in question.

/html/body/div[3]/div[3]/div[4]/div/p[1]//text()

  2. Alternatively, select the node in question itself and call the lxml method text_content() on it, which returns the node's text including the text of all its child nodes.

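    from lxml import html
    import requests

    # Download the page and parse it into an element tree
    page = requests.get('https://en.wikipedia.org/wiki/Python_(programming_language)')
    tree = html.fromstring(page.content)

    # xpath() returns a list; take the first matching <p> element
    firstp = tree.xpath('/html/body/div[3]/div[3]/div[4]/div/p[1]')
    # text_content() returns the text of the element and all its descendants
    print(firstp[0].text_content())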
    
