用python请求解析HTML [英] Parsing HTML with python request

查看:176
本文介绍了用python请求解析HTML的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我不是编码员,但我需要实现一个简单的HTML解析器.

im not a coder but i need to implement a simple HTML parser.

通过简单的研究,我能够实现一个给定的示例:

After a simple research i was able to implement as a given example:

from lxml import html
import requests

page = requests.get('https://URL.COM')
tree = html.fromstring(page.content)

#This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
#This will create a list of prices
prices = tree.xpath('//span[@class="item-price"]/text()')

print 'Buyers: ', buyers
print 'Prices: ', prices

如何使用tree.xpath解析所有以".com.br"结尾并以://"开头的单词

How can i use tree.xpath to parse all words ending with ".com.br" and starting with "://"

推荐答案

正如@nosklo在此处指出的那样,您正在寻找href标记和关联的链接.解析树将由html元素本身组织,您可以通过专门搜索这些元素来查找文本.对于网址,它看起来像这样(使用python 3.6中的lxml库):

As @nosklo pointed out here, you are looking for href tags and the associated links. A parse tree will be organized by the html elements themselves, and you find text by searching those elements specifically. For urls, this would look like so (using the lxml library in python 3.6):

from lxml import etree
from io import StringIO
import requests

# Set explicit HTMLParser
parser = etree.HTMLParser()

page = requests.get('https://URL.COM')

# Decode the page content from bytes to string
html = page.content.decode("utf-8")

# Create your etree with a StringIO object which functions similarly
# to a fileHandler
tree = etree.parse(StringIO(html), parser=parser)

# Call this function and pass in your tree
def get_links(tree):
    # This will get the anchor tags <a href...>
    refs = tree.xpath("//a")
    # Get the url from the ref
    links = [link.get('href', '') for link in refs]
    # Return a list that only ends with .com.br
    return [l for l in links if l.endswith('.com.br')]


# Example call
links = get_links(tree)

这篇关于用python请求解析HTML的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆