Read, extract href from anchor tags and scan tag for a particular value/position


Question

I am trying to read the HTML from the data files below, extract the href= values from the anchor tags, scan for a tag that is in a particular position relative to the first name in the list, follow that link, repeat the process a number of times, and report the last name I find:

(URL: http://py4e-data.dr-chuck.net/known_by_Emir.html )

Find the link at position 18 (the first name is 1). Follow that link. Repeat this process 7 times. The answer is the last name that you retrieve. Hint: The first character of the name of the last page that you will load is: M

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

tags = soup('a')
for tag in tags:
    print(tag.get('href', None))

Answer

You can put the whole thing within a loop:

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = 'http://py4e-data.dr-chuck.net/known_by_Emir.html'

for _ in range(7):
    # Fetch the current page (the starting URL on the first pass,
    # then whichever link was extracted on the previous pass).
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    # Position 18 counting from 1 is index 17 counting from 0.
    url = soup.find_all('a')[17]['href']

print(url)

Output:

http://py4e-data.dr-chuck.net/known_by_Maya.html
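The answer relies on BeautifulSoup, but the core idea — collect every anchor's href in document order, then index into that list — can also be sketched with only the standard library's html.parser. This is a minimal, dependency-free illustration using a made-up inline page (the example.com URLs and NameN labels are placeholders, not the real py4e pages):

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect the href attribute of every <a> tag, in document order."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.hrefs.append(value)

# A tiny stand-in page; the real pages list many more anchors.
page = '<ul>' + ''.join(
    f'<li><a href="http://example.com/known_by_Name{i}.html">Name{i}</a></li>'
    for i in range(1, 21)
) + '</ul>'

parser = AnchorCollector()
parser.feed(page)

# "Position 18 (the first name is 1)" means index 17 in a 0-based list.
link = parser.hrefs[17]
print(link)  # http://example.com/known_by_Name18.html
```

In the actual exercise you would feed the downloaded HTML to the parser, take `hrefs[17]` as the next URL, and repeat that fetch-and-index step seven times, exactly as the loop in the answer does.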

