Web scraping: URL not changing while searching


Problem description


I am trying to scrape https://in.udacity.com/courses/all. I need to get the courses shown when a search query is entered. For example, if I enter "python", 17 courses come up as results, and I need to fetch only those courses. The search query is not passed as part of the URL (it is not a GET request), so the HTML content does not change either. How can I fetch those results without going through the entire course list? In the code below I fetch every course link, download its content, and search for the search term in that content, but it does not give me the result I expect.

import requests
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)  
    return u" ".join(t.strip() for t in visible_texts)

page = requests.get("https://in.udacity.com/courses/all")
soup = BeautifulSoup(page.content, 'lxml')
courses = soup.select('a.capitalize')

search_term = input("enter the course:")
for link in courses:
    #print("https://in.udacity.com" + link['href'])
    html = urllib.request.urlopen("https://in.udacity.com" + link['href']).read()

    if search_term.lower() in text_from_html(html).lower():
        print('\n'+link.text)
        print("https://in.udacity.com" + link['href'])

Answer

Using requests and BeautifulSoup:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://in.udacity.com/courses/all")
soup = BeautifulSoup(page.content, 'html.parser')
courses = soup.find_all("a", class_="capitalize")

for course in courses:
    print(course.text)

Output:

VR Foundations
VR Mobile 360
VR High-Immersion
Google Analytics
Artificial Intelligence for Trading
Python Foundation
.
.
.


As explained by @Martin Evans, the Ajax call behind the search is not doing what you think it is; it is probably just keeping a count of searches, i.e. how many users searched for "AI". The code below instead filters the already-fetched course list based on the keyword in search_term:

import requests
from bs4 import BeautifulSoup
import re

page = requests.get("https://in.udacity.com/courses/all")
soup = BeautifulSoup(page.content, 'html.parser')
courses = soup.find_all("a", class_="capitalize")
search_term = "AI"

for course in courses:
    if re.search(search_term, course.text, re.IGNORECASE):
        print(course.text)

Output:

AI Programming with Python
Blockchain Developer Nanodegree program
Knowledge-Based AI: Cognitive Systems
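One caveat with the regex approach above: if the user types a term containing regex metacharacters (e.g. "C++"), re.search will misinterpret it. A minimal sketch of the same keyword filter with re.escape applied, run here on a hard-coded sample list (the titles are illustrative, not fetched live):

```python
import re

# Illustrative course titles standing in for the scraped list
course_titles = [
    "AI Programming with Python",
    "VR Foundations",
    "Knowledge-Based AI: Cognitive Systems",
    "Google Analytics",
]

def filter_courses(titles, search_term):
    # re.escape neutralises regex metacharacters in user input;
    # re.IGNORECASE makes the match case-insensitive
    pattern = re.compile(re.escape(search_term), re.IGNORECASE)
    return [t for t in titles if pattern.search(t)]

print(filter_courses(course_titles, "ai"))
# → ['AI Programming with Python', 'Knowledge-Based AI: Cognitive Systems']
```

The same function can be applied directly to the `courses` tags scraped earlier by passing `[c.text for c in courses]`.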
