Web scraping: URL does not change while searching
Problem description
I am trying to scrape https://in.udacity.com/courses/all. I need to get the courses shown when a search query is entered. For example, if I enter python, 17 courses come up as results, and I need to fetch only those courses. The search query is not passed as part of the URL (it is not a GET request), so the HTML content does not change either. How can I fetch those results without going through the entire course list? In this code I fetch all the course links, get the content of each page, and search for the search term in that content, but it does not give me the result I expect.
import requests
import urllib.request
from bs4 import BeautifulSoup
from bs4.element import Comment

def tag_visible(element):
    # Skip text that a browser would not render.
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(body):
    # Join all visible text nodes of a page into one string.
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    return u" ".join(t.strip() for t in visible_texts)

page = requests.get("https://in.udacity.com/courses/all")
soup = BeautifulSoup(page.content, 'lxml')
courses = soup.select('a.capitalize')
# Lowercase the term because the page text is lowercased before comparison.
search_term = input("enter the course:").lower()
for link in courses:
    #print("https://in.udacity.com" + link['href'])
    html = urllib.request.urlopen("https://in.udacity.com" + link['href']).read()
    if search_term in text_from_html(html).lower():
        print('\n'+link.text)
        print("https://in.udacity.com" + link['href'])
Recommended answer
Using requests and BeautifulSoup:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://in.udacity.com/courses/all")
soup = BeautifulSoup(page.content, 'html.parser')
courses = soup.find_all("a", class_="capitalize")
for course in courses:
    print(course.text)
Output:
VR Foundations
VR Mobile 360
VR High-Immersion
Google Analytics
Artificial Intelligence for Trading
Python Foundation
.
.
.
As explained by @Martin Evans, the Ajax call behind the search is not doing what you think it is. It is probably keeping a count of searches, i.e. how many users searched for AI. The search basically filters the course list by keyword, as done here with search_term:
import requests
from bs4 import BeautifulSoup
import re
page = requests.get("https://in.udacity.com/courses/all")
soup = BeautifulSoup(page.content, 'html.parser')
courses = soup.find_all("a", class_="capitalize")
search_term = "AI"
for course in courses:
    if re.search(search_term, course.text, re.IGNORECASE):
        print(course.text)
Output:
AI Programming with Python
Blockchain Developer Nanodegree program
Knowledge-Based AI: Cognitive Systems
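Note that "Blockchain Developer Nanodegree program" appears in this output because re.search matches "ai" anywhere in the title, including inside "Blockchain". If only whole-word matches are wanted, one possible approach is sketched below. The helper name filter_titles and its whole_word flag are hypothetical, not part of the original answer, and the sample titles are copied from the outputs above so that the sketch runs without a network request. The whole-word option assumes the term starts and ends with a letter or digit.

```python
import re

def filter_titles(titles, search_term, whole_word=False):
    """Return the titles that contain search_term, ignoring case.

    Hypothetical helper for illustration; not part of the original answer.
    """
    # re.escape keeps characters such as "." or "+" in the term literal.
    pattern = re.escape(search_term)
    if whole_word:
        # \b anchors the match at word boundaries, so "AI" no longer
        # matches the "ai" inside "Blockchain".
        pattern = r"\b" + pattern + r"\b"
    regex = re.compile(pattern, re.IGNORECASE)
    return [title for title in titles if regex.search(title)]

# Sample titles taken from the outputs shown above.
titles = [
    "AI Programming with Python",
    "Blockchain Developer Nanodegree program",
    "Knowledge-Based AI: Cognitive Systems",
    "Python Foundation",
]

print(filter_titles(titles, "AI"))                   # substring match
print(filter_titles(titles, "AI", whole_word=True))  # whole-word match
```

With the scraped page, the same helper could be called as filter_titles([course.text for course in courses], search_term).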