How can I make sure that I am on the "About Us" page of a particular website
Question
Here's a snippet of code I am trying to use to retrieve all the links from a website, given the URL of its homepage.
import requests
from BeautifulSoup import BeautifulSoup

url = "https://www.udacity.com"
response = requests.get(url)
page = str(BeautifulSoup(response.content))

def getURL(page):
    start_link = page.find("a href")
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1: end_quote]
    return url, end_quote

while True:
    url, n = getURL(page)
    page = page[n:]
    if url:
        print url
    else:
        break
The result is
/uconnect
#
/
/
/
/nanodegree
/courses/all
#
/legal/tos
/nanodegree
/courses/all
/nanodegree
uconnect
/
/course/machine-learning-engineer-nanodegree--nd009
/course/data-analyst-nanodegree--nd002
/course/ios-developer-nanodegree--nd003
/course/full-stack-web-developer-nanodegree--nd004
/course/senior-web-developer-nanodegree--nd802
/course/front-end-web-developer-nanodegree--nd001
/course/tech-entrepreneur-nanodegree--nd007
http://blog.udacity.com
http://support.udacity.com
/courses/all
/veterans
https://play.google.com/store/apps/details?id=com.udacity.android
https://itunes.apple.com/us/app/id819700933?mt=8
/us
/press
/jobs
/georgia-tech
/business
/employers
/success
#
/contact
/catalog-api
/legal
http://status.udacity.com
/sitemap/guides
/sitemap
https://twitter.com/udacity
https://www.facebook.com/Udacity
https://plus.google.com/+Udacity/posts
https://www.linkedin.com/company/udacity
Process finished with exit code 0
I want to get the URL of only the "About Us" page of a website, which differs from site to site, for example:
for Udacity it is https://www.udacity.com/us
For artscape-inc it is https://www.artscape-inc.com/about-decorative-window-film/
I mean, I could try searching for keywords like "about" in the URLs, but as shown above, that approach would have missed Udacity. Could anyone suggest a good approach?
Answer
It would not be easy to cover every possible variation of an "About us" page link, but here is an initial idea that would work in both cases you've shown - check for "about" inside the href attribute and inside the text of a elements:
def about_links(elm):
    # use .get() so <a> tags without an href attribute don't raise KeyError
    return elm.name == "a" and ("about" in elm.get("href", "").lower() or
                                "about" in elm.get_text().lower())
Usage:
soup.find_all(about_links) # or soup.find(about_links)
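As a self-contained sketch of that usage (using BeautifulSoup 4 and a made-up HTML fragment standing in for a real homepage):

```python
from bs4 import BeautifulSoup

# Made-up fragment: one plain link, one "about" match by text, one by href.
html = """
<nav>
  <a href="/courses">Courses</a>
  <a href="/us">About Us</a>
  <a href="/about-decorative-window-film/">Our story</a>
</nav>
"""

def about_links(elm):
    # Match <a> tags whose href or visible text contains "about".
    return (elm.name == "a" and
            ("about" in elm.get("href", "").lower() or
             "about" in elm.get_text().lower()))

soup = BeautifulSoup(html, "html.parser")
matches = soup.find_all(about_links)
print([a["href"] for a in matches])  # ['/us', '/about-decorative-window-film/']
```

Note that "/us" is caught by its link text and "/about-decorative-window-film/" by its href, which is exactly why checking both covers the two sites from the question.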
What you can also do to decrease the number of false positives is to check only the "footer" part of the page. E.g., find the footer element, or an element with id="footer" or having a footer class.
Another idea, to sort of "outsource" the "about us" page definition, would be to google (from your script, of course) "about" + the webpage URL and grab the first search result.
As a side note, I've noticed you are still using BeautifulSoup version 3 - it is no longer developed or maintained, and you should switch to BeautifulSoup 4 as soon as possible. Install it via:
pip install --upgrade beautifulsoup4
and change the import to:
from bs4 import BeautifulSoup
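With BeautifulSoup 4, the hand-rolled string-scanning loop from the question also collapses to a few lines. A sketch (the helper name get_links is mine, not from the question):

```python
from bs4 import BeautifulSoup

def get_links(html):
    """Return the href of every <a href=...> in the page, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]

# Against the live site this would be:
#   import requests
#   links = get_links(requests.get("https://www.udacity.com").text)
print(get_links('<a href="/us">About</a> <a href="/jobs">Jobs</a>'))
# ['/us', '/jobs']
```

Letting the parser walk the tree avoids the off-by-one and quoting pitfalls of scanning the raw HTML string by hand.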