How can I make sure that I am on the About us page of a particular website

Problem description

Here's a snippet of code which I am trying to use to retrieve all the links from a website, given the URL of its homepage.

import requests
from BeautifulSoup import BeautifulSoup

url = "https://www.udacity.com"
response = requests.get(url)
page = str(BeautifulSoup(response.content))


def getURL(page):

    start_link = page.find("a href")
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1: end_quote]
    return url, end_quote

while True:
    url, n = getURL(page)
    page = page[n:]
    if url:
        print url
    else:
        break

The result is:

/uconnect
#
/
/
/
/nanodegree
/courses/all
#
/legal/tos
/nanodegree
/courses/all
/nanodegree
uconnect
/
/course/machine-learning-engineer-nanodegree--nd009
/course/data-analyst-nanodegree--nd002
/course/ios-developer-nanodegree--nd003
/course/full-stack-web-developer-nanodegree--nd004
/course/senior-web-developer-nanodegree--nd802
/course/front-end-web-developer-nanodegree--nd001
/course/tech-entrepreneur-nanodegree--nd007
http://blog.udacity.com
http://support.udacity.com
/courses/all
/veterans
https://play.google.com/store/apps/details?id=com.udacity.android
https://itunes.apple.com/us/app/id819700933?mt=8
/us
/press
/jobs
/georgia-tech
/business
/employers
/success
#
/contact
/catalog-api
/legal
http://status.udacity.com
/sitemap/guides
/sitemap
https://twitter.com/udacity
https://www.facebook.com/Udacity
https://plus.google.com/+Udacity/posts
https://www.linkedin.com/company/udacity

Process finished with exit code 0

I want to get the URL of only the "About us" page of a website, which differs in many cases, for example:

For Udacity it is https://www.udacity.com/us

For artscape-inc it is https://www.artscape-inc.com/about-decorative-window-film/

I mean, I could try searching for keywords like "about" in the URLs, but as shown I would have missed Udacity with that approach. Could anyone suggest a good approach?

Answer

It would not be easy to cover every possible variation of an "About us" page link, but here is an initial idea that works in both cases you've shown: check for "about" inside the href attribute and inside the text of the a elements:

def about_links(elm):
    # match <a> elements whose href or link text contains "about"
    return elm.name == "a" and ("about" in (elm.get("href") or "").lower() or
                                "about" in elm.get_text().lower())

Usage:

soup.find_all(about_links)  # or soup.find(about_links)
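
For completeness, here is a minimal sketch of how the soup object could be built with BeautifulSoup 4 and the requests library from your question (it reuses the about_links filter defined above):

import requests
from bs4 import BeautifulSoup

url = "https://www.udacity.com"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# about_links is the filter function defined above
for link in soup.find_all(about_links):
    print(link.get("href"))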

What you can also do to decrease the number of false positives is to check only the "footer" part of the page, e.g. find the footer element, or an element with id="footer" or a footer class, as sketched below.
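
A sketch of that footer restriction; whether the page actually has a footer element, an id="footer" or a footer class is an assumption:

# look for the footer in a few common places
footer = (soup.find("footer")
          or soup.find(id="footer")
          or soup.find(class_="footer"))

if footer is not None:
    about = footer.find(about_links)  # reuse the filter defined above
    if about is not None:
        print(about.get("href"))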

Another idea, to sort of "outsource" the definition of an "about us" page, would be to google (from your script, of course) "about" + the website's URL and grab the first search result.
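
A hypothetical sketch of that idea; the third-party googlesearch package and its search() generator are assumptions here (not part of your code), and Google may rate-limit automated queries:

# hypothetical: requires a third-party package exposing a search() generator
from googlesearch import search

def google_about_page(site_url):
    # take the first search result for the query: about <site url>
    for result in search("about " + site_url):
        return result
    return None

print(google_about_page("https://www.udacity.com"))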

As a side note, I've noticed you are still using BeautifulSoup version 3 - it is no longer developed or maintained, and you should switch to BeautifulSoup 4 as soon as possible. Install it via:

pip install --upgrade beautifulsoup4

and change the import to:

from bs4 import BeautifulSoup
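
With BeautifulSoup 4 you also no longer need the manual string scanning from getURL(); here is a short sketch of collecting every link on the page:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://www.udacity.com")
soup = BeautifulSoup(response.content, "html.parser")

# href=True skips <a> elements that have no href attribute
for link in soup.find_all("a", href=True):
    print(link["href"])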
