Resolving absolute path from relative path

Question

I'm making a web crawler and I'm trying to figure out how to resolve an absolute path from a relative path. I took two test sites: one built with Ruby on Rails (ROR) and one built with Pyro CMS.

In the latter, I found anchor tags whose href is "index.php". So if I'm currently crawling http://example.com/xyz, my crawler appends the link and produces http://example.com/xyz/index.php. The problem is that I should be appending to the root instead, i.e. the result should have been http://example.com/index.php. And if I then crawl http://example.com/xyz/index.php, I'll find another "index.php", which gets appended again.

In the ROR site, by contrast, relative paths start with '/', so I could easily tell that they are relative to the site root.

I can handle the case of index.php, but there might be so many rules that I need to take care of if I start doing it manually. I'm sure there's an easier way to get this done.

Solution

In Go, package path is your friend.

You can get the directory or folder from a path with path.Dir(), e.g.

p := "/xyz/index.php"
dir := path.Dir(p)
fmt.Println("dir:", dir) // Output: "/xyz"

If you find a link with a rooted path (one that starts with a slash), you can use it as-is.

If it is relative, you can join it with the dir above using path.Join(). Join() will also "clean" the resulting path:

p2 := path.Join(dir, "index.php")
fmt.Println("p2:", p2)
p3 := path.Join(dir, "./index.php")
fmt.Println("p3:", p3)
p4 := path.Join(dir, "../index.php")
fmt.Println("p4:", p4)

Output:

p2: /xyz/index.php
p3: /xyz/index.php
p4: /index.php
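
The two cases above (rooted link versus relative link) can be folded into one small function. The following is a minimal sketch, not part of the original answer: the name resolvePath is hypothetical, and it assumes the href is a plain path with no query string, fragment, scheme, or host.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// resolvePath is a hypothetical helper (the name is an assumption).
// current is the path of the page being crawled, href is the link found on it.
func resolvePath(current, href string) string {
	if strings.HasPrefix(href, "/") {
		// Rooted link: use it as-is.
		return href
	}
	// Relative link: join it with the directory of the current page.
	// path.Join also cleans the result.
	return path.Join(path.Dir(current), href)
}

func main() {
	fmt.Println(resolvePath("/xyz/index.php", "index.php"))    // /xyz/index.php
	fmt.Println(resolvePath("/xyz/index.php", "../index.php")) // /index.php
	fmt.Println(resolvePath("/xyz", "index.php"))              // /index.php
	fmt.Println(resolvePath("/xyz/index.php", "/about"))       // /about
}
```

The third call mirrors the case from the question: because path.Dir("/xyz") is "/", a link to index.php found on /xyz resolves to /index.php rather than /xyz/index.php.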

The "cleaning" tasks performed by path.Join() are done by path.Clean(), which you can also call directly on any path. They are:

  1. Replace multiple slashes with a single slash.
  2. Eliminate each . path name element (the current directory).
  3. Eliminate each inner .. path name element (the parent directory) along with the non-.. element that precedes it.
  4. Eliminate .. elements that begin a rooted path: that is, replace "/.." by "/" at the beginning of a path.
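
To see the four rules in action, here is a short sketch that feeds one input per rule to path.Clean(). The helper name cleanAll and the sample paths are illustrative assumptions, not taken from the answer.

```go
package main

import (
	"fmt"
	"path"
)

// cleanAll is a hypothetical helper that applies path.Clean to each input.
func cleanAll(paths []string) []string {
	out := make([]string, len(paths))
	for i, p := range paths {
		out[i] = path.Clean(p)
	}
	return out
}

func main() {
	inputs := []string{
		"/xyz//index.php",       // rule 1: multiple slashes
		"/xyz/./index.php",      // rule 2: "." element
		"/xyz/abc/../index.php", // rule 3: inner ".." element
		"/../index.php",         // rule 4: ".." at the start of a rooted path
	}
	for i, cleaned := range cleanAll(inputs) {
		fmt.Printf("%q -> %q\n", inputs[i], cleaned)
	}
	// The first three inputs clean to "/xyz/index.php", the last to "/index.php".
}
```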

And if you have a "full" URL (with scheme, host, etc.), you can use the url.Parse() function from the net/url package to obtain a *url.URL value from the raw URL string. It splits the URL into its components for you, so you can get the path like this:

uraw := "http://example.com/xyz/index.php"
u, err := url.Parse(uraw)
if err != nil {
    fmt.Println("Invalid url:", err)
    return // u is nil on error, so do not go on to read u.Path
}
fmt.Println("Path:", u.Path)

Output:

Path: /xyz/index.php
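
Putting the pieces together, a crawler could parse the page URL, resolve the link against its path, and rebuild the full URL. This is only a sketch under stated assumptions: absoluteURL is a hypothetical name, and the href is assumed to be a plain path without a query string, fragment, or its own scheme and host.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// absoluteURL is a hypothetical helper (not from the original answer).
// page is the full URL of the page being crawled, href a link found on it.
func absoluteURL(page, href string) (string, error) {
	u, err := url.Parse(page)
	if err != nil {
		return "", err
	}
	if strings.HasPrefix(href, "/") {
		// Rooted link: replace the path entirely.
		u.Path = href
	} else {
		// Relative link: join with the directory of the current path.
		u.Path = path.Join(path.Dir(u.Path), href)
	}
	// Drop parts of the page URL that should not carry over to the link.
	u.RawPath = ""
	u.RawQuery = ""
	u.Fragment = ""
	return u.String(), nil
}

func main() {
	for _, page := range []string{
		"http://example.com/xyz",
		"http://example.com/xyz/index.php",
	} {
		abs, err := absoluteURL(page, "index.php")
		if err != nil {
			fmt.Println("Invalid url:", err)
			continue
		}
		fmt.Println(page, "->", abs)
	}
}
```

For links that may carry a query string, a fragment, or a different host, the ResolveReference method on *url.URL in net/url performs RFC 3986 reference resolution and may be a better fit than joining paths by hand.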

Try all the examples on the Go Playground.
