Resolving absolute path from relative path
Problem description
I'm writing a web crawler and I'm trying to find a way to resolve a relative path to an absolute one. I'm using two test sites: one built with Ruby on Rails (RoR) and one built with Pyro CMS.
On the latter, I found href attributes containing the link "index.php". If I'm currently crawling http://example.com/xyz, my crawler appends the link and produces http://example.com/xyz/index.php. The problem is that the link should be appended to the root instead, i.e. the result should be http://example.com/index.php. And when I then crawl http://example.com/xyz/index.php, I find another "index.php", which gets appended again.
On the RoR site, if the relative path starts with '/', I can easily tell that it is relative to the site root.
I can handle the index.php case, but if I start doing this by hand there may be many more rules to take care of. I'm sure there's an easier way to get this done.
In Go, package path is your friend.
You can get the directory (folder) of a path with path.Dir(), e.g.:
p := "/xyz/index.php"
dir := path.Dir(p)
fmt.Println("dir:", dir) // Output: "/xyz"
If you find a link with a rooted path (one that starts with a slash), you can use it as-is.
If it is relative, you can join it with the dir above using path.Join(). Join() will also "clean" the resulting path:
p2 := path.Join(dir, "index.php")
fmt.Println("p2:", p2)
p3 := path.Join(dir, "./index.php")
fmt.Println("p3:", p3)
p4 := path.Join(dir, "../index.php")
fmt.Println("p4:", p4)
Output:
p2: /xyz/index.php
p3: /xyz/index.php
p4: /index.php
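The two rules above (use rooted links as-is, join relative links with the directory of the current path) can be combined into one small function. This is only a sketch: resolvePath is a hypothetical helper name, not something package path provides, and it assumes the link contains only a path (no scheme, host, query or fragment).

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// resolvePath is a hypothetical helper (not part of package path).
// currentPath is the path of the page being crawled, href is the link
// found on that page. Both are assumed to be plain paths.
func resolvePath(currentPath, href string) string {
	// A rooted link (starts with a slash) is used as-is.
	if strings.HasPrefix(href, "/") {
		return href
	}
	// A relative link is joined with the directory of the current path.
	// path.Join also cleans the result ("." and ".." elements).
	return path.Join(path.Dir(currentPath), href)
}

func main() {
	fmt.Println(resolvePath("/xyz", "index.php"))              // /index.php
	fmt.Println(resolvePath("/xyz/index.php", "index.php"))    // /xyz/index.php
	fmt.Println(resolvePath("/xyz/index.php", "../index.php")) // /index.php
	fmt.Println(resolvePath("/xyz/index.php", "/about"))       // /about
}
```

Note that path.Dir("/xyz") is "/", so a link found on http://example.com/xyz lands at the root, which is the behaviour the question asks for.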
The "cleaning" performed by path.Join() is done by path.Clean(), which you can of course also call manually on any path. The rules are:
- Replace multiple slashes with a single slash.
- Eliminate each . path name element (the current directory).
- Eliminate each inner .. path name element (the parent directory) along with the non-.. element that precedes it.
- Eliminate .. elements that begin a rooted path: that is, replace "/.." by "/" at the beginning of a path.
And if you have a "full" URL (with scheme, host, etc.), you can use the url.Parse() function to obtain a url.URL value from the raw URL string. It tokenizes the URL for you, so you can get the path like this:
uraw := "http://example.com/xyz/index.php"
u, err := url.Parse(uraw)
if err != nil {
	fmt.Println("Invalid url:", err)
	return // u is nil on error, so do not read u.Path below
}
fmt.Println("Path:", u.Path)
Output:
Path: /xyz/index.php
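As an alternative that the answer above does not cover: the net/url package also has a URL.ResolveReference() method, which resolves a link against a base URL and returns a full absolute URL, so the scheme and host do not have to be reattached by hand. The sketch below wraps it in a hypothetical helper named resolveLink and uses the example URLs from the question.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLink is a hypothetical helper: it resolves href, as found on the
// page at pageURL, into an absolute URL string.
func resolveLink(pageURL, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	// ResolveReference handles rooted links, relative links and ".." elements.
	return base.ResolveReference(ref).String(), nil
}

func main() {
	pages := []string{
		"http://example.com/xyz",
		"http://example.com/xyz/index.php",
	}
	for _, page := range pages {
		abs, err := resolveLink(page, "index.php")
		if err != nil {
			fmt.Println("Invalid url:", err)
			continue
		}
		fmt.Println(page, "->", abs)
	}
}
```

With these inputs, the link found on http://example.com/xyz resolves to http://example.com/index.php, and the one found on http://example.com/xyz/index.php resolves to itself, so the path no longer grows on each crawl.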
Try all the examples on the Go Playground.