Parsing list items from HTML with Go
I want to extract all list items (the content of each <li></li>) with Go. Should I use a regexp to get the <li> items, or is there another library for this?
My intention is to get a list or array in Go that contains all list items from a specific HTML web page. How should I do that?
You likely want to use the golang.org/x/net/html package. It's not in the Go standard packages, but instead in the Go Sub-repositories. (The sub-repositories are part of the Go Project but outside the main Go tree. They are developed under looser compatibility requirements than the Go core.)
There is an example in that documentation that may be similar to what you want.
If you need to stick with the Go standard packages for some reason, then for well-formed markup (documents that are also valid XML, such as XHTML) you can use encoding/xml. Note that it will not tolerate the loose "tag soup" of typical real-world HTML.
Both packages take an io.Reader for input. If you have a string or []byte variable, you can wrap it with strings.NewReader or bytes.NewReader to get an io.Reader.
For HTML you are more likely to be reading from an http.Response body (make sure to close it when done).
Perhaps something like:
import (
	"fmt"
	"net/http"

	"golang.org/x/net/html"
)

// printLinks fetches someURL and prints the href of every <a> element.
func printLinks(someURL string) error {
	resp, err := http.Get(someURL)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	doc, err := html.Parse(resp.Body)
	if err != nil {
		return err
	}
	// Recursively visit nodes in the parse tree.
	var f func(*html.Node)
	f = func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == "a" {
			for _, a := range n.Attr {
				if a.Key == "href" {
					fmt.Println(a.Val)
					break
				}
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			f(c)
		}
	}
	f(doc)
	return nil
}
Of course, parsing fetched web pages won't work for pages that modify their own contents with JavaScript on the client side.