Parsing list items from html with Go

Problem description

I want to extract all list items (content of each <li></li>) with Go. Should I use regexp to get the <li> items or is there any other library for this?

My intention is to get a list or array in Go that contains all list items from a specific HTML web page. How should I do that?

Solution

You likely want to use the golang.org/x/net/html package. It's not in the Go standard packages, but instead in the Go Sub-repositories. (The sub-repositories are part of the Go Project but outside the main Go tree. They are developed under looser compatibility requirements than the Go core.)

There is an example in that documentation that may be similar to what you want.

If you need to stick with the Go standard packages for some reason, then for reasonably well-formed HTML you can use encoding/xml.
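As a rough sketch of that standard-library approach (not from the original answer; the listItems name is made up, and it only copes with markup that is close to well-formed), something like this tokenizes the input and collects the text inside each <li>:

    // Minimal sketch using only the standard library (encoding/xml).
    // It only works for markup that is reasonably close to well-formed;
    // the function name listItems is just for illustration.
    package main

    import (
        "encoding/xml"
        "fmt"
        "io"
        "strings"
    )

    func listItems(r io.Reader) ([]string, error) {
        d := xml.NewDecoder(r)
        // Relax the XML rules enough for simple HTML.
        d.Strict = false
        d.AutoClose = xml.HTMLAutoClose
        d.Entity = xml.HTMLEntity

        var items []string
        var buf strings.Builder
        inLI := false
        for {
            tok, err := d.Token()
            if err == io.EOF {
                return items, nil
            }
            if err != nil {
                return nil, err
            }
            switch t := tok.(type) {
            case xml.StartElement:
                if t.Name.Local == "li" {
                    inLI = true
                    buf.Reset()
                }
            case xml.CharData:
                if inLI {
                    buf.Write(t)
                }
            case xml.EndElement:
                if t.Name.Local == "li" && inLI {
                    inLI = false
                    items = append(items, strings.TrimSpace(buf.String()))
                }
            }
        }
    }

    func main() {
        items, err := listItems(strings.NewReader("<ul><li>one</li><li>two</li></ul>"))
        fmt.Println(items, err) // [one two] <nil>
    }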

Both packages tend to use an io.Reader for input. If you have a string or []byte variable you can wrap it with strings.NewReader, bytes.NewReader, or a bytes.Buffer to get an io.Reader.
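For example (the variable names and the literal markup below are placeholders, not from the original answer):

    package main

    import (
        "bytes"
        "fmt"
        "strings"

        "golang.org/x/net/html"
    )

    func main() {
        const page = "<ul><li>one</li><li>two</li></ul>" // placeholder markup

        // From a string:
        doc, err := html.Parse(strings.NewReader(page))
        fmt.Println(doc != nil, err)

        // From a []byte (bytes.NewReader or a bytes.Buffer both satisfy io.Reader):
        doc, err = html.Parse(bytes.NewReader([]byte(page)))
        fmt.Println(doc != nil, err)
    }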

For HTML it's more likely your input will come from an http.Response body (make sure to close it when you're done with it). Perhaps something like:

    // Wrapped in a function here so the snippet compiles as a unit; it needs
    // "fmt", "net/http", and "golang.org/x/net/html" imported.
    func printLinks(someURL string) error {
        resp, err := http.Get(someURL)
        if err != nil {
            return err
        }
        defer resp.Body.Close()

        doc, err := html.Parse(resp.Body)
        if err != nil {
            return err
        }
        // Recursively visit nodes in the parse tree, printing every href.
        var f func(*html.Node)
        f = func(n *html.Node) {
            if n.Type == html.ElementNode && n.Data == "a" {
                for _, a := range n.Attr {
                    if a.Key == "href" {
                        fmt.Println(a.Val)
                        break
                    }
                }
            }
            for c := n.FirstChild; c != nil; c = c.NextSibling {
                f(c)
            }
        }
        f(doc)
        return nil
    }
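Since the question asks for the contents of <li> elements rather than links, the same kind of tree walk can be adapted. Here is a sketch under that assumption; collectText and listItems are illustrative names, not part of golang.org/x/net/html:

    // collectText concatenates all text nodes underneath n.
    func collectText(n *html.Node, sb *strings.Builder) {
        if n.Type == html.TextNode {
            sb.WriteString(n.Data)
        }
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            collectText(c, sb)
        }
    }

    // listItems walks the parse tree and returns the text of every <li>.
    func listItems(doc *html.Node) []string {
        var items []string
        var walk func(*html.Node)
        walk = func(n *html.Node) {
            if n.Type == html.ElementNode && n.Data == "li" {
                var sb strings.Builder
                collectText(n, &sb)
                items = append(items, strings.TrimSpace(sb.String()))
            }
            for c := n.FirstChild; c != nil; c = c.NextSibling {
                walk(c)
            }
        }
        walk(doc)
        return items
    }

With the *html.Node returned by html.Parse above, listItems(doc) gives one string per list item (note that nested <li> elements are reported both as part of their parent and on their own).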

Of course, parsing fetched web pages won't work for pages that modify their own contents with JavaScript on the client side.
