Getting text between all tags in a given html and recursively going through links


Problem description

I have read several Stack Overflow posts about getting all the words between the HTML tags of a page, and they left me confused. Some people recommend a regular expression aimed at a single tag, while others mention parsing techniques. I am basically trying to build a web crawler. So far I fetch the HTML of a link into a string in my program, and I also extract the links from that HTML, which is stored in my data string. Now I want to crawl in depth and extract the words from the pages behind all the links I extracted. I have two questions. First, how can I fetch the words on each web page while ignoring tags and JavaScript? Second, how would I crawl through the links recursively?

This is how I am getting the HTML into the string:

public void getting_html_code_of_link()
    {
        string urlAddress = "http://google.com";

        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(urlAddress);
        HttpWebResponse response = (HttpWebResponse)request.GetResponse();
        if (response.StatusCode == HttpStatusCode.OK)
        {
            Stream receiveStream = response.GetResponseStream();
            StreamReader readStream = null;
            if (response.CharacterSet == null)
                readStream = new StreamReader(receiveStream);
            else
                readStream = new StreamReader(receiveStream, Encoding.GetEncoding(response.CharacterSet));
            data = readStream.ReadToEnd();
            response.Close();
            readStream.Close();
            Console.WriteLine(data);
        }
    }

And this is how I am extracting the link references from the URL I give:

public void regex_ka_kaam()
    {
        StringBuilder sb = new StringBuilder();
        //Regex hrefs = new Regex("<a href.*?>");
        Regex http = new Regex("http://.*?>");

        foreach (Match m in http.Matches(data))
        {
            sb.Append(m.ToString());
            if (http.IsMatch(m.ToString()))
            {

                sb.Append(http.Match(m.ToString()));
                sb.Append("                                                                        ");
                //sb.Append("<br>");
            }
            else
            {
                sb.Append(m.ToString().Substring(1, m.ToString().Length - 1)); //+ "<br>");
            }
        }
        Console.WriteLine(sb);
    }

Solution

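Regex is not a good choice for parsing HTML files.

HTML is not a regular language, and real-world pages are often not strictly well-formed, so a pattern that works on one page tends to break on the next.

Use HtmlAgilityPack instead.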


This extracts all the absolute links from the web page:

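public List<string> getAllLinks(string webAddress)
{
    HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
    HtmlAgilityPack.HtmlDocument doc = web.Load(webAddress);

    // SelectNodes returns null when nothing matches, so guard against a page without links.
    var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
    if (anchors == null) return new List<string>();

    return anchors.Where(y => y.Attributes["href"].Value.StartsWith("http"))
                  .Select(x => x.Attributes["href"].Value)
                  .ToList();
}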
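
Note that the StartsWith("http") filter drops relative links such as "../about.html". If you want to follow those too, one option is to resolve each href against the page URL with System.Uri. This is only a rough sketch: the class name LinkResolver, the method Resolve, and the example.com URLs are made up for illustration and are not part of HtmlAgilityPack.

```csharp
using System;

public static class LinkResolver
{
    // Returns the absolute http(s) URL for 'href' as found on 'pageUrl',
    // or null when the link is not something a crawler can fetch.
    public static string Resolve(string pageUrl, string href)
    {
        // Uri.TryCreate combines the base URI with a relative (or absolute) href.
        if (!Uri.TryCreate(new Uri(pageUrl), href, out Uri absolute))
            return null;

        // Skip mailto:, javascript:, ftp: and similar schemes.
        if (absolute.Scheme != Uri.UriSchemeHttp && absolute.Scheme != Uri.UriSchemeHttps)
            return null;

        return absolute.AbsoluteUri;
    }

    public static void Main()
    {
        // Hypothetical inputs, purely for demonstration.
        Console.WriteLine(Resolve("http://example.com/docs/index.html", "../about.html"));
        Console.WriteLine(Resolve("http://example.com/docs/index.html", "mailto:someone@example.com") ?? "(skipped)");
    }
}
```

In getAllLinks you could then map every href through a helper like this and drop the null results, instead of filtering on the "http" prefix.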


This gets all the content of the HTML, excluding the tags:

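public string getContent(string webAddress)
{
    HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
    HtmlAgilityPack.HtmlDocument doc = web.Load(webAddress);

    // Take only text nodes: InnerText of every descendant would repeat nested text.
    // Text inside <script> and <style> is skipped.
    return string.Join(" ", doc.DocumentNode.Descendants()
        .Where(x => x.NodeType == HtmlAgilityPack.HtmlNodeType.Text
                    && x.ParentNode.Name != "script" && x.ParentNode.Name != "style")
        .Select(x => x.InnerText.Trim())
        .Where(t => t.Length > 0));
}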


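This crawls through all the links recursively, skipping pages it has already visited:

private readonly HashSet<string> visited = new HashSet<string>();

public void crawl(string seedSite)
{
    if (!visited.Add(seedSite)) return;             // skip pages already crawled
    string content = getContent(seedSite);          // gets all the content
    foreach (string link in getAllLinks(seedSite))  // gets all the links
        crawl(link);                                // recurse into each linked page
}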
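
In practice a recursive crawl also wants a depth limit, otherwise it will keep following links for as long as it finds new ones. Here is a minimal sketch of that idea, under a few assumptions: the names CrawlSketch, Crawl, maxDepth, getLinks and processPage are my own, and the in-memory dictionary in Main is a made-up link graph standing in for real pages so the example runs without network access.

```csharp
using System;
using System.Collections.Generic;

public static class CrawlSketch
{
    // Processes 'url', then recurses into its links until maxDepth is used up.
    // 'visited' prevents processing the same page twice and breaks link cycles.
    public static void Crawl(
        string url,
        int maxDepth,
        Func<string, IEnumerable<string>> getLinks,  // e.g. getAllLinks
        Action<string> processPage,                  // e.g. url => Console.WriteLine(getContent(url))
        HashSet<string> visited)
    {
        if (maxDepth < 0 || !visited.Add(url))
            return;

        processPage(url);

        foreach (string link in getLinks(url))
            Crawl(link, maxDepth - 1, getLinks, processPage, visited);
    }

    public static void Main()
    {
        // Hypothetical link graph; a.example and b.example link to each other on purpose.
        var graph = new Dictionary<string, string[]>
        {
            ["http://a.example"] = new[] { "http://b.example", "http://c.example" },
            ["http://b.example"] = new[] { "http://a.example", "http://c.example" },
            ["http://c.example"] = new string[0],
        };

        Crawl("http://a.example", 2,
              url => graph.TryGetValue(url, out var links) ? links : new string[0],
              url => Console.WriteLine("visiting " + url),
              new HashSet<string>());
    }
}
```

To use it against real pages, pass getAllLinks as getLinks and a delegate that calls getContent as processPage. You will probably also want a try/catch around the page load, since some links will time out or return errors.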
