How to get only the company address block from a website contact page


Problem description

How can I get only the company address block from a website's contact page?

I have tried this:

public void Extract_all_text_from_webpage(string filename)
{
    HtmlDocument document = new HtmlDocument();
    document.Load(new MemoryStream(File.ReadAllBytes(filename)));

    // Extract the visible text once and reuse it instead of re-walking the DOM.
    string text = ExtractViewableTextCleaned(document.DocumentNode);
    textBox1.Text += Environment.NewLine + text;
    // if (_addressDictionaries.AddressDictDuplicates.Contains(text))
    {
        listBox1.Items.Add(Environment.NewLine + text);
    }
}

public static string ExtractViewableTextCleaned(HtmlNode node)
{
    string textWithLotsOfWhiteSpaces = ExtractViewableText(node);
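    // Collapse whitespace runs, then strip the literal &nbsp; and &copy; entities
    // (text nodes keep entities undecoded).
    return _removeRepeatedWhitespaceRegex.Replace(textWithLotsOfWhiteSpaces, " ").Replace("&nbsp;", "").Replace("&copy;", "");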
}

public static string ExtractViewableText(HtmlNode node)
{
    StringBuilder sb = new StringBuilder();
    ExtractViewableTextHelper(sb, node);
    return sb.ToString();
}

private static void ExtractViewableTextHelper(StringBuilder sb, HtmlNode node)
{
    if (node.Name != "script" && node.Name != "style" && node.Name!="a")
    {
        if (node.NodeType == HtmlNodeType.Text)
        {
            AppendNodeText(sb, node);
        }

        foreach (HtmlNode child in node.ChildNodes)
        {
            ExtractViewableTextHelper(sb, child);
        }
    }
}

private static void AppendNodeText(StringBuilder sb, HtmlNode node)
{
    string text = ((HtmlTextNode)node).Text;
    if (!string.IsNullOrWhiteSpace(text))
    {
        sb.Append(Environment.NewLine + text);

        // If the last char isn't whitespace, add a space; otherwise words
        // that are separated only by tags would run together.
        bool endsWithWhitespace = text.EndsWith("\t") || text.EndsWith("\n") || text.EndsWith(" ") || text.EndsWith("\r");
        if (!endsWithWhitespace)
        {
            sb.Append(" ");
        }
    }
}

Recommended answer

Who knows what's wrong? That depends heavily on the content, the code is nearly hard-coded to it, and we don't see a sample of that content. Such code might be too fragile if something changes in the content, even if the change is only decorative.

You would probably be better off with a less ad-hoc approach. You would benefit a lot if you start from an HTML parser. If the HTML is well-formed XML, this is trivial, as .NET has more than enough XML libraries. But what if it is not? You may need a parser that can tolerate content that is not well-formed. I would advise looking at this one:
http://www.majestic12.co.uk/projects/html_parser.php

-SA
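
To make the "start from a parser" advice concrete, here is a minimal sketch, not a definitive solution. It assumes the contact page marks its address up explicitly, either with an <address> element or with itemprop="address"; many pages do neither, and then this returns nothing. It uses HtmlAgilityPack, the tolerant parser that the HtmlDocument and HtmlNode types in the question appear to come from, rather than the Majestic-12 parser linked above. The names AddressBlockFinder and FindMarkedUpAddresses are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // third-party NuGet package, assumed to be referenced

public static class AddressBlockFinder
{
    private static readonly Regex Whitespace = new Regex(@"\s+", RegexOptions.Compiled);

    // Returns the cleaned text of every element explicitly marked up as an address.
    // Assumption: the page uses <address> or itemprop="address".
    public static List<string> FindMarkedUpAddresses(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html); // tolerates markup that is not well-formed

        var results = new List<string>();

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection nodes = document.DocumentNode.SelectNodes(
            "//address | //*[@itemprop='address']");
        if (nodes == null)
        {
            return results;
        }

        foreach (HtmlNode node in nodes)
        {
            // Join the text nodes with spaces so that text separated only by
            // tags such as <br> does not run together.
            IEnumerable<string> parts = node.Descendants()
                .Where(n => n.NodeType == HtmlNodeType.Text)
                .Select(n => n.InnerText);

            // Text nodes keep entities such as &amp; undecoded, so decode them here.
            string text = HtmlEntity.DeEntitize(string.Join(" ", parts));
            text = Whitespace.Replace(text, " ").Trim();

            if (text.Length > 0)
            {
                results.Add(text);
            }
        }

        return results;
    }
}
```

Calling AddressBlockFinder.FindMarkedUpAddresses(File.ReadAllText(filename)) would give the candidate blocks for one saved page. Selecting by markup rather than by position in the extracted text is less sensitive to decorative changes, but it still depends on how each site is written, so a fallback for pages without such markup would be needed.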

