Copying website text using WebBrowser failed

Problem description

I am trying to copy the text of a website (with the URL supplied by the user) using the WebBrowser class, but it seems that none of the lines inside the thread are running. I also tried using WebBrowser without the thread, but that didn't work either. Any advice is welcome. This is my first time with these libraries, so many thanks for helping me get what I want.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net;
using System.Text;
using System.Web;
using System.Web.UI;
using System.Web.UI.WebControls;
using System.Xml;
using System.Windows.Forms;
using System.Threading;

public partial class _Default : Page
{
protected void Page_Load(object sender, EventArgs e)
{

}
private void runBrowserThread(Uri url)
{
    var th = new Thread(() => {
        var br = new WebBrowser();
        br.DocumentCompleted += browser_DocumentCompleted;
        br.Navigate(url);
        global::System.Windows.Forms.Application.Run();
        object n = new object();
        br.Document.ExecCommand("SelectAll",true,n);
        br.Document.ExecCommand("Copy",true,n);
        string text = Clipboard.GetText();
        MessageBox.Show(text, "Text");
    });
    th.SetApartmentState(ApartmentState.STA);
    th.Start();
}

void browser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    var br = sender as WebBrowser;
    if (br.Url == e.Url)
    {
        Console.WriteLine("Navigated to {0}", e.Url);
      // global::System.Windows.Forms.Application.ExitThread();   // Stops the thread
    }
}


public void url_input_Click(Object sender, EventArgs e)
{
    string StringFromTheInput = TextBox1.Text;
    System.Uri uri = new System.Uri(StringFromTheInput);
    runBrowserThread(uri);
}

public static Dictionary<string, int> WordCount(string content, int numWords = int.MaxValue)
{
    var delimiterChars = new char[] { ' ', ',', ':', '\t', '\"', '\r', '{', '}', '[', ']', '=', '/' };
    return content
        .Split(delimiterChars)
        .Where(x => x.Length > 0)
        .Select(x => x.ToLower())
        .GroupBy(x => x)
        .Select(x => new { Word = x.Key, Count = x.Count() })
        .OrderByDescending(x => x.Count)
        .Take(numWords)
        .ToDictionary(x => x.Word, x => x.Count);
}
}

Recommended answer

From the comments: how to extract the actual content from a page's HTML.

Edit

After discussing the issue with Israel Nehes, it appears that the solution is to retrieve the values of particular tags.

I have updated the code; hopefully this will help.

Retrieve the HTML, then use XPath path expressions to retrieve the nodes you are interested in, which would be the <p> and <a> tags.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // HtmlDocument below is the HtmlAgilityPack type (NuGet package)

class Program
{
    static public StringBuilder Content { get; set; }

    static void Main(string[] args)
    {
        string html;
        Content = new StringBuilder();
        string url = @"https://www.msn.com/en-gb/news/uknews/universal-credit-forcing-families-to-wait-months-for-help-to-pay-childcare-bills-mps-warn/ar-BBRjFtR?li=BBoPRmx";
        WebClient wc = new WebClient();
        HtmlDocument doc = new HtmlDocument();

        // Download the page and parse it.
        html = wc.DownloadString(url);
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches, so check before iterating.
        var allP = doc.DocumentNode.SelectNodes("//p");
        var allLink = doc.DocumentNode.SelectNodes("//a");
        if (allP != null)
        {
            foreach (var p in allP)
            {
                // Strip the markup and split the remaining text into words.
                var outerHtml = p.OuterHtml;
                List<string> potentialContent = Regex.Replace(outerHtml, "<[^>]*>", "").Split(' ').ToList();

                // Keep only nodes that contain more than one word.
                if (potentialContent.Count > 1)
                {
                    Content.Append(string.Join(" ", potentialContent));
                }
            }
        }

        if (allLink != null)
        {
            foreach (var a in allLink)
            {
                var outerHtml = a.OuterHtml;
                List<string> potentialContent = Regex.Replace(outerHtml, "<[^>]*>", "").Split(' ').ToList();

                if (potentialContent.Count > 1)
                {
                    Content.Append(string.Join(" ", potentialContent));
                }
            }
        }
    }
}

The Content property will then contain the text of those tags.
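
As a variation on the approach above, the sketch below reads each node's InnerText instead of stripping tags with a regular expression, and selects both tag types with a single XPath union. It assumes the parser is HtmlAgilityPack (the answer does not name it, but LoadHtml and DocumentNode.SelectNodes match that library). The TagTextSketch class, the ExtractTagText helper, and the inline sample markup are hypothetical names and data introduced only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // assumption: NuGet package HtmlAgilityPack is referenced

static class TagTextSketch
{
    // Hypothetical helper: returns the text of every <p> and <a> node in the markup.
    public static List<string> ExtractTagText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//p | //a");
        if (nodes == null)
        {
            return new List<string>();
        }

        return nodes
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim()) // decode entities such as &amp;
            .Where(t => t.Length > 0)                               // skip empty nodes
            .ToList();
    }

    static void Main()
    {
        // Inline sample markup so the sketch runs without network access.
        string html = "<html><body><p>Hello &amp; welcome</p><a href=\"#\">Read more</a></body></html>";
        foreach (string text in ExtractTagText(html))
        {
            Console.WriteLine(text);
        }
    }
}
```

As in the answer's code, a link nested inside a paragraph is reported twice, once as part of the paragraph text and once on its own. The resulting strings could be joined and passed to the WordCount method from the question.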
