How to get only plain text from HTML using C#?

Problem description

Hi guys.


I'm trying to create an app that finds the most frequently used words in a string. In my case, the string is HTML. I can already get the HTML from a URI, for example from "https://www.bbc.com/news/world-middle-east-57327591".


var url = "https://www.bbc.com/news/world-middle-east-57327591";
var httpClient = new HttpClient();
var html = await httpClient.GetStringAsync(url);


The html variable holds the same HTML as the page source, which is fine.

But how do I get rid of all the styles, scripts, and other additional information, and get only the plain text into a string variable?

I don't want my application to work only for BBC HTML, but for any HTML I can fetch from the web. My idea is that I should get the text from every element, such as <div>, <p>, <b>, <i>, <a>, because not all of the text is stored in <p> elements.

Solution

As per this answer, try the following:


using System.Net.Http;
using System.Text.RegularExpressions;

var url = "https://www.bbc.com/news/world-middle-east-57327591";
var httpClient = new HttpClient();
var html = await httpClient.GetStringAsync(url);
// Regex pattern that matches any HTML tag, including tags that span several lines
string pattern = @"<(.|\n)*?>";
// Replace every matched tag with an empty string; the text between tags
// (including the contents of <script> and <style>) is kept
var text = Regex.Replace(html, pattern, string.Empty);
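Note that a tag-only regex leaves the contents of <script> and <style> elements in the result, which the question wanted to get rid of. Below is a minimal sketch of one way to handle that, plus the word count the question is ultimately after. The class name HtmlText and the methods ToPlainText and TopWords are hypothetical helpers made up for this illustration, not part of the original answer. It is still regex-based, so it will not cope with every malformed page.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not from the original answer.
public static class HtmlText
{
    // Matches whole <script>...</script> and <style>...</style> blocks, contents included.
    private static readonly Regex ScriptOrStyle = new Regex(
        @"<(script|style)\b[^>]*>.*?</\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Matches HTML comments.
    private static readonly Regex Comment = new Regex(@"<!--.*?-->", RegexOptions.Singleline);

    // Matches any remaining tag.
    private static readonly Regex Tag = new Regex(@"<[^>]+>");

    // Matches runs of whitespace, including non-breaking spaces.
    private static readonly Regex Whitespace = new Regex(@"\s+");

    public static string ToPlainText(string html)
    {
        // Drop script/style blocks and comments first, so their contents do not survive.
        var text = ScriptOrStyle.Replace(html, " ");
        text = Comment.Replace(text, " ");
        // Replace tags with a space so that "<p>one</p><p>two</p>" does not become "onetwo".
        // Trade-off: a word split by an inline tag ("<b>Hel</b>lo") becomes two words.
        text = Tag.Replace(text, " ");
        // Turn entities such as "&amp;" into the characters they stand for.
        text = WebUtility.HtmlDecode(text);
        return Whitespace.Replace(text, " ").Trim();
    }

    // Returns the `count` most frequent words (runs of letters, case-insensitive).
    public static IEnumerable<KeyValuePair<string, int>> TopWords(string text, int count)
    {
        return Regex.Matches(text.ToLowerInvariant(), @"\p{L}+")
            .Cast<Match>()
            .GroupBy(m => m.Value)
            .OrderByDescending(g => g.Count())
            .ThenBy(g => g.Key, StringComparer.Ordinal)
            .Take(count)
            .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()));
    }
}
```

With the html string from the snippet above, HtmlText.ToPlainText(html) would give the text, and HtmlText.TopWords(text, 10) the ten most frequent words. For arbitrary pages from the web, a real HTML parser (HtmlAgilityPack and AngleSharp are two commonly used libraries) is generally more robust than regular expressions.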
