C# Get Only Certain Text From Wiki Page?


Problem description

OK, so now that I have gotten my program to speak and the UI no longer freezes (a big thanks to Marcin Kozub for that), I need to know how to retrieve the text from a Wikipedia article.
I have figured out one way to do that (I just used a WebBrowser control, took WebBrowserControl.Document.Body.InnerText and spoke that), but when I speak the text, I hear it read out the little citation links, the navigation bar links, and so on. So my question is: how do I remove all of that from the text I retrieve?
I don't want the little "citation needed" links, the navigation bar links, the bookmark links or the edit links to be read.
Here is what I have so far:

// WIKI is the name of my web browser control
synth.SpeakAsync(WIKI.Document.Body.InnerText);


This just gets the straight text from the page, with all the links and everything.
Here is the wiki link I am trying to read:
http://en.wikipedia.org/wiki/Forza_Horizon_2
And yes, I love to play video games, and I love to play the Forza games.
Oh, and how might one do a search on Wikipedia through C#?
The end goal is to be able to ask the program a question, or ask it for information on something, and have it read a Wikipedia article to me (kind of like the Star Trek computers).
Any help is very much appreciated.

Please and Thank you,
MasterCodeon

Solution

You can try this quickly converted C# code, which fills in a form and invokes a button click:

if (WebBrowser.Url.ToString() == "http://www.Yoursite.com/" && WebBrowser.Document.Body.InnerText.Contains("Wiki")) {
	if (WebBrowser.ReadyState == WebBrowserReadyState.Complete) {
		// Copy the text box contents into the page's input field whose HTML id is "username".
		WebBrowser.Document.GetElementById("username").SetAttribute("Value", textbox1.Text);
		// Do the same for the input field whose HTML id is "password".
		WebBrowser.Document.GetElementById("password").SetAttribute("Value", textbox2.Text);
	}
	// Invoke a click on the search button, provided the HTML id of the button is "Submit".
	// You will need to change these ids to match the Wikipedia source code.
	WebBrowser.Document.GetElementById("Submit").InvokeMember("click");
}



You will need to look at the Wikipedia source code (HTML) and see which div tags have ids and class names, e.g.

<div id="SomeId" class="SomeClass">Text you want</div>



It's those class names and ids you need to loop through in WebBrowser.Document.Body.InnerHtml to get the text you need, by checking the class name of each tag.

You can use GetElementsByTagName("a") to retrieve the HTML element collection of links. (This retrieves links because it looks for tags like

<a href="#">Text to get</a>

'a' being the tag name and 'href' being an attribute of that tag.)

If you want to loop through div tags, just change 'a' to 'div', and change the attribute you check to suit the tag you want to get.

You then need to cast that collection as you iterate through it and test an attribute of each 'a' tag, for example MainElement.GetAttribute("href") == "http://", which returns the links for which there is a match.

From there, you can use an If statement to check that the InnerText of the HTML element is not null, and assign it to a declared variable, where you can do as you please with the result.

I only have the code written out in VB.NET and don't have time to convert it, so you will need to convert it to C#. I have given you the pointers you need above, and you can try converting some of the code with the Telerik converter, but my guess is you may need to change some of it manually.

But I hope this quick post gives you a general idea of how to approach this.

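Dim MyString As String = Nothing
' Collect every <a> element whose href starts with "http://".
Dim myElement = (From MainElement As HtmlElement In WebBrowser.Document.GetElementsByTagName("a").Cast(Of HtmlElement)()
                 Where MainElement.GetAttribute("href").StartsWith("http://")
                 Select MainElement)
' Read the text of the first match, if there is one.
Dim first As HtmlElement = myElement.FirstOrDefault()
If first IsNot Nothing Then
    MyString = first.InnerText
End If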



Hope it helps.

Edit:

A link worth checking, which might also help you with this solution:

HTML Agility Pack, recommended by BillWoodruff


Hi,

I think you can use the Wikipedia API to get what you want.
The link below returns an XML file with the content of the desired page:
http://en.wikipedia.org/w/api.php?format=xml&action=query&prop=extracts&titles=Forza%20Horizon%202&redirects=true

Notice that the content in the output XML contains standard HTML tags, like 'p', 'i', 'b' and 'h2'. You can parse this content and take specific actions on certain tags, for example making a longer pause at each 'h2'.
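
As a rough sketch of that idea, the snippet below turns the HTML inside the extract element into plain text and puts a marker line before every 'h2' heading, so the speaking code can pause there. ExtractReader, ToSpeakableText and the "[PAUSE]" marker are made-up names for illustration, not part of the Wikipedia API, and the regex-based tag stripping is only an assumption that the extract HTML stays as simple as described above.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class ExtractReader
{
    // Hypothetical marker; the caller decides how long to pause when it sees it.
    public const string PauseMarker = "[PAUSE]";

    // Reads the first <extract> element of the API response and converts its
    // HTML payload to plain text, with PauseMarker on its own line before each h2.
    public static string ToSpeakableText(string apiXml)
    {
        XElement extract = XDocument.Parse(apiXml).Descendants("extract").FirstOrDefault();
        if (extract == null)
        {
            return string.Empty;
        }

        // The HTML is entity-encoded inside the XML, so Value yields real tags.
        string html = extract.Value;

        // Mark the start of every h2 heading.
        html = Regex.Replace(html, @"<h2[^>]*>", "\n" + PauseMarker + "\n", RegexOptions.IgnoreCase);
        // End paragraphs and headings with a line break.
        html = Regex.Replace(html, @"</(p|h[1-6])>", "\n", RegexOptions.IgnoreCase);
        // Drop every remaining tag.
        string text = Regex.Replace(html, @"<[^>]+>", string.Empty);
        // Decode entities that belonged to the HTML itself, such as &amp;.
        text = WebUtility.HtmlDecode(text);
        // Collapse runs of blank lines.
        return Regex.Replace(text, @"\n{2,}", "\n").Trim();
    }
}
```

You could then split the result on line breaks, speak each line, and wait a little longer whenever a line equals ExtractReader.PauseMarker.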

I have never used it before and didn't test it, but I found a similar question on StackOverflow:
http://stackoverflow.com/questions/1625162/get-text-content-from-mediawiki-page-via-api

And of course the Wikipedia API link:
http://www.mediawiki.org/wiki/API:Main_page

[Update 1]
For this solution you don't need a WebBrowser to get the XML (if you don't need to display the wiki page). Simply use this code:

var webClient = new WebClient();
var pageSourceCode = webClient.DownloadString("place_url_here");



Then use XmlDocument to access the nodes. There are many examples on codeproject.com :)

[Update 2]
You've mentioned that you want to search the Wiki for some topics. I did some research for you, and using the Wiki API for search is fairly easy. You just need to call this API URL:
http://en.wikipedia.org/w/api.php?action=opensearch&search=forza%20horizon

It returns results in a format like this:
["forza horizon",["Forza Horizon","Forza Horizon 2"]]

It means that there are two results for the search phrase. Next, you can display the results and let the user specify which one he/she wants to open. Finally, retrieve the Wiki content for that result and use speech synthesis to read it to the user :)
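
A minimal sketch of reading that response in C#, assuming System.Text.Json is available (.NET Core 3.0 or later, or the NuGet package on .NET Framework). OpenSearchParser and ParseTitles are hypothetical names. The code only relies on the titles being the array at index 1, as in the sample above, so it keeps working if the live API appends further arrays after the titles.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class OpenSearchParser
{
    // Returns the suggested titles, i.e. the strings in the array stored at
    // index 1 of the top-level JSON array. Returns an empty list otherwise.
    public static List<string> ParseTitles(string json)
    {
        var titles = new List<string>();
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            JsonElement root = doc.RootElement;
            if (root.ValueKind != JsonValueKind.Array || root.GetArrayLength() < 2)
            {
                return titles;
            }

            JsonElement suggestions = root[1];
            if (suggestions.ValueKind != JsonValueKind.Array)
            {
                return titles;
            }

            foreach (JsonElement item in suggestions.EnumerateArray())
            {
                if (item.ValueKind == JsonValueKind.String)
                {
                    titles.Add(item.GetString());
                }
            }
        }
        return titles;
    }
}
```

Feeding it the string downloaded with WebClient gives a list you can show to the user before fetching the chosen article.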

[Update 3 - Final I think ;)]
Wikipedia is multilingual, so you can do the same thing in your app. To get data in a specific language, change the short language code at the beginning of the link, e.g.:
http://pl.wikipedia.org/w/api.php?action=opensearch&search=forza%20horizon
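
To avoid hand-editing the links, both URLs from this answer can be built from a language code and a phrase. This is only a sketch: WikiUrls, Search and Extract are hypothetical helper names, and the query strings are copied from the links above. Uri.EscapeDataString takes care of turning spaces into %20.

```csharp
using System;

public static class WikiUrls
{
    // Builds the opensearch URL for a language edition such as "en" or "pl".
    public static string Search(string languageCode, string phrase)
    {
        return "http://" + languageCode + ".wikipedia.org/w/api.php"
             + "?action=opensearch&search=" + Uri.EscapeDataString(phrase);
    }

    // Builds the URL that returns the article extract as XML.
    public static string Extract(string languageCode, string title)
    {
        return "http://" + languageCode + ".wikipedia.org/w/api.php"
             + "?format=xml&action=query&prop=extracts&titles=" + Uri.EscapeDataString(title)
             + "&redirects=true";
    }
}
```

For example, WikiUrls.Search("pl", "forza horizon") produces the Polish link shown above.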

You can change the voice in your synthesizer too. Iterate through the installed voices on your computer to get information about them and display the available languages; VoiceInfo has a Culture property for this. The sample app you've downloaded contains everything you need to do that.

Cheers!

===EDIT: Fixed Broken Link===
CodingK

===EDIT: Fixed Broken Link ===
Marcin Kozub


OK, so here is the code I came up with from Marcin Kozub's solution:

// Requires: using System; using System.Net; using System.Text.RegularExpressions; using System.Xml;
var webClient = new WebClient();
// Escape the title so the spaces in "Forza Horizon 2" are sent as %20.
var pageSourceCode = webClient.DownloadString("http://en.wikipedia.org/w/api.php?format=xml&action=query&prop=extracts&titles=" + Uri.EscapeDataString("Forza Horizon 2") + "&redirects=true");

XmlDocument doc = new XmlDocument();
doc.LoadXml(pageSourceCode);

// The article text is in the first <extract> node.
var fnode = doc.GetElementsByTagName("extract")[0];
string ss = fnode.InnerText; // HTML text

// Strip every HTML tag to leave plain text.
Regex regex = new Regex("<[^>]*>");
string result = regex.Replace(ss, String.Empty); // plain text as output

TextBox.Text += result;


I was able to get the XML node I wanted by using (and modifying) the code from this answer:
store specific nodes from xml
Thanks for everyone's help!

