Using MediaWiki to pull text from a Wikia page, but it comes back in a big mess: is there a better way to pull text from each section?


Problem Description

I am developing an Android app that pulls information from a Wikia page and displays it in the app. I am currently pulling all Categories for navigation and have the app set up to display the page in a WebView, but I would like to just pull the info and format it myself instead of cheapening it by passing it to a WebView.

What I am using to get the text is: http://scottlandminecraft.wikia.com/api.php?format=xml&action=query&titles=ZackScott&prop=revisions&rvprop=content
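
For context, this is roughly how such a query could be fetched from an Android app. This is a minimal sketch; the class name and the background-thread handling are assumptions, not part of the original question:

    // Hypothetical helper, not from the question: fetches the raw API response as a string.
    // On Android this must run off the main thread (e.g. in a background executor).
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class WikiaFetcher {
        public static String fetchRaw(String apiUrl) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL(apiUrl).openConnection();
            conn.setRequestMethod("GET");
            StringBuilder sb = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    sb.append(line).append('\n');
                }
            } finally {
                conn.disconnect();
            }
            // The wiki markup of the page sits inside the <rev> element of this XML response.
            return sb.toString();
        }
    }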

My problem is that the text comes back in one big clump. Does anyone have any ideas on how to get it back in a more structured form so I could parse it from tags, or am I wasting my time trying to find that? If so, would it be better to parse the text I need by looking for identifiers in the text this pulls, or is there a better way?

Thanks for your input and time.

Recommended Answer

The easiest way, if you don't want to parse the wiki markup yourself, is to retrieve the parsed HTML version of the page and then process it with an HTML parser (like jsoup, as recommended by Hasham).
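
As a rough sketch of that approach (assuming the org.jsoup:jsoup dependency is available; the class name and the selectors are illustrative, not taken from the original answer):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class PageParser {
        // Prints the text of each section heading found in the rendered page HTML.
        public static void printSectionHeadings(String html) {
            Document doc = Jsoup.parse(html);
            // MediaWiki renders section headings as ordinary <h2>/<h3> elements.
            for (Element heading : doc.select("h2, h3")) {
                System.out.println("Section: " + heading.text());
            }
        }
    }

From there you could walk heading.nextElementSibling() to collect the paragraphs that belong to each section.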

Besides just scraping the normal wiki user interface (which will give you the page HTML wrapped in the navigation skin), there are two ways of getting the HTML text of a MediaWiki page:

  1. use the API with action=parse, which will return the page HTML wrapped in a MediaWiki API XML (or JSON / YAML / etc.) response, like the first example query after this list; or

  2. use the main index.php script with action=render, which will return just the page HTML, like the second example query after this list.
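
As a rough illustration (these exact URLs are assumptions built from the wiki host and page title in the question, not links from the original answer), such queries could look like:

    http://scottlandminecraft.wikia.com/api.php?format=xml&action=parse&page=ZackScott

    http://scottlandminecraft.wikia.com/index.php?title=ZackScott&action=render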

P.S. Since you mention sections in your question, note that the action=parse API module can return information about the sections on the page using prop=sections (or even prop=sections|text). For an example, see the API query below:
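
For illustration (again an assumed URL built from the page in the question, not the answer's original link), such a query could look like:

    http://scottlandminecraft.wikia.com/api.php?format=xml&action=parse&page=ZackScott&prop=sections

Once you have a section's index from that response, action=parse also accepts a section parameter, so a follow-up query can return the HTML of just that one section.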
