Parsing web JavaScript content to a string using Android

Question

I would like to read the content of a website into a string.

I started by using jsoup as follows:

private void getWebsite() {
    new Thread(new Runnable() {
        @Override
        public void run() {
            final StringBuilder builder = new StringBuilder();

            try {

                String query = "https://merhav.nli.org.il/primo-explore/search?tab=default_tab&search_scope=Local&vid=NLI&lang=iw_IL&query=any,contains,הארי פוטר";

                Document doc = Jsoup.connect(query).get();
                String title = doc.title();
                // Select anchors, since the loop below reads each element's href
                Elements links = doc.select("a[href]");

                builder.append(title).append("\n");

                for (Element link : links) {
                    builder.append("\n").append("Link : ").append(link.attr("href"))
                            .append("\n").append("Text : ").append(link.text());
                }
            } catch (IOException e) {
                builder.append("Error : ").append(e.getMessage()).append("\n");
            }

            runOnUiThread(new Runnable() {
                @Override
                public void run() {
                    tv_result.setText(builder.toString());

                }
            });
        }
    }).start();
}

However, the problem is that when I open this site in a web browser such as Chrome, one of the lines in the page source reads:

window.appPerformance.timeStamps['index.html']= Date.now();</script><primo-explore><noscript>JavaScript must be enabled to use the system</noscript><style>.init-message {

I have read that jsoup doesn't have a good solution for this case. Is there a good way to get the elements of this page even though it relies on JavaScript?

After trying the suggestions below, I used a WebView to load the URL and then parsed the result with jsoup as follows:

wb_result.getSettings().setJavaScriptEnabled(true);
MyJavaScriptInterface jInterface = new MyJavaScriptInterface();
wb_result.addJavascriptInterface(jInterface, "HtmlViewer");

wb_result.setWebViewClient(new WebViewClient() {
    @Override
    public void onPageFinished(WebView view, String url) {
        wb_result.loadUrl("javascript:window.HtmlViewer.showHTML ('<head>'+document.getElementsByTagName('html')[0].innerHTML+'</head>');");
    }
});

It did the job and did show me the element. However, unlike a browser, it still shows some attributes as unevaluated expressions rather than their results. For example:

ng-href="{{::$ctrl.getDeepLinkPath()}}"

Is there a way to parse and display the result as the browser does?

Thanks

Recommended answer

I'd suggest opening the Network tab in Chrome developer tools and then loading the URL. You'll see a lot of requests going back and forth.

Two that seem to contain relevant content are:

https://merhav.nli.org.il/primo_library/libweb/webservices/rest/primo-explore/v1/pnxs?blendFacetsSeparately=false&getMore=0&inst=NNL&lang=iw_IL&limit=10&newspapersActive=false&newspapersSearch=false&offset=0&pcAvailability=true&q=any,contains,%D7%94%D7%90%D7%A7%D7%99+%D7%A4%D7%95%D7%98%D7%A8&qExclude=&qInclude=&refEntryActive=false&rtaLinks=true&scope=Local&skipDelivery=Y&sort=rank&tab=default_tab&vid=NLI

which requires a token to access, and that token comes from:

.. which likely requires the JSessionId, which comes from:

https://merhav.nli.org.il/primo_library/libweb/webservices/rest/v1/configuration/NLI

.. so in order to replicate the chain of calls, you could use jsoup to make these (and any other relevant) HTTP GET requests and pull out the relevant HTTP headers (typically session, referer, accept and possibly some other cookie values).
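As a rough illustration of carrying session state from one request to the next, here is a minimal sketch. It uses the JDK's java.net.HttpURLConnection rather than jsoup so that it has no external dependencies. The class name SessionChain and its method names are hypothetical. How the token is passed to the pnxs endpoint is not described here, so the sketch only handles cookies, Referer and Accept.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: replays a chain of GET requests while carrying cookies forward.
final class SessionChain {

    // Turns Set-Cookie header values into one Cookie request header,
    // keeping only the leading "name=value" pair of each entry.
    static String toCookieHeader(List<String> setCookieValues) {
        StringBuilder header = new StringBuilder();
        for (String value : setCookieValues) {
            int semicolon = value.indexOf(';');
            String pair = (semicolon >= 0 ? value.substring(0, semicolon) : value).trim();
            if (pair.isEmpty()) {
                continue;
            }
            if (header.length() > 0) {
                header.append("; ");
            }
            header.append(pair);
        }
        return header.toString();
    }

    // Step 1: call an endpoint (e.g. the configuration URL) and collect the cookies it sets.
    static String fetchSessionCookies(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestProperty("Accept", "application/json");
        try {
            conn.getResponseCode(); // sends the request
            List<String> cookies = new ArrayList<>();
            for (Map.Entry<String, List<String>> e : conn.getHeaderFields().entrySet()) {
                if ("Set-Cookie".equalsIgnoreCase(e.getKey())) {
                    cookies.addAll(e.getValue());
                }
            }
            return toCookieHeader(cookies);
        } finally {
            conn.disconnect();
        }
    }

    // Step 2: replay a later request with the cookies and a Referer header attached.
    static String getWithSession(String url, String cookieHeader, String referer) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestProperty("Accept", "application/json");
        conn.setRequestProperty("Cookie", cookieHeader);
        conn.setRequestProperty("Referer", referer);
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }
}
```

On Android these calls must run off the main thread, as in the question's getWebsite() method. Whether the server accepts requests built this way is untested; compare each request against what the Network tab shows.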

It's not going to be straightforward, but you're essentially looking for a URL from the page inside one of the JSON responses returned by the network requests.

Once you know which request you want to recreate, you just have to work back up the list of requests and try to recreate each one it depends on.

This one is not easy and would take a lot of time to recreate. My advice, if you're going to attempt it: forget trying to parse the HTML, and instead rebuild the chain of three or so HTTP requests to the back end to get the relevant JSON, then parse that. You can often pick a website apart this way, but this one is a big job.
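A first step in rebuilding the search request is producing a correctly encoded query string. The following is a minimal sketch, assuming the parameter names and values seen in the pnxs URL above. The class name PnxsUrl and the method build are hypothetical. java.net.URLEncoder encodes commas as %2C, which a server would normally decode to the same value, but that is an assumption about this back end.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds the pnxs search URL for a given search term.
final class PnxsUrl {

    static final String BASE =
            "https://merhav.nli.org.il/primo_library/libweb/webservices/rest/primo-explore/v1/pnxs";

    static String build(String searchTerm) {
        // Parameters copied from the request seen in the Network tab; order is preserved.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("blendFacetsSeparately", "false");
        params.put("getMore", "0");
        params.put("inst", "NNL");
        params.put("lang", "iw_IL");
        params.put("limit", "10");
        params.put("newspapersActive", "false");
        params.put("newspapersSearch", "false");
        params.put("offset", "0");
        params.put("pcAvailability", "true");
        params.put("q", "any,contains," + searchTerm);
        params.put("qExclude", "");
        params.put("qInclude", "");
        params.put("refEntryActive", "false");
        params.put("rtaLinks", "true");
        params.put("scope", "Local");
        params.put("skipDelivery", "Y");
        params.put("sort", "rank");
        params.put("tab", "default_tab");
        params.put("vid", "NLI");

        StringBuilder url = new StringBuilder(BASE);
        char separator = '?';
        for (Map.Entry<String, String> p : params.entrySet()) {
            url.append(separator).append(encode(p.getKey())).append('=').append(encode(p.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    // Percent-encodes a value as UTF-8 (spaces become '+').
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM and on Android.
            throw new IllegalStateException(e);
        }
    }
}
```

Calling PnxsUrl.build with the search term from the question yields a URL that can be passed to whatever HTTP client replays the request, together with the session values the back end expects.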
