Is there anything faster than Jsoup for HTML scraping?

Question

So I'm building an app that displays an imageboard from a website I visit in a more user-friendly interface. There are a lot of problems with it at the moment, but the biggest one right now is fetching the images to display them.

The way I have it right now, the images are displayed in a GridView of size 12, mirroring the number of images on each page of the imageboard. I'm using Jsoup to scrape the page for the thumbnail image URLs to display in the GridView, as well as getting the URLs for the full size images to display when a user clicks on the thumbnail.

The problem right now is that it takes anywhere from 8-12 seconds on average for Jsoup to get the HTML page to scrape. I find this unacceptable, and I was wondering if there is any way to make it faster or if this is an inherent bottleneck that I can't do anything about.

Here's the code I'm using to fetch the page to scrape:

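// Needs org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element,
// org.jsoup.select.Elements and java.io.IOException (Pair is presumably android.util.Pair).
try {
    // Fetch and parse the page in one call
    Document doc = Jsoup.connect(url).get();
    // Thumbnails are the <img> tags whose src contains "/alt2/"
    Elements links = doc.select("img[src*=/alt2/]");
    for (Element link : links) {
        thumbURL = link.attr("src");
        // Derive the full-size image URL from the thumbnail URL
        linkURL = thumbURL.replace("/alt2/", "/").replace("s.jpg", ".jpg");
        imgSrc.add(new Pair<String, String>(thumbURL, linkURL));
    }
} catch (IOException e) {
    e.printStackTrace();
}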
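
To make the thumbnail-to-full-size mapping in the loop above easier to check on its own, here is a minimal sketch that isolates the two replace calls. The class name ThumbUrls, the method toFullSize and the example.com URL are hypothetical and only serve as illustration; the replacement strings are taken from the code above.

```java
public class ThumbUrls {
    // Hypothetical helper: applies the same two replacements as the loop above,
    // dropping the "/alt2/" path segment and the trailing "s" before ".jpg".
    static String toFullSize(String thumbUrl) {
        return thumbUrl.replace("/alt2/", "/").replace("s.jpg", ".jpg");
    }

    public static void main(String[] args) {
        // Placeholder URL, not a real imageboard address
        String thumb = "http://example.com/alt2/12345s.jpg";
        System.out.println(thumb + " -> " + toFullSize(thumb));
    }
}
```

Note that String.replace substitutes every occurrence, so a URL that contains "/alt2/" or "s.jpg" more than once would be changed in each place.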

Answer

I ran into the very same issue:

The Logcat output from my HTC One S clearly shows that the connection response only takes the first 4 seconds (3 connections in parallel). The parsing takes almost 30-40 seconds, which is a HUGE amount of time. Note that the HTC One S has a very fast dual-core CPU @ 1.4 GHz, so the problem is clearly not specific to the emulator.

02-27 14:11:55.278: DEBUG/MyActivity(10735): =c>
02-27 14:11:55.278: DEBUG/MyActivity(10735): =c>
02-27 14:11:55.278: DEBUG/MyActivity(10735): =c>
02-27 14:11:59.002: DEBUG/MyActivity(10735): <r=
02-27 14:11:59.012: DEBUG/MyActivity(10735): <r=
02-27 14:11:59.422: DEBUG/MyActivity(10735): <r=
02-27 14:12:33.949: DEBUG/MyActivity(10735): <d=
02-27 14:12:37.463: DEBUG/MyActivity(10735): <d=
02-27 14:12:38.294: DEBUG/MyActivity(10735): <d=

This is my code:

// Jsoup-Connection
Connection c = Jsoup.connect(urls[0]);
// Request timeout in ms
c.timeout(5000);
Connection.Response r = c.execute();
Log.d("MyActivity","<r= doInBackground ("+urls[0]+")");

// Get the actual Document
Document doc = r.parse();
Log.d("MyActivity","<d= doInBackground ("+urls[0]+")");
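
The log timestamps above separate the fetch phase from the parse phase. A minimal sketch of the same idea with explicit elapsed times, using only the standard library, might look like the following. PhaseTimer, Timed and time are hypothetical names, and the two lambdas in main are stand-ins for c.execute() and r.parse(), not real Jsoup calls.

```java
import java.util.concurrent.Callable;

public class PhaseTimer {
    // Result of a timed call: the returned value plus elapsed wall-clock milliseconds.
    static final class Timed<T> {
        final T value;
        final long millis;
        Timed(T value, long millis) { this.value = value; this.millis = millis; }
    }

    // Runs one phase and measures how long it took with System.nanoTime().
    static <T> Timed<T> time(Callable<T> phase) throws Exception {
        long start = System.nanoTime();
        T value = phase.call();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
        return new Timed<>(value, elapsedMs);
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the network phase (c.execute() in the code above)
        Timed<String> fetch = time(() -> "<html><body></body></html>");
        // Stand-in for the parse phase (r.parse() in the code above)
        Timed<Integer> parse = time(() -> fetch.value.length());
        System.out.println("fetch ms=" + fetch.millis + ", parse ms=" + parse.millis);
    }
}
```

Callable is used rather than Supplier so that a phase which throws a checked exception, such as an IOException from a network call, can be passed in directly.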

Update:

02-27 20:38:25.649: INFO/MyActivity(18253): !=c> 
02-27 20:38:27.511: INFO/MyActivity(18253): !<r= 
02-27 20:38:28.873: INFO/MyActivity(18253): !#d=

I got some new results. The previous ones were from running my app on Android with the debugger attached. The results posted now are from running without debugging mode (from the IntelliJ IDE). Is there any explanation for why debugging makes Jsoup so slow?

Running in debug mode on my i5 desktop machine, I see no performance penalty.

The reason my code is so slow on Android is definitely debug mode: it slows Jsoup down by a factor of 100.
