Memory Leak Issue With PDFBox


Problem Description

I am using PDFBox version 2.0.9 in my application. I have to parse large PDF files from the web. Following is the code I am using:

MimeTypeDetector Class

    @Getter
    @Setter
    class MimeTypeDetector {
        private ByteArrayInputStream byteArrayInputStream;
        private BodyContentHandler bodyContentHandler;
        private Metadata metadata;
        private ParseContext parseContext;
        private Detector detector;
        private TikaInputStream tikaInputStream;

        MimeTypeDetector(ByteArrayInputStream byteArrayInputStream) {
            this.byteArrayInputStream = byteArrayInputStream;
            this.bodyContentHandler = new BodyContentHandler(-1);
            this.metadata = new Metadata();
            this.parseContext = new ParseContext();
            this.detector = new DefaultDetector();
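            // CloseShieldInputStream prevents closing the TikaInputStream from also closing the underlying ByteArrayInputStream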
            this.tikaInputStream = TikaInputStream.get(new CloseShieldInputStream(byteArrayInputStream));
        }
    }

    
    private void crawlAndSave(String url, DomainGroup domainGroup)  {
        MimeTypeDetector mimeTypeDetector = null;
        try {
            String decodeUrl = URLDecoder.decode(url, WebCrawlerConstants.UTF_8);
            ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(HTMLFetcher.fetch(WebCrawlerUtil.encodeUrl(url)));
            mimeTypeDetector = new MimeTypeDetector(byteArrayInputStream);
            String contentType = getContentType(mimeTypeDetector);
            if (isPDF(contentType)) {
                crawlPDFContent(decodeUrl, mimeTypeDetector, domainGroup);
            } else if (isWebPage(contentType)) {
                // fetching HTML web Page Content
            } else {
                log.warn("Skipping URL::" + url + ".Not a supported crawler format");
                linksVisited.remove(url);
            }
        } catch (IOException e) {
            log.error("crawlAndSave:: Error occurred while decoding URL:" + url + " : " + e.getMessage());
            // some catch operation
        } finally {
            if (Objects.nonNull(mimeTypeDetector)) {
                IOUtils.closeQuietly(mimeTypeDetector.getByteArrayInputStream());
            }
        }
    }

    private String getContentType(MimeTypeDetector mimeTypeDetector) throws IOException {
        TikaInputStream tikaInputStream = mimeTypeDetector.getTikaInputStream();
        String contentType = mimeTypeDetector.getDetector().detect(tikaInputStream, mimeTypeDetector.getMetadata()).toString();
        tikaInputStream.close();
        return contentType;
    }

    private void crawlPDFContent(String url, MimeTypeDetector mimeTypeDetector, DomainGroup domainGroup) {
        try {
            PDFParser pdfParser = new PDFParser();
            pdfParser.parse(mimeTypeDetector.getByteArrayInputStream(), mimeTypeDetector.getBodyContentHandler(),
                    mimeTypeDetector.getMetadata(), mimeTypeDetector.getParseContext());
            // Some Database operation
        } catch (IOException | TikaException | SAXException e) {
            //Some Catch operation
            log.error("crawlPDFContent:: Error in crawling PDF Content" + " : " + e.getMessage());
        }
    }

HTMLFetcher Class

    public class HTMLFetcher {

    private HTMLFetcher() {
    }

    /**
     * Fetches the document at the given URL, using {@link URLConnection}.
     *
     * @param url
     * @return
     * @throws IOException
     */
    public static byte[] fetch(final URL url) throws IOException {

        TrustManager[] trustAllCerts = new TrustManager[]{new X509TrustManager() {
            public java.security.cert.X509Certificate[] getAcceptedIssuers() {
                return null;
            }

            public void checkClientTrusted(X509Certificate[] certs, String authType) {
            }

            public void checkServerTrusted(X509Certificate[] certs, String authType) {
            }

        }};

        SSLContext sc = null;
        try {
            sc = SSLContext.getInstance("SSL");
            sc.init(null, trustAllCerts, new java.security.SecureRandom());
            HttpsURLConnection.setDefaultSSLSocketFactory(sc.getSocketFactory());
        } catch (NoSuchAlgorithmException | KeyManagementException e) {
            e.printStackTrace();
        }

        // Create all-trusting host name verifier
        HostnameVerifier allHostsValid = (hostname, session) -> true;

        HttpsURLConnection.setDefaultHostnameVerifier(allHostsValid);

        setAuthentication(url);
        //Taken from Boilerpipe
        final HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        InputStream in = conn.getInputStream();
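        // Read the whole response body into memory as a byte array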
        byte[] byteArray = IOUtils.toByteArray(in);
        in.close();
        conn.disconnect();
        return byteArray;
    }

    private static void setAuthentication(URL url) {
        AuthenticationDTO authenticationDTO = WebCrawlerUtil.getAuthenticationFromUrl(url);
        if (Objects.nonNull(authenticationDTO)) {
            Authenticator.setDefault(new Authenticator() {
                protected PasswordAuthentication getPasswordAuthentication() {
                    return new PasswordAuthentication(authenticationDTO.getUserName(),
                            authenticationDTO.getPassword().toCharArray());
                }
            });
        }
    }
    }

But when I am checking the memory stats, the memory usage is increasing constantly. I verified this using VisualVM and the YourKit Java profiler.

Check the attached image.

Is there anything I am doing wrong? I searched for similar issues like this and this, but it was mentioned that this issue has been fixed in the latest versions.

Recommended Answer

Please use MemoryUsageSetting.setupTempFileOnly() while loading the document.
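For reference, here is a minimal sketch of that suggestion against the PDFBox 2.0.x API (the PdfTextExtractor class and extractText method below are illustrative names, not part of the question's code). PDDocument.load accepts a MemoryUsageSetting, and setupTempFileOnly() makes PDFBox buffer document data in temporary files on disk instead of in main memory:

    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.pdfbox.io.MemoryUsageSetting;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public class PdfTextExtractor {

        /**
         * Loads a PDF from a stream while keeping PDFBox's internal buffers
         * in temporary files on disk instead of in main memory.
         */
        public static String extractText(InputStream pdfStream) throws IOException {
            try (PDDocument document = PDDocument.load(pdfStream, MemoryUsageSetting.setupTempFileOnly())) {
                return new PDFTextStripper().getText(document);
            }
        }
    }

If parsing stays inside Tika's PDFParser as in the question, the equivalent memory settings are typically supplied through a PDFParserConfig registered in the ParseContext; the exact configuration options depend on the Tika version bundled with PDFBox 2.0.9.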
