Downloaded PDF with Java is corrupt?

Views: 212 Published: 2017/7/13 11:52:57 Tags: java pdf url download


Problem description

I have read the excellent discussion about how to download and save a file from the internet using Java. However, if I execute the following code, I get a corrupt PDF. Any idea why?

import java.io.*;
import java.net.*;

public class PDFDownload {
    public static String URL = "http://www.nbc.com/Heroes/novels/downloads/";
    public static String FOLDER = "C:/Users/sdelamo/workspace/SandBox/HeroesNovel/";

    public static void main(String[] args) {
        String filename = "Heroes_novel_001.pdf";
        try {
            saveUrl(FOLDER + filename, URL + filename);
        } catch (MalformedURLException e) {
            System.out.println("MalformedURLException");
        } catch (IOException e) {
            System.out.println("IOException");                              
        }                       
    }       



    public static void saveUrl(String filename, String urlString) throws MalformedURLException, IOException {
        BufferedInputStream in = null;
        FileOutputStream fout = null;
        try {
            URL url = new URL(urlString);
            in = new BufferedInputStream(url.openStream());
            fout = new FileOutputStream(filename);

            byte data[] = new byte[1024];
            int count;
            while ((count = in.read(data, 0, 1024)) != -1) {
                fout.write(data, 0, count);
            }
        } finally {
            if (in != null)
                in.close();
            if (fout != null)
                fout.close();
        }
    }
}

The above code downloads HTML instead of a PDF. This is the output:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.1//EN"
    "http://www.w3.org/TR/xhtml-basic/xhtml-basic11.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>

<meta name="viewport" content="width=240, user-scalable=yes" />
<HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
<META HTTP-EQUIV="Expires" CONTENT="-1">
<meta http-equiv="Cache-control" content="no-cache">
<meta http-equiv="Cache-control" content="must-revalidate">
<meta http-equiv="Cache-control" content="max-age=0">
<meta http-equiv="refresh" content="200">

<title>NBC.com: Heroes</title>
<link rel="stylesheet" type="text/css"  href="/style/default.css?sid=8a9212f822e1c675330ec418bc531169" />
<link rel="stylesheet" type="text/css"  href="/style/hro.css?sid=8a9212f822e1c675330ec418bc531169" /> 

</head>
<body>
<center><img src="http://oimg.nbcuni.com/b/ss/nbcunbcnetworkwapbu,nbcuwapsitebu/5/H.8--WAP/4aa0e4cb8b448?vid=8a9212f822e1c675330ec418bc531169&gn=NBC.com Front Door&c2=&c3=Miscellaneous&c4=&c6=m.nbc.com/show/hro&c8=TV Entertainment&c9=NBC Network&c10=&c11= | &c12= | &c25=offdeck&c27=internal&c29=&c44=D=User-Agent&r=" width="5" height="5" border="0" /></center>
<h1 id="fHeader">
<a  href="/?sid=8a9212f822e1c675330ec418bc531169">
<img src="/images/nbc_logo.gif" alt="NBC : logo" border="0" />
</a>
</h1>

<h2>
<a  href="/show/hro?sid=8a9212f822e1c675330ec418bc531169">
<img src="/images/shows/1221684699_Heroes_WAP_166x54.jpg" alt="Heroes : showheader" border="0" />
</a>
</h2>
<div id="tunein_nexton">
    <span id="tunein">Mondays 9/8c</span>
</div><!--end #tunein_nexton-->
<div id="tunein_nexton">
    <!--<span id="tunein">Mondays 8/7c</span>-->

    <p id="nexton"><span class="sectiontitle"></span></p>
</div><!--end #tunein_nexton-->
<div id="featuredcontent">
    <h3>FEATURED CONTENT</h3>
    <table id="featuredItemsTable">

        <tr>
            <td><a  href="/show/hro/videos.html?sid=8a9212f822e1c675330ec418bc531169"><img src="/images/hro/nbc_hro_pro_040X921HRO120FLYPSIDE_exp921_20090_543_large.jpg" alt="featured" /></a>
            </td>
            <td>
                <span class="ftitle">Dreams</span>
                <span class="fdesc">Heroes premieres Mon., Sept. 21s...</span>
            </td>
        </tr>
                                        <tr>
            <td><a  href="/show/hro/recaps.html?sid=8a9212f822e1c675330ec418bc531169"><img src="http://origin-www.nbc.com/Heroes/images/episodes/season3/325/hro_325_01.jpg" alt="featured" height="45" width="80"/></a>
            </td>
            <td>
                <span class="ftitle">Recap:</span>
                <span class="fdesc">Season 3 Episode An Invisible Thread</span>
            </td>
        </tr>
                                        <tr>
            <td><a  href="/show/hro/photos.html?sid=8a9212f822e1c675330ec418bc531169"><img src="http://origin-www.nbc.com/app2/img/200x200xS/scet/photos/51/3736/NUP_110031_0323.JPG" alt="featured" height="45" width="80"/></a>
            </td>
            <td class="finfo">
                <span class="ftitle">Photo:</span>
                <span class="fdesc">Heroes "Cast Photos"</span>
            </td>
        </tr>
                    </table>


</div><!--end #featuredcontent-->

<h3>HEROES</h3>
<table class="showNav">
    <tr><td><a  href="/show/hro/about.html?sid=8a9212f822e1c675330ec418bc531169" accesskey="1">About</a></td></tr>
        <tr><td><a  href="/show/hro/videos.html?sid=8a9212f822e1c675330ec418bc531169" accesskey="2">Videos</a></td></tr>
                <tr><td><a  href="/show/hro/recaps.html?sid=8a9212f822e1c675330ec418bc531169" accesskey="3">Episode Recaps</a></td></tr>
                    <tr><td><a  href="/show/hro/photos.html?sid=8a9212f822e1c675330ec418bc531169" accesskey="4">Photos</a></td></tr>
                <tr><td><a  href="/show/hro/community.html?sid=8a9212f822e1c675330ec418bc531169" accesskey="5">Community</a></td></tr>
    <tr><td><a  href="/shows.shtml?sid=8a9212f822e1c675330ec418bc531169" accesskey="6">Shows List</a></td></tr>
</table>
<!-- <a  href="http://www.insightexpress.com/ix/Survey.aspx?id=151580&accessCode=3161643404&sid=8a9212f822e1c675330ec418bc531169" ><img src="/images/mNBCcom_166x54.jpg" border="0"></a> -->



<div class="footer" align="center"><a  href="http://m.nbc.com?sid=8a9212f822e1c675330ec418bc531169"><strong>NBC Mobile Main</strong></a> | <a  href="/terms.shtml?sid=8a9212f822e1c675330ec418bc531169"><strong>Terms of Use</strong></a> | <a  href="/privacy.shtml?sid=8a9212f822e1c675330ec418bc531169"><strong>Privacy</strong></a></div><div class="cpyrt" align="center">&#169; NBC Universal, Inc.</div>

</body>
</html>

Any idea how to download the PDF?

SOLUTION

Set the User-Agent request header before connecting. In saveUrl, replace url.openStream() with an explicit HttpURLConnection:

// Open the connection explicitly so a header can be set before the request is sent.
URL u = new URL(urlString);
HttpURLConnection huc = (HttpURLConnection) u.openConnection();
huc.setRequestMethod("GET");
// Send a browser-like User-Agent instead of Java's default one.
huc.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 (.NET CLR 3.5.30729)");
huc.connect();

in = new BufferedInputStream(huc.getInputStream());
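
For reference, here is a minimal sketch of how the fragment above might fit into a complete saveUrl method. The class name UserAgentDownload, the USER_AGENT constant, and the copy helper are made up for illustration and are not part of the original question or answer. The sketch uses try-with-resources in place of the manual close calls and assumes the URL is an http or https URL.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Paths;

final class UserAgentDownload {
    // Illustrative browser-like value; any string the server accepts would do.
    static final String USER_AGENT =
        "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2";

    // Copies all bytes from in to out and returns how many bytes were copied.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[1024];
        long total = 0;
        int count;
        while ((count = in.read(buffer)) != -1) {
            out.write(buffer, 0, count);
            total += count;
        }
        return total;
    }

    // Downloads urlString into filename, sending the User-Agent header with the request.
    static long saveUrl(String filename, String urlString) throws IOException {
        // The cast assumes an http/https URL; other schemes would fail here.
        HttpURLConnection huc =
            (HttpURLConnection) URI.create(urlString).toURL().openConnection();
        huc.setRequestMethod("GET");
        // Request properties must be set before the connection is opened.
        huc.setRequestProperty("User-Agent", USER_AGENT);
        try (InputStream in = huc.getInputStream();
             OutputStream out = Files.newOutputStream(Paths.get(filename))) {
            return copy(in, out);
        } finally {
            huc.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.out.println("usage: UserAgentDownload <url> <target-file>");
            return;
        }
        System.out.println(saveUrl(args[1], args[0]) + " bytes written");
    }
}
```

Calling getInputStream() connects implicitly, so the explicit connect() call from the fragment is left out here; the header still has to be set before that point.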

SOLUTION

This is the same issue as in your other question. NBC.com doesn't send the PDF back to you if it thinks you are a scraper :)

The same trick will do:

conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13");
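
To tell whether the server really sent a PDF rather than an HTML page like the one in the question, one option is to look at the first bytes of the saved file, since well-formed PDF files normally begin with the marker %PDF-. The sketch below is only an illustration of that check; the PdfCheck class and its method names are invented here and do not come from the answers.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class PdfCheck {
    // Marker that normally opens a PDF file.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Returns true if the given bytes begin with the PDF marker.
    static boolean startsWithPdfHeader(byte[] head) {
        if (head.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (head[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // Reads the first few bytes of the file and checks them against the marker.
    static boolean looksLikePdf(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = new byte[PDF_MAGIC.length];
            int off = 0;
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (off < head.length) {
                int n = in.read(head, off, head.length - off);
                if (n == -1) {
                    break;
                }
                off += n;
            }
            return off == head.length && startsWithPdfHeader(head);
        }
    }
}
```

A file that starts with <?xml or <!DOCTYPE, as in the output shown in the question, fails this check, which suggests the server returned a web page instead of the document.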
