Use Java threads to download a large number of web pages
Question
Hello, I have a problem when crawling a large number of web pages (there are more than 50,000 URLs to process): it executes very slowly, so I want to refactor it to use threads. Can anybody give me some ideas? Thank you very much!
import java.io.*;
import java.net.*;
import java.sql.*;
import java.util.*;

public class Down2011CaseMeshTread extends Thread {
    public static int count = 0;
    public static List<String> docDoiList = getCaseDoiList2();
    // These must be instance fields: declaring them static means every newly
    // constructed thread overwrites the URL and DOI of the threads before it.
    private final URL url;
    private final String doi;
    static final int BUFFER_SIZE = 1024 * 10;

    public Down2011CaseMeshTread(String doi) throws MalformedURLException {
        String urlStr = "http://www.codeproject.com/" + doi;
        this.url = new URL(urlStr);
        this.doi = doi;
    }

    public static Connection getConnection() throws Exception {
        String driver = "com.mysql.jdbc.Driver";
        String url = "jdbc:mysql://192.168.1.102:3306/clef11";
        String username = "root";
        String password = "111111";
        Class.forName(driver);
        return DriverManager.getConnection(url, username, password);
    }

    public static String getDocNameByDoi(String docDoi) {
        ResultSet rs = null;
        Connection conn = null;
        PreparedStatement pstmt = null;
        String docName = null;
        try {
            conn = getConnection();
            String query = "select filename from casebase where doi = ?";
            pstmt = conn.prepareStatement(query);
            pstmt.setString(1, docDoi);
            rs = pstmt.executeQuery();
            while (rs.next()) {
                String name = rs.getString(1);
                // strip the file extension
                docName = name.substring(0, name.lastIndexOf("."));
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (rs != null) rs.close();
                if (pstmt != null) pstmt.close();
                if (conn != null) conn.close();
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }
        return docName;
    }

    public void download() throws IOException {
        BufferedReader in = null;
        BufferedWriter out = null;
        try {
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("User-Agent",
                    "Mozilla/4.76 (compatible; MSIE 5.0; Windows NT; DigExt)");
            conn.setConnectTimeout(1000 * 60 * 10);
            // Read from the connection configured above; calling
            // url.openStream() here would open a second, unconfigured connection.
            in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
            FileOutputStream fo = new FileOutputStream(
                    "/home/boge/workspace1/IDF/case_mesh/" + getDocNameByDoi(doi));
            out = new BufferedWriter(new OutputStreamWriter(fo, "utf-8"));
            char[] buffer = new char[BUFFER_SIZE];
            int charsRead;
            while ((charsRead = in.read(buffer, 0, BUFFER_SIZE)) != -1) {
                out.write(buffer, 0, charsRead);
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // close in a finally block so the streams are released even on error
            if (out != null) out.close();
            if (in != null) in.close();
        }
    }

    public void run() {
        try {
            download();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static List<String> getCaseDoiList2() {
        List<String> docDoiList = new LinkedList<String>();
        BufferedReader br = null;
        try {
            br = new BufferedReader(new InputStreamReader(
                    new FileInputStream("/home/boge/5.28.3")));
            String line;
            while ((line = br.readLine()) != null) {
                docDoiList.add(line.trim());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return docDoiList;
    }

    public static List<String> getFiles(String fileName) {
        List<String> fileList = new ArrayList<String>();
        File directory = new File(fileName);
        for (File file : directory.listFiles()) {
            if (file.isFile() && !file.isHidden()) {
                fileList.add(file.getName());
            }
        }
        return fileList;
    }

    public static void main(String[] args) throws MalformedURLException {
        List<String> filedowns = getFiles("/home/boge/workspace1/IDF/case_mesh");
        for (String docDoi : docDoiList) {
            // skip documents that have already been downloaded
            if (!filedowns.contains(getDocNameByDoi(docDoi))) {
                // Note: this starts one thread per remaining URL, with no upper bound.
                Down2011CaseMeshTread down = new Down2011CaseMeshTread(docDoi);
                down.start();
            }
        }
    }
}
The code works, but far too slowly. Can you help me? :)
Recommended answer
You can use a ThreadPoolExecutor to split your application's payload into Runnable tasks.
Your code is a bit too long for me to clean up, so I'll just show you an example of how to download web pages in parallel using this approach:
package threadtest;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class Program {
    private class Downloader implements Runnable {
        private final URL url;

        public Downloader(URL url) {
            this.url = url;
        }

        private String readAll(Reader reader) throws IOException {
            StringBuilder builder = new StringBuilder();
            int read;
            while ((read = reader.read()) != -1) {
                builder.append((char) read);
            }
            return builder.toString();
        }

        @Override
        public void run() {
            try {
                Reader reader = null;
                try {
                    reader = new BufferedReader(new InputStreamReader(url.openStream()));
                    String result = readAll(reader);
                    System.out.printf("Read %d characters from %s\n", result.length(), url);
                } finally {
                    if (reader != null)
                        reader.close();
                }
            } catch (IOException e) {
                System.err.println(e);
            }
        }
    }

    public void runIt() throws MalformedURLException {
        BlockingQueue<Runnable> runnables = new ArrayBlockingQueue<Runnable>(1024);
        ThreadPoolExecutor executor = new ThreadPoolExecutor(8, 16, 60, TimeUnit.SECONDS, runnables);
        // submit a batch of identical download tasks to the pool
        for (int i = 0; i < 12; i++) {
            executor.submit(new Downloader(new URL("http://www.google.com")));
        }
        executor.shutdown();
    }

    public static void main(String[] args) throws IOException {
        Program program = new Program();
        program.runIt();
        System.in.read();
    }
}
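To adapt this to the original 50,000 URLs, note that the bounded ArrayBlockingQueue(1024) in the example will make submit throw a RejectedExecutionException once the pool is saturated and the queue fills; a fixed pool from Executors, which uses an unbounded work queue, sidesteps that. Here is a minimal sketch of that shape, with a placeholder DOI list and empty task bodies standing in for the real downloads (the class and variable names are illustrative, not from the original code):

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class DownloadPool {
    public static void main(String[] args) throws InterruptedException {
        // Placeholder DOIs; the real code would build this list from its input file.
        List<String> dois = Arrays.asList("doi1", "doi2", "doi3", "doi4");

        // A fixed pool with an unbounded work queue: 8 worker threads no matter
        // how many tasks are submitted. This avoids both the one-thread-per-URL
        // explosion and the RejectedExecutionException a full bounded queue
        // would throw with 50,000 submissions.
        ExecutorService executor = Executors.newFixedThreadPool(8);
        AtomicInteger completed = new AtomicInteger();

        for (String doi : dois) {
            executor.submit(() -> {
                // the download for one DOI would run here
                completed.incrementAndGet();
            });
        }

        executor.shutdown();                            // accept no new tasks
        executor.awaitTermination(1, TimeUnit.MINUTES); // wait for the queue to drain
        System.out.println("completed=" + completed.get());
    }
}
```

Using shutdown() plus awaitTermination() (instead of System.in.read()) lets the program exit as soon as the last task finishes.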
Hope this helps,
Fredrik Bornander