Java: Read in text files from a directory, from the internet


Problem description

Does anybody know how to recursively read in files from a specific directory on the internet, in Java? I want to read in all the text files from this web directory: http://www.cs.ucdavis.edu/~davidson/courses/170-S11/Female/

I know how to read in multiple files that are in a folder on my computer, and I how to read in a single file from the internet. But how can I read in multiple files on the internet, without hardcoding the URLs in?

Stuff I tried:

// List the files on my Desktop
final File folder = new File("/Users/crystal/Desktop");
File[] listOfFiles = folder.listFiles();

for (int i = 0; i < listOfFiles.length; i++) {
    File fileEntry = listOfFiles[i];
    if (!fileEntry.isDirectory()) {
        System.out.println(fileEntry.getName());
    }
}

Another thing I tried:

// Reading data from the web 
try 
{
    // Create a URL object
    URL url = new URL("http://www.cs.ucdavis.edu/~davidson/courses/170-S11/Female/5_1_1.txt");

    // Read all of the text returned by the HTTP server
    BufferedReader in = new BufferedReader (new InputStreamReader(url.openStream()));

    String htmlText;      // String that holds current file line

    // Read through file one line at a time. Print line
    while ((htmlText = in.readLine()) != null) 
    {
        System.out.println(htmlText);
    }
    in.close();
} catch (MalformedURLException e) {
    e.printStackTrace();
} catch (IOException e) {
    // If another exception is generated, print a stack trace
    e.printStackTrace();
}

Thanks!

Solution

Since the URL you mentioned has indexes enabled, you're in luck. You've got a few options here.

  1. Parse the HTML to find the href attributes of the a tags, using SAX2 or any other XML parser. HtmlUnit would also work, I think.
  2. Use a little regex magic to match the strings between <a href=" and "> and use those as the URLs to read from.
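Option 2 could be sketched as below. The LinkExtractor class name, the extractHrefs helper, and the sample input line are my own illustration, not from the original answer; the pattern is essentially the regex suggested at the end of this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // Pull every href value out of the anchor tags in a chunk of HTML.
    // A directory index page puts one <a href="filename"> per listed file.
    static List<String> extractHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = Pattern.compile("<a href=\"(.+?)\">").matcher(html);
        while (m.find()) {
            hrefs.add(m.group(1));   // group 1 is the text between the quotes
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // A line shaped like the ones an auto-generated directory index produces
        String line = "<a href=\"5_1_1.txt\">5_1_1.txt</a> <a href=\"5_1_2.txt\">5_1_2.txt</a>";
        for (String href : extractHrefs(line)) {
            System.out.println(href);
        }
    }
}
```

You would feed each line of the index page (read with the same BufferedReader loop as in the question) through extractHrefs and collect the results.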

Once you've got a list of all the URLs you need, then the second piece of code should work just fine. Just iterate over your list, and construct your URL from that list.
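Putting that together might look like the sketch below. The DirectoryReader class, the buildUrl helper, and the hard-coded example file names are assumptions for illustration; in practice the names would come from parsing the index page as described above, and printFile reuses the question's own BufferedReader approach.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.List;

public class DirectoryReader {
    // Join the index page's base URL with a relative file name from an <a href>.
    static String buildUrl(String base, String name) {
        return base.endsWith("/") ? base + name : base + "/" + name;
    }

    // Read one text file over HTTP and print it, as in the question's second snippet.
    static void printFile(String fileUrl) throws IOException {
        URL url = new URL(fileUrl);
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        String base = "http://www.cs.ucdavis.edu/~davidson/courses/170-S11/Female/";
        // Stand-in list; really this would come from the extracted hrefs.
        List<String> names = List.of("5_1_1.txt", "5_1_2.txt");
        for (String name : names) {
            if (name.endsWith(".txt")) {   // skip parent-directory and sort-order links
                System.out.println(buildUrl(base, name));
                // printFile(buildUrl(base, name));   // uncomment to actually fetch
            }
        }
    }
}
```

The .txt filter is one way to drop the extra links the regex catches (parent directory, column-sorting links on the index page).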

Here's a sample regex that should match what you want. It does catch a few extra links, but you should be able to filter those out.

<a\ href="(.+?)">
