How to search for file contents in an Amazon S3 bucket without downloading the files

Problem description

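I have n files uploaded to Amazon S3, and I need to search them for occurrences of a string in their contents. I tried downloading each file from the S3 bucket, converting the input stream to a string, and then searching the content for the word, but with more than five or six files this takes a lot of time.

Is there any other way to do this? Thanks in advance.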

Solution

I am not familiar with Amazon S3, but the general way to deal with searching remote files is to use indexing, with the index itself stored on the remote server. Each search then uses the index to narrow things down to a relatively small number of potentially matching files, and only those are scanned directly to verify whether they really match. Depending on your search terms and the complexity of the pattern, it might even be possible to avoid the direct file scan altogether.

That said, I do not know whether Amazon S3 has an indexing engine that you can use or whether there are supplemental libraries that do that for you, but the concept is simple enough that you should be able to get something working by yourself without too much work.

EDIT:

Generally, the tokens that exist in each file are what is indexed. For example, if you want to search for "foo bar", the index will tell you which files contain "foo" and which contain "bar". The intersection of these results is the set of files that contain both "foo" and "bar". You will then have to scan those files directly to select the ones (if any) where "foo" and "bar" are right next to each other in the right order.
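To make the idea concrete, here is a minimal sketch of a token index in Python. It is only an illustration under assumptions: the `files` dictionary is a hypothetical stand-in for the objects in a bucket, all function names are made up for this example, and nothing here calls an actual S3 API.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(files):
    """Map each token to the set of file names whose contents contain it."""
    index = defaultdict(set)
    for name, text in files.items():
        for token in tokenize(text):
            index[token].add(name)
    return index


def candidate_files(index, query):
    """Intersect the per-token file sets: files that contain every query token."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


def phrase_search(files, index, query):
    """Scan only the candidates to confirm the tokens are adjacent and in order."""
    wanted = tokenize(query)
    n = len(wanted)
    matches = []
    for name in sorted(candidate_files(index, query)):
        tokens = tokenize(files[name])
        if any(tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1)):
            matches.append(name)
    return matches


# Hypothetical stand-in for bucket contents: file name -> text.
files = {
    "a.txt": "the foo bar service",
    "b.txt": "bar comes before foo here",
    "c.txt": "only foo in this one",
}
index = build_index(files)

print(candidate_files(index, "foo bar"))       # files containing both tokens
print(phrase_search(files, index, "foo bar"))  # files containing the exact phrase
```

In a real setup the index would presumably be built once, when files are uploaded, and stored next to them, so that a search only has to fetch the index and the few candidate files instead of every object.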

In any case, the amount of data that is downloaded to the client would be far less than downloading and scanning everything, although that would also depend on how your files are structured and what your search patterns look like.
