Extracting phrases from a text file in Java

Question

I'm working on a host-based intrusion detection project using the ADFA-LD dataset, and I'm currently on the feature extraction module. I have constructed a phrase dictionary consisting of system call phrases of length 4. For feature extraction, I now need to compare these phrases with new system call traces (a sample follows):

sys_clock_gettime sys_poll sys_poll sys_clock_gettime sys_poll sys_poll sys_poll sys_clock_gettime sys_poll sys_clock_gettime sys_poll sys_poll sys_poll sys_poll sys_poll sys_poll sys_poll sys_poll sys_socketcall.......

What I need to know is how to compare these phrases with the new traces. I'm working in Java.

My phrase dictionary:

sys_socketcall-sys_poll-sys_clock_gettime-sys_poll

sys_clock_gettime-sys_poll-sys_poll-sys_socketcall

sys_poll-sys_socketcall-sys_poll-sys_clock_gettime

sys_poll-sys_clock_gettime-sys_clock_gettime-sys_clock_gettime

sys_clock_gettime-sys_clock_gettime-sys_socketcall-sys_clock_gettime

sys_socketcall-sys_clock_gettime-sys_poll-sys_poll

sys_poll-sys_poll

I use '-' as the separator when comparing these phrases with the new traces, which is why I joined the unique system calls with '-'.

Recommended answer

It looks like the words you want are separated by spaces. In that case, just read your file line by line and get the words with String.split(" "). Here is one way I can think of:

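import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.stream.Stream;

public class FileSplitter {

    public static void main(String[] args) throws IOException {
        File file = new File("input_file.txt");
        LinkedList<String> words = new LinkedList<String>();
        int i = 0;

        // Read the file line by line and collect the space-separated tokens.
        // try-with-resources closes the underlying file when the stream is done.
        try (Stream<String> lines = Files.lines(file.toPath())) {
            lines.forEachOrdered(line -> words.addAll(Arrays.asList(line.split(" "))));
        }

        // Print the non-empty tokens, four per output line.
        for (String word : words) {
            if (word.trim().length() > 0) {
                System.out.print(word.trim() + " ");
                if (i++ >= 3) {
                    System.out.println();
                    i = 0;
                }
            }
        }
    }
}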

For your example, it prints the tokens in consecutive groups of four:

sys_clock_gettime sys_poll sys_poll sys_clock_gettime 
sys_poll sys_poll sys_poll sys_clock_gettime 
sys_poll sys_clock_gettime sys_poll sys_poll 
sys_poll sys_poll sys_poll sys_poll
sys_poll sys_poll sys_socketcall 
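
The output above splits the trace into consecutive, non-overlapping groups of four. To compare against the '-'-joined dictionary phrases from the question, one possible approach is to slide a window of length 4 over the trace one call at a time, join each window with '-', and count the dictionary hits. The following is only a minimal sketch under that assumption: the class name PhraseMatcher, the helpers tokenize and countPhrases, and the short demo trace are all hypothetical and do not come from the original post, and whether overlapping windows are the right matching rule for ADFA-LD features is not stated there either.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of the original answer.
public class PhraseMatcher {

    // Split one trace line on whitespace, skipping empty tokens.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Count how often each dictionary phrase occurs in the trace, using
    // overlapping windows of `length` consecutive system calls (assumption).
    static Map<String, Integer> countPhrases(List<String> trace,
                                             List<String> dictionary,
                                             int length) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String phrase : dictionary) {
            counts.put(phrase, 0);
        }
        for (int start = 0; start + length <= trace.size(); start++) {
            // Join the window with '-' so it has the same form as a dictionary entry.
            String window = String.join("-", trace.subList(start, start + length));
            counts.computeIfPresent(window, (key, value) -> value + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Two entries taken from the question's dictionary.
        List<String> dictionary = Arrays.asList(
                "sys_clock_gettime-sys_poll-sys_poll-sys_socketcall",
                "sys_socketcall-sys_poll-sys_clock_gettime-sys_poll");
        // Made-up trace for illustration only.
        List<String> trace = tokenize(
                "sys_poll sys_clock_gettime sys_poll sys_poll "
                + "sys_socketcall sys_poll sys_clock_gettime sys_poll");
        System.out.println(countPhrases(trace, dictionary, 4));
    }
}
```

The resulting map (one count per dictionary phrase, in dictionary order) could serve as a feature vector for a trace. Note that with a fixed window length of 4, shorter dictionary entries such as sys_poll-sys_poll would never match; they would need a separate pass with length 2.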
