计算文档中字符串的唯一出现次数 [英] counting unique occurrences of string in document

查看:49
本文介绍了计算文档中字符串的唯一出现次数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在将日志文件读入Java.对于日志文件中的每一行,我正在检查该行是否包含ip地址.如果该行包含一个IP地址,那么我想对该日志文件中该IP地址显示的次数进行+1.如何在Java中完成此操作?

I am reading a logfile into java. For each line in the logfile, I am checking to see if the line contains an ip address. If the line contains an ip address, I want to then +1 to the count of the number of times that ip address showed up in the log file. How can I accomplish this in Java?

下面的代码成功地从包含ip地址的每一行中提取了ip地址,但是对ip地址的出现进行计数的过程不起作用.

The code below successfully extracts the ip address from each line that contains an ip address, but the process for counting occurrences of ip addresses does not work.

void read(String fileName) throws IOException {
    BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(fileName)));
    int counter = 0;
    ArrayList<IPHolder> ips = new ArrayList<IPHolder>();
    try {
        String line;
        while ((line = br.readLine()) != null) {
            if(!getIP(line).equals("0.0.0.0")){
                if(ips.size()==0){
                    IPHolder newIP = new IPHolder();
                    newIP.setIp(getIP(line));
                    newIP.setCount(0);
                    ips.add(newIP);
                }
                for(int j=0;j<ips.size();j++){
                    if(ips.get(j).getIp().equals(getIP(line))){
                        ips.get(j).setCount(ips.get(j).getCount()+1);
                    }else{
                        IPHolder newIP = new IPHolder();
                        newIP.setIp(getIP(line));
                        newIP.setCount(0);
                        ips.add(newIP);
                    }
                }
                if(counter % 1000 == 0){System.out.println(counter+", "+ips.size());}
                counter+=1;
            }
        }
    } finally {br.close();}
    for(int k=0;k<ips.size();k++){
        System.out.println("ip, count: "+ips.get(k).getIp()+" , "+ips.get(k).getCount());
    }
}

public String getIP(String ipString){//extracts an ip from a string if the string contains an ip
    String IPADDRESS_PATTERN = 
    "(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)";

    Pattern pattern = Pattern.compile(IPADDRESS_PATTERN);
    Matcher matcher = pattern.matcher(ipString);
    if (matcher.find()) {
        return matcher.group();
    }
    else{
        return "0.0.0.0";
    }
}

持有人类别为:

public class IPHolder {

    private String ip;
    private int count;

    public String getIp(){return ip;}
    public void setIp(String i){ip=i;}

    public int getCount(){return count;}
    public void setCount(int ct){count=ct;}
}

推荐答案

在这种情况下,要搜索的关键字是HashMap. HashMap是键值对(在这种情况下为ips对及其计数)的列表.

The key word to search for is HashMap in this case. A HashMap is a list of key value pairs (in this case pairs of ips and their count).

"192.168.1.12" - 12
"192.168.1.13" - 17
"192.168.1.14" - 9

,依此类推. 与总是遍历容器对象数组以查找是否已经有该IP的容器相比,使用和访问要容易得多.

and so on. It is much easier to use and access than to always iterate over your array of container objects to find out whether there already is a container for that ip or not.

BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(/*Your file */)));

HashMap<String, Integer> occurrences = new HashMap<String, Integer>();

String line = null;

while( (line = br.readLine()) != null) {

    // Iterate over lines and search for ip address patterns
    String[] addressesFoundInLine = ...;


    for(String ip: addressesFoundInLine ) {

        // Did you already have that address in your file earlier? If yes, increase its counter by 
        if(occurrences.containsKey(ip))
            occurrences.put(ip, occurrences.get(ip)+1);

        // If not, create a new entry for this address
        else
            occurrences.put(ip, 1);
    } 
}


// TreeMaps are automatically orered if their elements implement 'Comparable' which is the case for strings and integers
TreeMap<Integer, ArrayList<String>> turnedAround = new TreeMap<Integer, ArrayList<String>>();

Set<Entry<String, Integer>> es = occurrences.entrySet();

// Switch keys and values of HashMap and create a new TreeMap (in case there are two ips with the same count, add them to a list)
for(Entry<String, Integer> en: es) {

    if(turnedAround.containsKey(en.getValue()))         
        turnedAround.get(en.getValue()).add((String) en.getKey());
    else {
        ArrayList<String> ips = new ArrayList<String>();
        ips.add(en.getKey());
        turnedAround.put(en.getValue(), ips);
    }

}

// Print out the values (if there are two ips with the same counts they are printed out without an special order, that would require another sorting step)
for(Entry<Integer, ArrayList<String>> entry: turnedAround.entrySet()) {         
    for(String s: entry.getValue())
        System.out.println(s + " - " + entry.getKey());         
}

在我的情况下,输出如下:

In my case the output was the following:

192.168.1.19 - 4
192.168.1.18 - 7
192.168.1.27 - 19
192.168.1.13 - 19
192.168.1.12 - 28

我回答了

I answered this question about half an hour ago and I guess that is exactly what you are searching for, so if you need some example code, take a look at it.

这篇关于计算文档中字符串的唯一出现次数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆