计算文档中字符串的唯一出现次数 [英] counting unique occurrences of string in document

查看:28
本文介绍了计算文档中字符串的唯一出现次数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在将日志文件读入 java.对于日志文件中的每一行,我正在检查该行是否包含一个 IP 地址.如果该行包含一个 ip 地址,我想然后 +1 到该 ip 地址在日志文件中出现的次数的计数.我如何在 Java 中完成此操作?

I am reading a logfile into java. For each line in the logfile, I am checking to see if the line contains an ip address. If the line contains an ip address, I want to then +1 to the count of the number of times that ip address showed up in the log file. How can I accomplish this in Java?

下面的代码成功地从包含ip地址的每一行中提取了ip地址,但是计算ip地址出现次数的过程不起作用.

The code below successfully extracts the ip address from each line that contains an ip address, but the process for counting occurrences of ip addresses does not work.

void read(String fileName) throws IOException {
    BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(fileName)));
    int counter = 0;
    ArrayList<IPHolder> ips = new ArrayList<IPHolder>();
    try {
        String line;
        while ((line = br.readLine()) != null) {
            if(!getIP(line).equals("0.0.0.0")){
                if(ips.size()==0){
                    IPHolder newIP = new IPHolder();
                    newIP.setIp(getIP(line));
                    newIP.setCount(0);
                    ips.add(newIP);
                }
                for(int j=0;j<ips.size();j++){
                    if(ips.get(j).getIp().equals(getIP(line))){
                        ips.get(j).setCount(ips.get(j).getCount()+1);
                    }else{
                        IPHolder newIP = new IPHolder();
                        newIP.setIp(getIP(line));
                        newIP.setCount(0);
                        ips.add(newIP);
                    }
                }
                if(counter % 1000 == 0){System.out.println(counter+", "+ips.size());}
                counter+=1;
            }
        }
    } finally {br.close();}
    for(int k=0;k<ips.size();k++){
        System.out.println("ip, count: "+ips.get(k).getIp()+" , "+ips.get(k).getCount());
    }
}

public String getIP(String ipString){//extracts an ip from a string if the string contains an ip
    String IPADDRESS_PATTERN = 
    "(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)";

    Pattern pattern = Pattern.compile(IPADDRESS_PATTERN);
    Matcher matcher = pattern.matcher(ipString);
    if (matcher.find()) {
        return matcher.group();
    }
    else{
        return "0.0.0.0";
    }
}

持有者类是:

public class IPHolder {

    private String ip;
    private int count;

    public String getIp(){return ip;}
    public void setIp(String i){ip=i;}

    public int getCount(){return count;}
    public void setCount(int ct){count=ct;}
}

推荐答案

在这种情况下,要搜索的关键字是 HashMap.HashMap 是键值对的列表(在本例中为 ips 对及其计数).

The key word to search for is HashMap in this case. A HashMap is a list of key value pairs (in this case pairs of ips and their count).

"192.168.1.12" - 12
"192.168.1.13" - 17
"192.168.1.14" - 9

等等.使用和访问比总是遍历容器对象数组以找出该 IP 是否已经存在容器要容易得多.

and so on. It is much easier to use and access than to always iterate over your array of container objects to find out whether there already is a container for that ip or not.

BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(/*Your file */)));

HashMap<String, Integer> occurrences = new HashMap<String, Integer>();

String line = null;

while( (line = br.readLine()) != null) {

    // Iterate over lines and search for ip address patterns
    String[] addressesFoundInLine = ...;


    for(String ip: addressesFoundInLine ) {

        // Did you already have that address in your file earlier? If yes, increase its counter by 
        if(occurrences.containsKey(ip))
            occurrences.put(ip, occurrences.get(ip)+1);

        // If not, create a new entry for this address
        else
            occurrences.put(ip, 1);
    } 
}


// TreeMaps are automatically orered if their elements implement 'Comparable' which is the case for strings and integers
TreeMap<Integer, ArrayList<String>> turnedAround = new TreeMap<Integer, ArrayList<String>>();

Set<Entry<String, Integer>> es = occurrences.entrySet();

// Switch keys and values of HashMap and create a new TreeMap (in case there are two ips with the same count, add them to a list)
for(Entry<String, Integer> en: es) {

    if(turnedAround.containsKey(en.getValue()))         
        turnedAround.get(en.getValue()).add((String) en.getKey());
    else {
        ArrayList<String> ips = new ArrayList<String>();
        ips.add(en.getKey());
        turnedAround.put(en.getValue(), ips);
    }

}

// Print out the values (if there are two ips with the same counts they are printed out without an special order, that would require another sorting step)
for(Entry<Integer, ArrayList<String>> entry: turnedAround.entrySet()) {         
    for(String s: entry.getValue())
        System.out.println(s + " - " + entry.getKey());         
}

就我而言,输出如下:

192.168.1.19 - 4
192.168.1.18 - 7
192.168.1.27 - 19
192.168.1.13 - 19
192.168.1.12 - 28

我回答了这个问题 大约半小时前,我想这正是您要搜索的内容,所以如果您需要一些示例代码,请查看它.

I answered this question about half an hour ago and I guess that is exactly what you are searching for, so if you need some example code, take a look at it.

这篇关于计算文档中字符串的唯一出现次数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆