Counting unique occurrences of a string in a document

Problem description

I am reading a log file in Java. For each line, I check whether it contains an IP address. If it does, I want to add 1 to the count of how many times that IP address has appeared in the log file. How can I accomplish this in Java?

The code below successfully extracts the IP address from each line that contains one, but the logic for counting occurrences of each IP address does not work.

void read(String fileName) throws IOException {
    BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(fileName)));
    int counter = 0;
    ArrayList<IPHolder> ips = new ArrayList<IPHolder>();
    try {
        String line;
        while ((line = br.readLine()) != null) {
            if(!getIP(line).equals("0.0.0.0")){
                if(ips.size()==0){
                    IPHolder newIP = new IPHolder();
                    newIP.setIp(getIP(line));
                    newIP.setCount(0);
                    ips.add(newIP);
                }
                for(int j=0;j<ips.size();j++){
                    if(ips.get(j).getIp().equals(getIP(line))){
                        ips.get(j).setCount(ips.get(j).getCount()+1);
                    }else{
                        IPHolder newIP = new IPHolder();
                        newIP.setIp(getIP(line));
                        newIP.setCount(0);
                        ips.add(newIP);
                    }
                }
                if(counter % 1000 == 0){System.out.println(counter+", "+ips.size());}
                counter+=1;
            }
        }
    } finally {br.close();}
    for(int k=0;k<ips.size();k++){
        System.out.println("ip, count: "+ips.get(k).getIp()+" , "+ips.get(k).getCount());
    }
}

public String getIP(String ipString){//extracts an ip from a string if the string contains an ip
    String IPADDRESS_PATTERN = 
    "(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)";

    Pattern pattern = Pattern.compile(IPADDRESS_PATTERN);
    Matcher matcher = pattern.matcher(ipString);
    if (matcher.find()) {
        return matcher.group();
    }
    else{
        return "0.0.0.0";
    }
}

The holder class is:

public class IPHolder {

    private String ip;
    private int count;

    public String getIp(){return ip;}
    public void setIp(String i){ip=i;}

    public int getCount(){return count;}
    public void setCount(int ct){count=ct;}
}

Solution

The keyword to search for in this case is HashMap. A HashMap stores key-value pairs (here, each IP address paired with its count):

"192.168.1.12" - 12
"192.168.1.13" - 17
"192.168.1.14" - 9

and so on. It is much easier to use than iterating over your list of container objects every time to find out whether a container for that IP already exists.
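To illustrate the idea in isolation, here is a minimal sketch of the counting step, assuming Java 8 or later. The class and method names (`IpCountSketch`, `countOccurrences`) are made up for this example and do not appear in the question. It also shows why the list-based loop in the question misbehaves: that loop calls `ips.add` once for every list element that does not match the current IP, so the list fills up with duplicate holders, whereas a map has exactly one entry per key.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class IpCountSketch {

    // Counts how often each string occurs in the given list.
    static Map<String, Integer> countOccurrences(List<String> ips) {
        Map<String, Integer> occurrences = new HashMap<>();
        for (String ip : ips) {
            // merge() stores 1 for a new key, otherwise adds 1 to the existing count.
            occurrences.merge(ip, 1, Integer::sum);
        }
        return occurrences;
    }

    public static void main(String[] args) {
        // Hypothetical sample data, not taken from a real log.
        List<String> sample = Arrays.asList("192.168.1.12", "192.168.1.13", "192.168.1.12");
        System.out.println(countOccurrences(sample));
    }
}
```

`Map.merge` does the same job as the `containsKey`/`put` pair in the code that follows, just in a single call.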

BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(/*Your file */)));

HashMap<String, Integer> occurrences = new HashMap<String, Integer>();

String line = null;

while( (line = br.readLine()) != null) {

    // Iterate over lines and search for ip address patterns
    String[] addressesFoundInLine = ...;


    for(String ip: addressesFoundInLine ) {

        // Has this address already been seen earlier in the file? If so, increase its counter by one
        if(occurrences.containsKey(ip))
            occurrences.put(ip, occurrences.get(ip)+1);

        // If not, create a new entry for this address
        else
            occurrences.put(ip, 1);
    } 
}


// A TreeMap keeps its entries sorted by key when the keys implement 'Comparable', which is the case for strings and integers
TreeMap<Integer, ArrayList<String>> turnedAround = new TreeMap<Integer, ArrayList<String>>();

Set<Entry<String, Integer>> es = occurrences.entrySet();

// Switch keys and values of HashMap and create a new TreeMap (in case there are two ips with the same count, add them to a list)
for(Entry<String, Integer> en: es) {

    if(turnedAround.containsKey(en.getValue()))         
        turnedAround.get(en.getValue()).add(en.getKey());
    else {
        ArrayList<String> ips = new ArrayList<String>();
        ips.add(en.getKey());
        turnedAround.put(en.getValue(), ips);
    }

}

// Print out the values (IPs with the same count are printed in no particular order; ordering them would require another sorting step)
for(Entry<Integer, ArrayList<String>> entry: turnedAround.entrySet()) {         
    for(String s: entry.getValue())
        System.out.println(s + " - " + entry.getKey());         
}

In my case the output was the following:

192.168.1.19 - 4
192.168.1.18 - 7
192.168.1.27 - 19
192.168.1.13 - 19
192.168.1.12 - 28
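For reference, the pieces above can be put together into one self-contained sketch. This is only one possible way to do it, assuming Java 8 or later: it reuses the regular expression from the question, collects every match in a line with a `Matcher.find()` loop, and sorts the entries with a comparator instead of inverting the map into a TreeMap. The names `LogIpCounter`, `findIps`, `countIps` and `sortedByCount` are invented for this example, and the sample log text in `main` is made up.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogIpCounter {

    // Same IPv4 pattern as in the question, compiled once instead of once per line.
    private static final Pattern IP_PATTERN = Pattern.compile(
        "(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}"
        + "(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)");

    // Returns every IP address found in one line (possibly none).
    static List<String> findIps(String line) {
        List<String> found = new ArrayList<>();
        Matcher matcher = IP_PATTERN.matcher(line);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    // Reads all lines and counts how often each IP address occurs.
    static Map<String, Integer> countIps(BufferedReader reader) throws IOException {
        Map<String, Integer> occurrences = new HashMap<>();
        String line;
        while ((line = reader.readLine()) != null) {
            for (String ip : findIps(line)) {
                occurrences.merge(ip, 1, Integer::sum);
            }
        }
        return occurrences;
    }

    // Returns the entries ordered by ascending count; ties are ordered by IP string.
    static List<Map.Entry<String, Integer>> sortedByCount(Map<String, Integer> occurrences) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(occurrences.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue()
            .thenComparing(Map.Entry.<String, Integer>comparingByKey()));
        return entries;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample input; wrap a FileReader instead to read a real log file.
        String log = "GET / from 192.168.1.12\n"
            + "no address here\n"
            + "POST /x from 192.168.1.13\n"
            + "GET /y from 192.168.1.12\n";
        try (BufferedReader reader = new BufferedReader(new StringReader(log))) {
            for (Map.Entry<String, Integer> entry : sortedByCount(countIps(reader))) {
                System.out.println(entry.getKey() + " - " + entry.getValue());
            }
        }
    }
}
```

Sorting the entry list directly avoids the intermediate `TreeMap<Integer, ArrayList<String>>` and gives IPs with equal counts a defined order. Note that the tie-break compares the addresses as plain strings, not numerically.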

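I answered a similar question (http://stackoverflow.com/questions/27325042/read-a-txt-file-and-return-a-list-of-words-with-their-frequency-in-the-file/27325105#27325105) about half an hour ago, and I think it is exactly what you are looking for, so take a look if you need some example code.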
