Looking for a C++ implementation of the C4.5 algorithm

Question

I've been looking for a C++ implementation of the C4.5 algorithm, but I haven't been able to find one yet. I found Quinlan's C4.5 Release 8, but it's written in C. Has anybody seen an open source C++ implementation of the C4.5 algorithm?

I'm thinking about porting the J48 source code (or simply writing a wrapper around the C version) if I can't find an open source C++ implementation, but I hope I don't have to do that! Please let me know if you have come across a C++ implementation of the algorithm.

I've been considering writing a thin C++ wrapper around the C implementation of the C5.0 algorithm (C5.0 is the improved version of C4.5). I downloaded and compiled the C implementation of C5.0, but it doesn't look easy to port to C++. The C implementation uses a lot of global variables, so a thin C++ wrapper around the C functions will not give an object-oriented design: every class instance would be modifying the same global parameters. In other words, I would have no encapsulation, and that's a pretty basic thing that I need.

To get encapsulation I would need to do a full-blown port of the C code to C++, which is about as much work as porting the Java version (J48) to C++.
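The global-state problem described above can be illustrated with a minimal sketch. Everything here is hypothetical: `c_style`, `g_min_cases`, `ThinWrapper` and `PortedClassifier` are made-up names standing in for a C library with globals, a thin wrapper over it, and a properly ported class. None of them come from the C4.5 or C5.0 sources.

```cpp
#include <cassert>

// Hypothetical stand-in for a C library that keeps its settings in globals.
namespace c_style {
int g_min_cases = 2;                          // global tuning parameter
void set_min_cases(int n) { g_min_cases = n; }
int  get_min_cases() { return g_min_cases; }
}  // namespace c_style

// Thin wrapper: looks like an object, but every instance forwards to the
// same global, so instances cannot hold different settings.
class ThinWrapper {
public:
    void setMinCases(int n) { c_style::set_min_cases(n); }
    int  minCases() const { return c_style::get_min_cases(); }
};

// Ported class: the parameter is a data member, so instances are independent.
class PortedClassifier {
public:
    explicit PortedClassifier(int minCases) : minCases_(minCases) {
        assert(minCases > 0);
    }
    void setMinCases(int n) { assert(n > 0); minCases_ = n; }
    int  minCases() const { return minCases_; }

private:
    int minCases_;
};
```

With two `ThinWrapper` objects, calling `setMinCases` on one changes what the other reports, while two `PortedClassifier` objects keep their own values. Only the second design is safe to use from several threads at once.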

Here are some specific requirements:

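  1. Each classifier instance must encapsulate its own data (i.e. no global variables, aside from constant ones).
  2. Support the concurrent training of classifiers and the concurrent evaluation of classifiers.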

Here is a good scenario: suppose I'm doing 10-fold cross-validation and would like to concurrently train 10 decision trees, each on its own slice of the training set. If I just run the C program for each slice, I would have to run 10 processes, which is not horrible. However, if I need to classify thousands of data samples in real time, I would have to start a new process for each sample I want to classify, and that's not very efficient.
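The scenario above could look roughly like the following once each model owns its own state. This is only a sketch under stated assumptions: `Stump` is a toy one-threshold classifier used as a stand-in for a decision tree (it is not C4.5, and it assumes class 1 has the larger mean), and `Sample`, `Stump` and `trainFolds` are hypothetical names. Folds are assigned by index modulo `k`, and `std::async` from the standard library runs one training job per fold.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

struct Sample { double x; int label; };  // one numeric attribute, class 0 or 1

// Toy stand-in for a decision tree: a single threshold on x.
// All state lives in the object, so instances can be trained in parallel.
class Stump {
public:
    void train(const std::vector<Sample>& data) {
        assert(!data.empty());
        double sum0 = 0.0, sum1 = 0.0;
        std::size_t n0 = 0, n1 = 0;
        for (const Sample& s : data) {
            if (s.label == 0) { sum0 += s.x; ++n0; } else { sum1 += s.x; ++n1; }
        }
        // Threshold halfway between the class means (0 if a class is missing).
        threshold_ = (n0 && n1) ? 0.5 * (sum0 / n0 + sum1 / n1) : 0.0;
    }
    int classify(double x) const { return x > threshold_ ? 1 : 0; }

private:
    double threshold_ = 0.0;
};

// Train k models concurrently; model i never sees the samples of fold i.
std::vector<Stump> trainFolds(const std::vector<Sample>& data, std::size_t k) {
    assert(k > 1);
    std::vector<std::future<Stump>> jobs;
    for (std::size_t fold = 0; fold < k; ++fold) {
        jobs.push_back(std::async(std::launch::async, [&data, k, fold] {
            std::vector<Sample> train;
            for (std::size_t i = 0; i < data.size(); ++i)
                if (i % k != fold) train.push_back(data[i]);
            Stump model;
            model.train(train);
            return model;
        }));
    }
    std::vector<Stump> models;
    for (auto& job : jobs) models.push_back(job.get());  // wait for every job
    return models;
}
```

The returned models stay in memory, so classifying thousands of samples is a matter of calling `classify` in a loop rather than starting a process per sample. On some toolchains threading support has to be enabled at link time (for example `-pthread` with GCC).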

Recommended answer

I may have found a possible C++ "implementation" of C5.0 (See5.0), but I haven't been able to dig into the source code enough to determine whether it really works as advertised.

To reiterate my original concerns, the author of the port states the following about the C5.0 algorithm:

Another drawback with See5Sam [C5.0] is the impossibility to have more than one application tree at the same time. An application is read from files each time the executable is run and is stored in global variables here and there.

I will update my answer as soon as I get some time to look into the source code.

It's looking pretty good. Here is the C++ interface:

class CMee5
{
  public:

    /**
      Create a See 5 engine from tree/rules files.
      \param pcFileStem The stem of the See 5 file system. The engine
             initialisation will look for the following files:
              - pcFileStem.names Vanilla See 5 names file (mandatory)
              - pcFileStem.tree or pcFileStem.rules Vanilla See 5 tree or rules
                file (mandatory)
              - pcFileStem.costs Vanilla See 5 costs file (mandatory)
    */
    inline CMee5(const char* pcFileStem, bool bUseRules);

    /**
      Release allocated memory for this engine.
    */
    inline ~CMee5();

    /**
      General classification routine accepting a data record.
    */
    inline unsigned int classifyDataRec(DataRec Case, float* pOutConfidence);

    /**
      Show rules that were used to classify the last case.
      Classify() will have set RulesUsed[] to
      number of active rules for trial 0,
      first active rule, second active rule, ..., last active rule,
      number of active rules for trial 1,
      first active rule, second active rule, ..., last active rule,
      and so on.
    */
    inline void showRules(int Spaces);

    /**
      Open file with given extension for read/write with the actual file stem.
    */
    inline FILE* GetFile(String Extension, String RW);

    /**
      Read a raw case from file Df.

      For each attribute, read the attribute value from the file.
      If it is a discrete valued attribute, find the associated no.
      of this attribute value (if the value is unknown this is 0).

      Returns the array of attribute values.
    */
    inline DataRec GetDataRec(FILE *Df, Boolean Train);
    inline DataRec GetDataRecFromVec(float* pfVals, Boolean Train);
    inline float TranslateStringField(int Att, const char* Name);

    inline void Error(int ErrNo, String S1, String S2);

    inline int getMaxClass() const;
    inline int getClassAtt() const;
    inline int getLabelAtt() const;
    inline int getCWtAtt() const;
    inline unsigned int getMaxAtt() const;
    inline const char* getClassName(int nClassNo) const;
    inline char* getIgnoredVals();

    inline void FreeLastCase(void* DVec);
};

I would say that this is the best alternative I've found so far.
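As a rough illustration of how such an interface might be used to keep more than one engine alive at once, here is a sketch. `StubEngine` is a hypothetical stand-in that copies only the call shapes of `GetDataRecFromVec`, `classifyDataRec` and `getClassName` from the listing; it thresholds one value instead of loading a tree, and `classifyAll` is a made-up helper. Whether the real class needs records to be released (for instance through `FreeLastCase`) is not stated in the listing, so the sketch leaves that out.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in with the same call shapes as three CMee5 members.
// The real engine reads .names/.tree files; this one just thresholds.
class StubEngine {
public:
    using DataRec = float*;
    explicit StubEngine(float threshold) : threshold_(threshold) {}

    DataRec GetDataRecFromVec(float* pfVals, bool /*Train*/) { return pfVals; }

    unsigned int classifyDataRec(DataRec Case, float* pOutConfidence) {
        assert(Case != nullptr && pOutConfidence != nullptr);
        *pOutConfidence = 1.0f;  // placeholder confidence
        return Case[0] > threshold_ ? 1u : 0u;
    }

    const char* getClassName(int nClassNo) const {
        return nClassNo == 1 ? "yes" : "no";
    }

private:
    float threshold_;  // per-instance state, no globals
};

// Classify every sample with the given engine and return the class names.
// Works with any type that offers the three members used below.
template <typename Engine>
std::vector<std::string> classifyAll(Engine& engine,
                                     std::vector<std::vector<float>>& samples) {
    std::vector<std::string> names;
    for (std::vector<float>& values : samples) {
        auto rec = engine.GetDataRecFromVec(values.data(), false);
        float confidence = 0.0f;
        unsigned int cls = engine.classifyDataRec(rec, &confidence);
        names.push_back(engine.getClassName(static_cast<int>(cls)));
    }
    return names;
}
```

Because each engine object carries its own state, two engines built from different models can classify the same samples in one process and give different answers, which is exactly what the global-variable design of the original C code rules out.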
