C4.5 is a commonly used decision tree algorithm for classification in data mining. Existing C4.5 implementations run serially. We implement the algorithm on the Hadoop MapReduce framework, which can run in parallel across multiple machines. In this project we compare our results with those of Weka, where C4.5 is implemented serially, using data sources of different sizes.
Algorithm:
CurrentNode is the node assumed for splitting.
Checks whether the instance belongs to CurrentNode.
For every uncovered attribute, outputs the attribute index, its value, and the class label of the instance.
Counts the number of occurrences of each combination of (index, value, class label) and prints the count against it.
We calculate the Gain Ratio from the data available from
All the child (split) nodes that are made from the parent node are pushed onto the queue.
Every Node is represented by a list of attribute indexes and
While(CurrentNode is not last Node in Queue)
if(Entropy!=0 we have some more uncovered attributes for
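The counting and gain-ratio steps outlined above can be sketched in plain Python, without Hadoop, to make the data flow easier to follow. This is only an illustration under stated assumptions, not the code from the download: the function names (`map_instance`, `reduce_counts`, `gain_ratios`), the representation of a node as a dict of attribute index to required value, and the toy dataset are all hypothetical. The queue-driven loop over nodes is not shown.

```python
import math
from collections import Counter, defaultdict


def map_instance(instance, node_conditions, class_index):
    """Map step: emit (attribute index, value, class label) keys.

    node_conditions is a hypothetical dict {attribute index: required value}
    describing the path from the root to CurrentNode.
    """
    # Skip instances that do not belong to CurrentNode.
    if any(instance[i] != v for i, v in node_conditions.items()):
        return []
    label = instance[class_index]
    # Emit one key per uncovered attribute (not the class, not already split on).
    return [
        (i, value, label)
        for i, value in enumerate(instance)
        if i != class_index and i not in node_conditions
    ]


def reduce_counts(keys):
    """Reduce step: count occurrences of each (index, value, label) key."""
    return Counter(keys)


def entropy(counts):
    """Shannon entropy (base 2) of a collection of positive counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def gain_ratios(counts):
    """Gain ratio of every uncovered attribute, from the reduced counts."""
    # per_attr[index][value][label] = count
    per_attr = defaultdict(lambda: defaultdict(Counter))
    for (index, value, label), n in counts.items():
        per_attr[index][value][label] += n

    ratios = {}
    for index, by_value in per_attr.items():
        class_totals = Counter()
        for label_counts in by_value.values():
            class_totals.update(label_counts)
        total = sum(class_totals.values())
        sizes = [sum(lc.values()) for lc in by_value.values()]
        # Weighted entropy of the partitions produced by splitting on this attribute.
        remainder = sum(
            size / total * entropy(lc.values())
            for size, lc in zip(sizes, by_value.values())
        )
        gain = entropy(class_totals.values()) - remainder
        split_info = entropy(sizes)
        ratios[index] = gain / split_info if split_info > 0 else 0.0
    return ratios


# Toy dataset (made up for illustration): outlook, temperature, class label.
data = [
    ("sunny", "hot", "no"),
    ("sunny", "mild", "no"),
    ("rain", "hot", "yes"),
    ("rain", "mild", "yes"),
]
keys = [k for row in data for k in map_instance(row, {}, class_index=2)]
print(gain_ratios(reduce_counts(keys)))  # {0: 1.0, 1: 0.0}
```

At the root node (empty `node_conditions`), attribute 0 separates the classes perfectly and gets the highest gain ratio, so it would be chosen for the split. In a real Hadoop job the map and reduce functions would run as separate distributed tasks, and only the small table of counts would reach the driver that computes the gain ratio.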
Here you can download sample code for the C4.5 algorithm on Hadoop. It is only sample code, without any optimization, and can be used to learn how to code data mining algorithms using the Hadoop MapReduce paradigm.
Download Source Code