If you are interested in tree based clasification models here is an application in Excel which lets you build such a model in Excel.
A note for the return visitors:
The Classification tree tool has been updated recently. The new version has now the rule generation capability which was not present earlier in the tool. You might like to download the updated version.
The earlier version
without the rule generation capability is still available for you to download.
If you also happen to be interested in Neural Networks try these
small Excel tools for Neural Network based -
Prediction ,
Classification
and
Clustering.
Once you download and unzip the file, you will find one Excel file inside the zip file. Before running the applications, I suggest that you go through the ReadMe worksheet. That contains brief instruction on how to run the application.
A small
tutorial
on the tool is provided in this site which walks you through one complete modeling step using the tool. You might like to study it.
Also look at the ProductionData worksheet of this
example Excel file
to see how you can classify new observations using the Tree built
by this tool.
An advice for the user - This application heavily uses Excel's
Advanced Filter and Pivot Table features. Before downloading the
application, please check whether your version of Excel
has these two features. If not, then you cannot run this
application on your version of Excel.
A few words about the Application ...
There are a host of very powerful and popular softwares
available in market which builds Classification / Decision trees.
Few giants in this area are -
CART,
See 5,
Answer Tree .
Apart from these biggies - there are host of others and
some of them are for free.
Here is another one - freely downloadable, easy to use (at least
that's what the developer would like to claim), performance
not too bad. You the user have the last say on whether it is
worth your time, effort, bandwidth, etc. etc. To decide whether
it's worth downloading, I suggest you read through the
rest of the page.
If you are looking at Decision trees for the first time, you need
to familiarize yourself a bit on the jargons. One easy way is
to search in Google. There are tons of material on this.
Other way is to get some introductory books from your nearest
library. While you are learing the basics and the jargons,
try downloading and playing with this tool. If nothing else -
just for fun.
If you are already using commercial decision tree softwares
like CART, See 5, SPSS - then perhaps you won't find much use
with this tool. Reason being - its capabilities are severly
limited compared to those commercial softwares.
This however may be good for a beginner - to have some initial idea
about Tree based modeling. It is not very big, doesn't need any other stuff to be installed on your machine (assuming you
already have MS-Excel on your machine), does not take much time
to learn how to use the tool, not too much hassle to prepare
the data in 'right format' to be used for the tool.
You may think of using the tool if ....
-
You have read and heard about Decision Trees, didn't quite get a chance to work with a tool that does the modeling for you.
Here is a tool, freely downloadable, easy to learn and use.
-
You are teaching Classification / Decision Trees models
to your students and thought that a small easy tool would
have come real handy for your classroom demo. A tool that
does not require special installations to be done , does not
require licences to be bought, quick and easy to distribute
on students' machines, does not requires data files to be
prepared in special formats and can quickly run a bunch
of examples for the students. You might consider this tool
for your class.
-
You are an analyst in the industry - heard a lot about Data Mining, Tree Classifiers and stuff but currently have no
tools to just get you started. You may buy a commercial software
in future but before that you wish you could have some idea
on what it does. Then try downloading and playing with the
tool - see if it helps you to know and experience the stuff
a little bit more.
Here are some nice features of the application ......
-
The columns for input and output variables in your data need not reside in contiguous columns. What I mean is this. Assume that you have ten columns in your data set. You can tell the application that columns 2,3, 5 and 10 are your predictors, column 8
is the Class variable and the rest of the columns you don't want to use This comes in very handy if you want to try out various models on the same data set keeping and droping different predictors for different models.
-
You can break up your data set into
training and
test set.
You can build your model on the training set and test its performance
on test set. You can either
randomly select a portion
of your data as test set or you can specify that the last few rows of your data should be treated as the test set.
-
Application has some limited capability of handling
missing data . For continuous variable, any non-number in any of your data column will be treated as missing value. Application will replace all the missing
values in a column by respective column medians.
For categorical variables, any blank cell or
any cell containing an Excel error will be treated as missing value and application will replace it with the most frequently occuring category.
-
Application has capability of growing a large tree and
then pruning
it back to generate a stable tree.
-
After the model is built, application reports Confusion Matrices on your training / test data.
-
The application has a
viewer sheet which lets you see graphically the
class distribution in each node of the tree.
-
The application generates a set of decision rules. These are IF - THEN rules to describe the classes. The rules give you insight about the structure of the classification problem. For each rule, three performance measures ( support, confidence, capture) are reported. These measures can be used to asses the quality of the rules. Application also generates a rule plot. Rule plot visually presents the quality of the rules. The plot also tells you how effective are the rules in separating out the classes.
-
Once you have built the tree, you have the option of saving the model in a separate workbook.
This might not seem like a big deal. After all we all know how to cut-paste from one workbook to another. However, this option means that you have a tree model nicely set up in an Excel file, cells connected with formulae, all ready to go.
You can just feed-in a new input in a specific range of cells and you get the model output in another range of cells. Though it's not a big deal, setting up these formulae
could be a bit of nuisance - which the application performs happily. I guess you will better appreciate this feature, when you actually use it. You can distribute this new workbook to others. That serves
as a fully functional tree model. They can use it to classify their
data. They need not have the application that built the model with them
to be able to use the tree model that you built.
Once you save the final tree in a new workbook, with a little bit of
work, you can set it up to classify new observations using the
tree you have just built. To see how to do it, study
the ProductionData worksheet of this
example Excel file containing the
output of a Tree building exercise.
When you save the tree in a new workbook, along with the tree
following things are also saved.
- The Training and Test data use to build and validate the tree.
- The predicted classes for the Training / Test data.
- The different settings that were used to build the tree.
- Graphical representation of the class distributions in various nodes of the tree.
Stuff that are not so nice ......
This is really a long list, which tells you that the capability of the
tool is rather limited. I will mention some important ones.
-
Rule Generation - Application generates the decision rules. However, the rules are basically for viewing and gaining the insight into the classification problem. Currently the application has no capability of using the rules to classify new data points. C4.5 provides a very smart way of combining the decisions from each individual rules to come up with a single decision for a new data point. This is a not-so-easy task since for a new data point different rules may predict different classes. The trick is how to reconcile and combine the decisions from individual rules. This capability is absent in the tool. If I can manage it, I will try to include this capability in some form in future - but no guarantees.
-
Missing Value handling - is rather
crude in this application. Before growing the tree, for a given
predictor - it globally replaces the missing values.
Ideally it should be done at tree growing stage. Depending on
which node you are in currently - you should replace the missing value
with appropriate number. This does not happen currently.
The final tree that the application builds, have no capability
of hadling missing value. C4.5 - C5.0 specifies sophisticated
techniques for classification using a tree when some predictor values
are missing. That capability is absent in this tool.
-
Missclassification Cost handling -
Application have no capability to handle missclassification
cost. In effect, it assumes same cost for all missclassifications.
Also the application can't adjust for over sampling / under sampling
of class categories.
-
Size restriction - The application can handle at most 50 predictors variables. For the categorical predictors, you cannot have more than 20 categories. The maximum number of observations (= training set + test set) it can handle is 10,000.
-
Speed - This is an implementation of the very basic C4.5 algorithm. VBA implementation is perhaps a lot slower than C / C++ implementation. It has been developed and tested in a machine with 128MB RAM and in Excel 97.
For about 1000 training records, 5 predictors it takes about
a minute. For 10,000 training records, 9 predictors, it takes about
15 minutes.
For the more techy and inquisitive guys ......
-
Algorithm Application uses
C4.5 algorithm by
Ross Quinlan
.
-
Node Splitting Criterion
Application uses Entropy
based criterion to select the split.
While growing the tree, at any point a predictor is chosen to
split a node such that the Information Gain is maximized after
the split. As specified in C4.5, it actually uses the
Gain Ratio (= Gain / Split Info)
to choose the split. You have the option of using Gain instead to Gain Ratio, in which case the Split Info is foricbly set to 1.
-
Stopping Criteria
Application stops splitting a node and declares it as a
leaf node is any one of the following ctirerion is met.
-
Number of records in the node is less than some pre-specified limit.
-
Purity of the node is more than some pre-specified limit p.
This means that the proportion of records in the node with
class equal to the majority class is p or more.
-
Depth of the node is more than some pre-specified limit.
-
Predictor values for all records are identical.
-
Tree Pruning
is based on the pessimistic error
rate at the node. At the node a 50% confidence interval of the error rate is computed and its upper limit is taken as the pessimistic error rate.
If the pessimistic error rate of a node is less than that of the
subtree rooted at that node, the node is pruned. If we fail to prune a node - none of its predecessors is pruned.
-
Rule Generation
is according to the methods mentioned in C4.5. For each leaf node in the tree, the path from root to that node gives a rule. Thus if we have k nodes in the tree we start with a set of k rules. Then individual rules are pruned by dropping cluases one by one from that rule. The decision on whether or not to drop a clause is taken based on the outcome of a statistical independence test. The test tries to see whether keeping or droping a clause is independent of the final decision of the rule. If it is independent, then that shows that the clause is NOT contributing towards the final decision and hence we drop the clause to simplify the rule. The independence tests used are Chi-square test and Fisher's exact test.
That's it. I have told you all that's there to say about this
application. Now if you are not already put off by the long
list of drawbacks - go ahead and use it have fun.
As the developer I think that the application is good for small experiments and definitely pretty good just for fun. Wish you a happy experience with it.
I would really appreciate if you could
mail me
your comments and feedback.
|
Home |
Resume |
Thesis Abstract |
PET |
Neural Network in Excel |
Clustering in Execl |
Classification Tree in Excel |