(c) 2002 Sholom M. Weiss and Nitin Indurkhya
Documentation
The Rule Induction Kit (RIK) is a complete software package for inducing highly compact decision rules from data. Unlike complex numerical models, these are simple logical rules that are often highly predictive. For example, in a medical application, a typical rule might state that high blood pressure or being overweight suggests an increased risk of heart attack. The objective is to determine the best set of rules for prediction and classification, where "best" means the smallest number of rules with a near-minimum error. All examples in the documentation use the iris data set, a sample of 150 cases that is included in the package.
- Introduction
- Installation
- Data Format
- Running RIK
- RIK Modes
- Multi-class applications
- Interpreting the summary table
- Parameters in properties file
- Background publications on RIK
Introduction
RIK is a standalone rule-induction program that induces compact rules from data.
Important Note: When using RIK, just type the program name with no arguments, and the program will display a description of the correct syntax.
Installation
RIK is available for PC-Linux and Windows. A command-line interface is used: the program name ("rik") is typed, followed by a list of arguments. Complete documentation is available offline in the file rik.html.
Linux Installation:
- Create a directory named rik ("mkdir rik")
- Download the file to rik and save as rik.tar.Z
- Type "uncompress rik.tar.Z"
- Type "tar -x -v -f rik.tar"
- Add rik to your path
Windows Installation:
Working in an MS-DOS window:
- Create a directory named rik ("mkdir c:\rik")
- Download the file to \rik and save as installrik.exe
- "cd \rik" and run installrik.exe by typing "installrik"
- Add directory \rik to your path and set the environment variable LFN to y.
- Windows 9x: edit autoexec.bat using "notepad c:\autoexec.bat"; add "c:\rik;" to the PATH line. (Alternative: add new line "PATH c:\rik;%PATH%" at the end of the file.) Also add new line "set LFN=y" (enables long file names). A single space follows the words PATH and set. These lines should contain no additional spaces.
- Windows NT/2000/XP: open the control panel (system folder/environment). Add "c:\rik;" (no spaces) to the environment variable PATH. Create the environment variable LFN and set it to a value of y.
- Notes: The RIK program is run from an MS-DOS (or command-prompt) window. The path and environment changes may not take effect until the next login or restart. A disk other than c: may be used.
Cleaning Up:
After successful installation, the following files may be deleted or moved to backup storage:
- For Linux: rik.tar
- For Windows: installrik.exe
Data Format
Logical Data Organization
Name         Age   Income    Loan Amount   Loan Repaid
John Smith   33    $75,000   $50,000       yes
...          ...   ...       ...           ...
Last case    45    $45,000   $25,000       no
===>
Standard Spreadsheet Form
33    75000   50000   1     0
...   ...     ...     ...   ...
45    45000   25000   0     1
RIK processes data in a format we call standard spreadsheet form. To arrive at standard spreadsheet form, labels and text are removed, leaving a spreadsheet full of numbers (with the exception of a "?" for missing values). Each row corresponds to a case, and each column corresponds to some measurement taken for each case. Three types of values for data-fields are allowed:
- ordered numerical, such as a continuous variable like age
- true or false: 1=true; 0=false
- missing value designated by a single "?" (without the quotes).
Real numbers are rounded to 2 decimal places. If more precision is necessary, multiply the values by a scale factor (or convert to integers). An example of data in standard form is the file named iris. This dataset consists of 150 rows (cases) and 7 columns. Each class is represented as a separate output-column with true-or-false values. The spreadsheet has 4 input-columns (features) and 3 output-columns (classes). For classification goals, the number of output-columns is the same as the number of classes. Predictive performance is measured by error rate.
An optional names file can be provided. Each name is on a separate line, and there should be as many names as columns in the data file. The names file should be called "iris.names" if the data file is "iris" or "iris.dat". If the names file is not provided, the feature names F1, F2, etc. and the class names C1, C2, etc. will be used to refer to features and classes in the induced rules.
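The conversion to standard spreadsheet form can be sketched in Python. This is only an illustration under stated assumptions: the helper names (to_standard_form, default_names), the record layout, and the use of single spaces between fields are not taken from RIK; the documentation does not state which field separator RIK expects.

```python
def format_number(value):
    """Strip currency formatting and round to 2 decimal places."""
    number = round(float(str(value).replace("$", "").replace(",", "")), 2)
    # Drop trailing zeros so 75000.00 is written as 75000.
    return f"{number:.2f}".rstrip("0").rstrip(".")


def to_standard_form(records, feature_keys, class_key, class_values):
    """Turn labelled records into rows of numbers with one 0/1 column per class."""
    rows = []
    for record in records:
        fields = []
        for key in feature_keys:
            value = record.get(key)
            # A missing measurement is written as a single "?".
            fields.append("?" if value is None else format_number(value))
        for class_value in class_values:
            fields.append("1" if record[class_key] == class_value else "0")
        # Assumption: fields are separated by single spaces.
        rows.append(" ".join(fields))
    return rows


def default_names(n_features, n_classes):
    """Names used when no names file is supplied: F1, F2, ... then C1, C2, ..."""
    features = [f"F{i}" for i in range(1, n_features + 1)]
    classes = [f"C{i}" for i in range(1, n_classes + 1)]
    return features + classes


loans = [
    {"age": 33, "income": "$75,000", "loan": "$50,000", "repaid": "yes"},
    {"age": 45, "income": "$45,000", "loan": "$25,000", "repaid": "no"},
]
rows = to_standard_form(loans, ["age", "income", "loan"], "repaid", ["yes", "no"])
```

A names file for the same data would hold one name per line, in column order, for example the list returned by default_names(3, 2).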
Running RIK
Type "rik" without any arguments to see the modes of operation. A file "rik.properties" determines some secondary characteristics of rule learning. If the file doesn't exist, it is created with default parameter values. This file is described in more detail later in the documentation.
The induced ruleset is written to standard output which can be redirected to a file; summary information about error-estimates and pruned rulesets is written to standard error (which can therefore be saved in a separate file).
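When RIK is driven from a script rather than a shell, the two output streams can be saved separately in the same way. The sketch below uses Python's standard subprocess module; the function name run_and_split and the file name iris.summary are hypothetical.

```python
import subprocess


def run_and_split(command, ruleset_path, summary_path):
    """Run a command, saving standard output and standard error to separate files."""
    with open(ruleset_path, "w") as out, open(summary_path, "w") as err:
        completed = subprocess.run(command, stdout=out, stderr=err, check=False)
    return completed.returncode


# Example call (requires rik on the path and the iris data in the current directory):
# run_and_split(["rik", "-r", "10", "150", "4", "3", "iris"], "iris.rul", "iris.summary")
```

After such a call, iris.rul would hold the induced ruleset and iris.summary the table of pruned rulesets.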
The following log shows how rik can be used on sample data:
# This annotated log shows how rik can be run on sample data.
# The " $" is the linux shell prompt and precedes the user input.
# The "#" precedes comments (such as these).
# Output lines from rik are without a "#" or " $" character at the front
#
# The following line generates rules for the iris data, estimates the
# performance of different rulesets using 10-fold cross-validation,
# and prints out the "best" ruleset.
# Assume that rik.properties exists in the current directory.
#
$ rik -r 10 150 4 3 iris >iris.rul
Table of pruned rule sets
(* = minimum error; ** = within 1-SE of minimum error)
RSet  Rules  Vars  Train Err  Test Err  Test SD  MeanVar  Err/Var
1     5      10    0.0000     0.0400    0.0160   9.3      0.00
2**   3      4     0.0200     0.0267    0.0132   4.0      0.50
3     3      3     0.0400     0.0600    0.0194   3.0      3.00
4     2      2     0.3333     0.3333    0.0385   2.0      44.00
5     1      1     0.6667     0.6667    0.0385   1.0      50.00
K-fold resampling, k=10
**********************************************************************
Selected rule set
1. F3<2.45 --> dx1
2. F4<1.65 & F3<4.95 --> dx2
3. [TRUE] --> dx3
#
# The ruleset is just a text file and can be readily viewed and edited.
#
$ cat iris.rul
Ruleset made using resampling mode. [0.5,0.02,4]
dx1
dx2
dx3
F3<2.45 --> dx1
F4<1.65 & F3<4.95 --> dx2
[TRUE] --> dx3
#
# We can apply the rules to new data (in this case though, we simply
# apply the rules to the training data).
#
$ rik -a iris.rul 150 4 3 iris >iris.res
case-by-case output on standard-output as: actual predicted
Table of pruned rule sets
(* = minimum error; ** = within 1-SE of minimum error)
RSet  Rules  Vars  Train Err  Test Err  Test SD  MeanVar  Err/Var
1**   3      4     0.0200     0.0200    0.0114   0.0      0.50
Saved rule set
**********************************************************************
Selected rule set
1. F3<2.45 --> dx1
2. F4<1.65 & F3<4.95 --> dx2
3. [TRUE] --> dx3
#
# The case-by-case output shows actual and predicted classes.
#
$ cat iris.res
dx1 dx1
dx1 dx1
dx1 dx1
dx1 dx1
dx1 dx1
# ........
# all 150 lines are not shown here
dx3 dx3
dx3 dx3
dx3 dx3
dx3 dx3
dx3 dx3
#
# note that rik.properties can be edited and the program re-run
# with new parameters.
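Because the ruleset is plain text, it can also be read by other programs. The following Python sketch parses the rule lines of iris.rul as shown in the log and applies them in order, returning the class of the first rule that fires. It is an illustration only: the function names are hypothetical, only the "<" comparison appears in the log (the other operators are an assumption), and missing values are not handled.

```python
import operator
import re

# Assumption: only "<" is seen in the sample ruleset; the rest are guesses.
OPERATORS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}
CONDITION = re.compile(r"^(\w+)\s*(<=|>=|<|>)\s*(-?\d+(?:\.\d+)?)$")


def parse_rule(line):
    """Parse 'F4<1.65 & F3<4.95 --> dx2' into (conditions, class_name)."""
    left, right = line.split("-->")
    conditions = []
    if left.strip() != "[TRUE]":  # [TRUE] is the default rule with no conditions
        for part in left.split("&"):
            match = CONDITION.match(part.strip())
            if match is None:
                raise ValueError(f"unrecognized condition: {part!r}")
            name, op, threshold = match.groups()
            conditions.append((name, op, float(threshold)))
    return conditions, right.strip()


def parse_ruleset(text):
    """Keep only rule lines; the header and class-name lines contain no '-->'."""
    return [parse_rule(line) for line in text.splitlines() if "-->" in line]


def classify(rules, case):
    """Return the class of the first rule whose conditions all hold."""
    for conditions, class_name in rules:
        if all(OPERATORS[op](case[name], value) for name, op, value in conditions):
            return class_name
    return None


RULE_FILE = """Ruleset made using resampling mode. [0.5,0.02,4]
dx1
dx2
dx3
F3<2.45 --> dx1
F4<1.65 & F3<4.95 --> dx2
[TRUE] --> dx3
"""
rules = parse_ruleset(RULE_FILE)
prediction = classify(rules, {"F1": 5.1, "F2": 3.5, "F3": 1.4, "F4": 0.2})
```

For the case above, F3 is below 2.45, so the first rule fires and the prediction is dx1.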
RIK Modes
Covering set of rules that makes no errors on any case
Usage: rik -q #cases #features #classes datafile >ruleset
Example: rik -q 150 4 3 iris >iris.cover
Produce summary table of different-sized, pruned rules
Usage: rik -p #cases #features #classes datafile >ruleset
Example: rik -p 150 4 3 iris >iris.cover
Note: The covering ruleset is written to standard output. The table of pruned rulesets is written to standard error.
Randomly selected training cases
Usage: rik -h pct #cases #features #classes datafile >ruleset
Example: rik -h 66.7 150 4 3 iris >iris.ruls
Note: Cases not used for training are used for testing. The "best" ruleset (based on test-set performance) is output.
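The effect of the pct argument can be pictured with a short Python sketch. It is not RIK's code: the function name, the rounding of the training-set size, and the seeded shuffle are all assumptions made for illustration.

```python
import random


def holdout_split(n_cases, pct, seed=0):
    """Randomly assign pct percent of the cases to training, the rest to testing."""
    # Assumption: the training-set size is rounded to the nearest whole case.
    n_train = round(n_cases * pct / 100.0)
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


train, test = holdout_split(150, 66.7)
```

With 150 cases and pct set to 66.7, this gives 100 training cases and 50 test cases.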
Separate test set of cases
Usage: rik -t n tfile #cases #features #classes datafile >ruleset
Example: rik -t 50 iris.tst 150 4 3 iris >iris.ruls
Resampling with k-fold cross-validation
Usage: rik -r k #cases #features #classes datafile >ruleset
Example: rik -r 10 150 4 3 iris >iris.ruls
Note: The rules are obtained using all the data. Resampling is used to obtain test-set error-estimates.
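The general idea of k-fold cross-validation is sketched below in Python. This is the textbook procedure, not RIK's implementation; the function names and the count_test_errors callback (which stands in for "induce rules on the training cases and count mistakes on the held-out cases") are hypothetical.

```python
import random


def kfold_indices(n_cases, k, seed=0):
    """Split case indices 0..n_cases-1 into k disjoint folds of near-equal size."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    return [sorted(indices[i::k]) for i in range(k)]


def cross_validated_error(n_cases, k, count_test_errors, seed=0):
    """Estimate the error rate: every case is held out for testing exactly once."""
    errors = 0
    for held_out in kfold_indices(n_cases, k, seed):
        held = set(held_out)
        training = [i for i in range(n_cases) if i not in held]
        # Hypothetical callback: learn on `training`, count errors on `held_out`.
        errors += count_test_errors(training, held_out)
    return errors / n_cases


folds = kfold_indices(150, 10)
```

For the iris data with k=10, each fold holds 15 cases, so each of the ten resampling runs trains on 135 cases and tests on 15.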
Use linear discriminants as features
Usage: rik -l options #cases #features #classes datafile >ruleset
Example: rik -l -r 10 150 4 3 iris >iris.ruls
Note: This option is used in conjunction with the previous modes. Discriminant features are obtained one per class from all the data. The features are obtained by deriving Fisher's linear discriminant and obtaining its true/false classification for each class. These (boolean) features are treated just like the other features.
Output ruleset n (from table of prunes)
Usage: rik -s n options #cases #features #classes datafile >ruleset
Example: rik -s 2 -r 10 150 4 3 iris >iris.ruls
Note: This option is used in conjunction with the previous modes. Use this option to manually select a ruleset different from the one selected automatically as "best". Typically used in a second run of rik (the first run produces the table of pruned rulesets for examination).
Apply rules to new cases
Usage: rik -a rulefile #cases #features #classes datafile >results
Example: rik -a iris.ruls 150 4 3 iris >result.log
Note: case-by-case output: actual-class, predicted-class. Rulefile is an ASCII file of rules generated by rik.
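The case-by-case output can be summarized with a few lines of Python. The sketch assumes each result line holds exactly two whitespace-separated class names (actual, then predicted), as in the sample log; the function name is hypothetical.

```python
from collections import Counter


def summarize_results(results_text):
    """Return (errors, cases, confusion counts) from 'actual predicted' lines."""
    confusion = Counter()
    for line in results_text.splitlines():
        fields = line.split()
        if len(fields) != 2:  # skip blank or unexpected lines
            continue
        actual, predicted = fields
        confusion[(actual, predicted)] += 1
    cases = sum(confusion.values())
    errors = sum(n for (actual, predicted), n in confusion.items() if actual != predicted)
    return errors, cases, confusion


sample = "dx1 dx1\ndx2 dx2\ndx2 dx3\ndx3 dx3\n"
errors, cases, confusion = summarize_results(sample)
```

Dividing errors by cases gives the error rate; the confusion counts show which classes are mistaken for which.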
Multi-class applications
The rules are generated as a decision list in which the classes are ordered. The last class is always the default class, and no rules are induced for it. Rules are invoked in the order of the induced rule set, and the first rule to fire wins. The class rules are induced in the specified class order. For two-class problems, both orderings can be tried. For more than two classes, here are good strategies for class ordering:
- Use your knowledge of the application.
- Without prior knowledge, order by decreasing class size, i.e. largest first.
- For n classes, get results for each class vs. not class. Order by minimum predictive error for each class. If time is a problem, a small random subset of cases can be used for each of the n class subproblems.
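The second strategy, ordering by decreasing class size, can be sketched in Python as follows. The function names and the in-memory row layout (features first, then one 0/1 column per class) are assumptions for illustration; the prevalence-sort=1 setting in rik.properties is described as ordering the classes by decreasing prevalence, so this manual reordering may not be needed in practice.

```python
def class_order_by_size(rows, n_features, n_classes):
    """Return class column positions sorted by decreasing number of cases."""
    counts = [0] * n_classes
    for row in rows:
        for c in range(n_classes):
            if row[n_features + c] == 1:
                counts[c] += 1
    # sorted() is stable, so classes of equal size keep their original order.
    order = sorted(range(n_classes), key=lambda c: -counts[c])
    return order, counts


def reorder_classes(row, n_features, order):
    """Rewrite one row so its output columns follow the new class order."""
    return row[:n_features] + [row[n_features + c] for c in order]


# Two features followed by three 0/1 class columns.
data = [
    [1.0, 2.0, 1, 0, 0],
    [1.5, 2.5, 0, 0, 1],
    [2.0, 3.0, 0, 0, 1],
    [2.5, 3.5, 0, 1, 0],
    [3.0, 4.0, 0, 0, 1],
]
order, counts = class_order_by_size(data, 2, 3)
reordered = [reorder_classes(row, 2, order) for row in data]
```

Here the third class has the most cases, so its column is moved to the front of the output columns.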
Interpreting the summary table
A summary table of results is printed to standard error (usually the screen). Each rule set is numbered under the column "RSet". A single "*" marks the rule set with the minimum error rate. A "**" marks the rule set that is at the minimum or very close to it (within 1 standard error with the default settings) but may be simpler than the minimum-error rule set. MeanVar is the average number of variables in the resampled rule sets that approximate in size the rule set for the full data. Err/Var is the number of new errors per variable introduced when the previous rule set was pruned to the smaller size.
TABLE OF RULE SETS
RSet  Rules  Vars  Train Err  Test Err  Test SD  MeanVar  Err/Var
1     5      10    0.0000     0.0400    0.0160   9.3      0.00
2**   3      4     0.0200     0.0267    0.0132   4.0      0.50
3     3      3     0.0400     0.0600    0.0194   3.0      3.00
4     2      2     0.3333     0.3333    0.0385   2.0      44.00
5     1      1     0.6667     0.6667    0.0385   1.0      50.00
K-fold resampling, k= 10
**********************************************************************
Selected Rule Set
1. F3<2.45 --> Class1
2. F4<1.65 & F3<4.95 --> Class2
3. [TRUE] --> Class3
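The columns of the table can be checked by hand. The Python sketch below reproduces three of them from the figures shown above. It is an illustration, not RIK's code: the function names are hypothetical, the selection rule assumes that size is compared by Vars (then Rules) and that the standard error of the minimum-error rule set is used, and the binomial formula for Test SD is an observation that happens to match the table rather than documented behaviour.

```python
import math

# (RSet, Rules, Vars, Train Err, Test Err, Test SD), copied from the table.
TABLE = [
    (1, 5, 10, 0.0000, 0.0400, 0.0160),
    (2, 3, 4, 0.0200, 0.0267, 0.0132),
    (3, 3, 3, 0.0400, 0.0600, 0.0194),
    (4, 2, 2, 0.3333, 0.3333, 0.0385),
    (5, 1, 1, 0.6667, 0.6667, 0.0385),
]


def select_rule_set(table, se=1.0):
    """Smallest rule set whose test error is within se standard errors of the minimum."""
    best = min(table, key=lambda row: row[4])
    limit = best[4] + se * best[5]
    candidates = [row for row in table if row[4] <= limit]
    # Assumption: "smallest" means fewest variables, then fewest rules.
    return min(candidates, key=lambda row: (row[2], row[1]))[0]


def errors_per_variable(table, n_cases):
    """New training errors per removed variable for each pruning step."""
    ratios = []
    for previous, current in zip(table, table[1:]):
        new_errors = round((current[3] - previous[3]) * n_cases)
        removed_vars = previous[2] - current[2]
        ratios.append(new_errors / removed_vars)
    return ratios


def binomial_sd(error_rate, n_cases):
    """Standard error of an error rate estimated from n_cases test cases."""
    return math.sqrt(error_rate * (1.0 - error_rate) / n_cases)


selected = select_rule_set(TABLE, se=1.0)
ratios = errors_per_variable(TABLE, 150)
```

With these figures the selected rule set is number 2, the one marked "**": rule set 1 has a test error of 0.0400, just above the limit of 0.0267 + 0.0132 = 0.0399. The Err/Var values for rule sets 2 to 5 come out as 0.5, 3, 44 and 50, for example 3 new errors spread over the 6 variables removed between rule sets 1 and 2.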
Parameters in properties file
The file "rik.properties" is used to further control the program. If this file is not present, the program will create one with default values. The user can edit this file to change the defaults. The file contains helpful comments to assist the user in changing the defaults. The defaults can always be regenerated by deleting the file and letting the program recreate it. The default rik.properties is shown below:
# default options. this file created by rik.
# comment lines have a '#' as the first character.
# other lines are of the type: option=value
#
# maxv (integer >=1) specifies the degree of smoothing (value reduction).
# high values result in less smoothing.
# if maxv is set to number of cases, no smoothing is done.
#
maxv=50
#
# se (real >=0) specifies how to define the "best" ruleset.
# se=f, "best" is the smallest within f std errors of min test-error ruleset.
# se=0, the "best" ruleset is the min test-error ruleset.
#
se=1
#
# prevalence-sort=2 orders the classes by increasing prevalence.
# prevalence-sort=1 orders the classes by decreasing prevalence.
# prevalence-sort=0 keeps the order as given
#
prevalence-sort=0
#
# short-rules=1 if quick short rules should be obtained.
# short-rules=0 if normal rules should be obtained (default).
#
short-rules=0
#
# df-threshold (real >0) is for selecting features in discriminant.
# high threshold ==> fewer features in discriminant.
#
df-threshold=4
#
# maxrul (integer >1) specifies the maximum number of rules generated.
#
maxrul=5000
#
# optimization-threshold (real >=0) is for optimizing pruned rulesets.
# higher values imply less frequent optimization and
# the program will run faster, but may produce weaker results.
# specifying -1 computes the default value (ncases/200).
#
optimization-threshold=-1
#
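Since the file uses plain option=value lines with "#" comments, it is easy to read from a script. The Python sketch below is a hypothetical reader, not part of RIK; it also resolves the -1 setting of optimization-threshold to ncases/200 as described in the file's comments.

```python
def parse_properties(text):
    """Read option=value lines, ignoring blank lines and '#' comments."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        options[key.strip()] = value.strip()
    return options


def optimization_threshold(options, n_cases):
    """A value of -1 stands for the default, ncases/200."""
    value = float(options.get("optimization-threshold", "-1"))
    return n_cases / 200 if value == -1 else value


SAMPLE = """# comment lines have a '#' as the first character.
maxv=50
se=1
prevalence-sort=0
optimization-threshold=-1
"""
options = parse_properties(SAMPLE)
threshold = optimization_threshold(options, 150)
```

For the 150 iris cases, the default optimization threshold works out to 150/200 = 0.75.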
Background publications on RIK
Weiss, S. and Indurkhya, N. Optimized Rule Induction. IEEE Expert, volume 8, number 6, pages 61-69, 1993.