EDM: Enterprise Data-Miner

(c) 1997-2002  Sholom M. Weiss and Nitin Indurkhya

The Enterprise Data-Miner (EDM) is a collection of standalone programs that perform most data-mining tasks. All examples below use the iris dataset, a sample of 150 cases that is included in the package. A variety of other datasets are also provided.




EDM Java Version

Take advantage of the latest software technology with a Java version of EDM.

Run EDM as a JDK1.1 GUI application

Certified 100% Pure Java

Run EDM from Netscape Communicator

Run EDM from Internet Explorer

Run client-based EDM on server-based data

Introduction

The EDM programs are basic building blocks that can be used to perform complex predictive data mining. In the standard version, scripts are provided as part of the package, and you may compose your own scripts. In the Java version, the scripts are replaced by Java programs that do not invoke a shell or its utilities.

Important Note: When using the EDM programs, just type the program name with no arguments, and the program will display a description of the correct syntax.

Installation

EDM is available for Unix, Windows 9x/NT/2000 and Java. The platforms for the Unix versions are listed in the web page quick summary. Functionally, EDM is the same across all platforms. A command line interface is used: A program name is typed followed by a list of arguments. The Java version also has a Graphical User Interface (GUI) and can be fully run as an applet using Netscape or Internet Explorer. Complete documentation is available offline in the file edm.html.

Unix Installation:

  1. Create a directory named edm ("mkdir edm")
  2. Download the file to edm and save as edm.tar.Z
  3. Type "uncompress edm.tar.Z"
  4. Type "tar -x -v -f edm.tar"
  5. Add edm to your path

Windows Installation:

Working in an MS-DOS window:

  1. Create a directory named edm ("mkdir c:\edm")
  2. Create a directory named tmp ("mkdir c:\tmp")
  3. Download the file to \edm and save as installedm.exe
  4. "cd \edm" and run installedm.exe by typing "installedm"
  5. Add directory \edm to your path and set the environment variable LFN to y.
  6. Optionally, change the MS-DOS window "memory property" for DPMI to 65535. This is not required but is advantageous.

Optional GNU Unix Utilities for Windows:

The EDM programs are standalone programs that run without Unix utilities. However, the scripts are processed by Unix utilities. The scripts are the files that have no extension (such as "exe"). To enable running the scripts on a PC, run installgnu.exe (i.e. type "installgnu") from within the edm directory. This extracts high-quality GNU Unix-like (djgpp-version) utility programs that can run scripts. Unlike EDM, these GNU programs are free and may be distributed by anyone free of charge. See \edm\gnulicense for full details. Installgnu.exe contains just the unmodified subset of utilities used by our scripts. You may retrieve the full set of utilities, including source, from the djgpp link.

Java Installation:

EDM can be run as either a Java applet or a Java application. To run as an applet, all you need is a browser with an up-to-date Java Virtual Machine conforming to JDK 1.1. Either Netscape Communicator or Internet Explorer is sufficient to run EDM without an external Java Development Kit. EDM can also run as a standalone Java application with either a command line interface or a GUI. An application (not an applet) requires the prior installation of a Java virtual machine, such as the Java Development Kit (JDK 1.1) or the Java Runtime Environment (JRE).
  1. Create a directory named edm, and download the file to directory edm.
  2. Add the full path for edm.zip to the environment variable CLASSPATH. For example with Windows 9x, add "set classpath=c:\edm\edm.zip" to the autoexec.bat file. DO NOT UNZIP edm.zip.
  3. To run as an applet, open the local file "edmapp" within the browser. To run as an application using a Java Development Kit, either (a) type "java edm" for the GUI or (b) prefix the standard command line interface with the word "java".

Datasets

The datasets provided with EDM are downloaded separately from http://www.data-miner.com/download.html. Save as edmdata.zip and unzip in a separate directory.

Cleaning Up:

Following successful installation, the following files may be deleted or moved to backup storage:

Java EDM

The Java version can be run as an applet within a browser (Netscape or Microsoft IE) or as an application (JDK).

Data Format

Logical Data Organization

Name        Age  Income   Loan Amount  Loan Repaid
John Smith  33   $75,000  $50,000      yes
...         ...  ...      ...          ...
Last case   45   $45,000  $25,000      no

===>

Standard Spreadsheet Form

33   75000  50000  1    0
...  ...    ...    ...  ...
45   45000  25000  0    1

The EDM programs process data in a format we call standard spreadsheet form. The editing programs can help transform data, such as text or unordered codes, into standard spreadsheet form. To arrive at standard spreadsheet form, labels and text are removed, leaving a spreadsheet full of numbers. Each row corresponds to a case, and each column corresponds to some measurement taken for each case. Two types of values for data-fields are allowed:

An example of data in standard form is the file named iris. This dataset consists of 150 rows (cases) and 7 columns. Each class is represented as a separate output-column with true-or-false values. The spreadsheet has 4 input-columns (features) and 3 output-columns (classes). For classification goals, the number of output-columns is the same as the number of classes. For regression, the goal is a single continuous variable with a corresponding single output-column. When only one goal is specified in the spreadsheet, the application is regression; multiple goals are classification. Predictive performance is measured by error rate for classification, and mean-squared-error for regression. The one exception is a neural net which accepts multiple continuous goals and measures error for both classification and regression.
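The conventions above can be sketched in a few lines of Python. This is only an illustration, not part of EDM: the function names are hypothetical, and it assumes that the input-columns come first in each row, followed by the output-columns.

```python
def parse_standard_form(text, rows, columns):
    """Reshape a whitespace-separated stream of numbers into rows x columns."""
    values = [float(token) for token in text.split()]
    if len(values) != rows * columns:
        raise ValueError(f"expected {rows * columns} numbers, found {len(values)}")
    return [values[r * columns:(r + 1) * columns] for r in range(rows)]


def split_case(row, input_columns):
    """Split one case into input features and output (goal) values.

    Assumption: inputs occupy the leading columns, outputs the trailing ones.
    """
    return row[:input_columns], row[input_columns:]


def task_type(output_columns):
    """One goal column means regression; several goal columns mean classification."""
    return "regression" if output_columns == 1 else "classification"


def class_of(outputs):
    """Return the 1-based number of the class whose true-or-false column is set."""
    return outputs.index(max(outputs)) + 1


# The two loan cases shown earlier: 3 input-columns and 2 class columns.
data = parse_standard_form("33 75000 50000 1 0\n45 45000 25000 0 1", 2, 5)
for row in data:
    features, goals = split_case(row, 3)
    print(features, task_type(len(goals)), class_of(goals))
```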

Datasets

The application datasets are from the UCI repository http://www.ics.uci.edu/~mlearn/MLRepository.html. The table below summarizes the characteristics of these data. The number of features counts numerical features and categorical variables decomposed into binary features. Our principal objective is data mining, so we selected those datasets with relatively large numbers of training cases and an independent set of test cases. A noise dataset is also included: a set of random numbers with a .3 prior for the smaller class.

Name       Train  Test   Features  Classes
adult      30162  15060  105       2
blackj     5000   10000  6         2
coding     5000   15000  60        2
digit      7291   2007   256       10
dna        2000   1186   180       3
isolet     6238   1559   617       26
led        5000   5000   24        10
letter     16000  4000   16        26
move       1483   1546   76        2
noise      5000   5000   20        2
satellite  4435   2000   36        6
splice     2175   1000   240       2
wave       5000   5000   40        3

Data Preparation

Editing

Verify or Reduce Spreadsheet

NEDIT: Verify correctness of input-data. Delete or keep specified columns.
Usage: nedit rows columns -flag column-file <input-data >output-data
Example: nedit 150 7 -d cols.del <iris >iris.eg1
Note: flag=d for deleting columns; flag=k for keeping columns.
Column-file contains numbers of columns to be deleted or kept.
If no column-file specified, rewrite input-data in standard form.
Input-data is a stream of numbers not necessarily in standard form.
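The effect of the d and k flags can be illustrated with a short Python sketch. It is not the NEDIT implementation. The function name is hypothetical, and it assumes that columns are numbered from 1.

```python
def edit_columns(data, column_numbers, flag):
    """Delete (flag 'd') or keep (flag 'k') the listed columns of every row.

    Assumption: column numbers are 1-based.
    """
    if flag not in ("d", "k"):
        raise ValueError("flag must be 'd' or 'k'")
    listed = set(column_numbers)
    keep_listed = flag == "k"
    return [
        [value for number, value in enumerate(row, start=1)
         if (number in listed) == keep_listed]
        for row in data
    ]


sheet = [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2]]
print(edit_columns(sheet, [2, 4], "d"))  # rows without columns 2 and 4
print(edit_columns(sheet, [2, 4], "k"))  # rows with only columns 2 and 4
```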

Merge Spreadsheets

NMERGE: Merge two spreadsheets by concatenating corresponding rows.
Usage: nmerge spreadsheet-1 spreadsheet-2 >merged-spreadsheet
Example: nmerge iris.4 iris.1 >iris.all
Note: (script)

Decode Column

DECODE: Decode specified columns into true-or-false features.
Usage: decode rows columns decode-file <input-data >output-data
Example: decode 150 5 decode.col <iris.org >iris.new
Note: Decode-file contains triples of column number, min value and max value.
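As a rough illustration of decoding, the Python sketch below replaces one coded column by a set of true-or-false features. The exact semantics of DECODE are not stated here, so this rests on assumptions: the column holds integer codes, one feature is created for each code from min value to max value, and columns are numbered from 1. The function name is hypothetical.

```python
def decode_column(data, column, low, high):
    """Replace a coded column by one true-or-false feature per code value.

    Assumptions: codes are integers in low..high and columns are 1-based.
    """
    decoded = []
    for row in data:
        code = int(row[column - 1])
        flags = [1 if code == value else 0 for value in range(low, high + 1)]
        decoded.append(row[:column - 1] + flags + row[column:])
    return decoded


# A decode-file triple "2 1 3" would correspond to column=2, low=1, high=3.
print(decode_column([[5.0, 2], [6.0, 3]], 2, 1, 3))
```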

Normalization

NORM: Normalize data to within -1 to +1.
Usage: norm rows columns precision norm-file <input-data >output-data
Example: norm 150 7 3 norm.val <iris >iris.nor
Note: Norm-file contains the column divisors.
If norm-file non-existent, divisors generated from data.
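Normalization by column divisors can be sketched as follows. This is a minimal illustration, not the NORM code. It assumes that a divisor generated from data is the largest absolute value in the column, which is one simple way to bring every value within -1 to +1. The function names are hypothetical.

```python
def column_divisors(data):
    """Largest absolute value in each column (1.0 for an all-zero column)."""
    divisors = []
    for column in zip(*data):
        largest = max(abs(value) for value in column)
        divisors.append(largest if largest > 0 else 1.0)
    return divisors


def normalize(data, divisors, precision):
    """Divide every value by its column divisor and round to the given precision."""
    return [
        [round(value / divisor, precision) for value, divisor in zip(row, divisors)]
        for row in data
    ]


sheet = [[2.0, -10.0], [4.0, 5.0]]
divisors = column_divisors(sheet)
print(divisors)
print(normalize(sheet, divisors, 3))
```

Reusing the same divisors for training and test data keeps both on the same scale, which is presumably why the divisors are kept in a file.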

Text Mining

MKDICT: Create a dictionary of most frequently occurring words in given text.
Usage: mkdict number-of-words stopwords-file input-text >dictionary
Example: mkdict 5 stops cases.txt >dict
Note: Words are alphabetic strings (case-insensitive).
Words in stopwords-file (one per line) ignored for dictionary.
Dictionary is written in lowercase, alphabetized, one word per line.
(script)

TEXT: Convert free text into spreadsheet with features as dictionary words.
Usage: text -flag dictionary case-delimiter <free-text >output-data
Example: text -c dict 10 <cases.txt >dict.ssf
Note: Flag = c ==> counts for dictionary words in output,
= i ==> true-or-false features for presence of dictionary words.
Case-delimiter must be specified in decimal ASCII (e.g. 10 for newline).
Words are alphabetic strings (case-insensitive).
Dictionary words are in lowercase letters and alphabetic order, one word per line.
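The two text-mining steps can be sketched in Python as shown below. This is an illustration under assumptions, not the MKDICT or TEXT code: the function names are hypothetical, and frequency ties are broken alphabetically, which the description above does not specify.

```python
import re
from collections import Counter


def words(text):
    """Alphabetic strings, lowercased (words are case-insensitive)."""
    return [word.lower() for word in re.findall(r"[A-Za-z]+", text)]


def make_dictionary(text, number_of_words, stopwords):
    """Most frequent non-stopwords, returned in alphabetical order."""
    stop = {word.lower() for word in stopwords}
    counts = Counter(word for word in words(text) if word not in stop)
    # Assumption: ties in frequency are broken alphabetically.
    ranked = sorted(counts, key=lambda word: (-counts[word], word))
    return sorted(ranked[:number_of_words])


def text_to_rows(text, dictionary, case_delimiter, flag):
    """One row per case; flag 'c' gives counts, flag 'i' gives true-or-false."""
    rows = []
    for case in text.split(chr(case_delimiter)):
        if not case.strip():
            continue  # skip empty cases, e.g. after a trailing delimiter
        counts = Counter(words(case))
        if flag == "c":
            rows.append([counts[word] for word in dictionary])
        else:
            rows.append([1 if counts[word] else 0 for word in dictionary])
    return rows


text = "The loan was repaid.\nThe loan was not repaid; the loan defaulted.\n"
dictionary = make_dictionary(text, 2, ["the", "was"])
print(dictionary)
print(text_to_rows(text, dictionary, 10, "c"))
```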

Data Segmentation

SEGMENT: Segment a specified column into a specified number of clusters.
Usage: segment rows columns segment-column num-clusters <input-data >output-data
Example: segment 150 7 1 3 <iris >iris.seg
Note: Segment-column is replaced by true-or-false features corresponding to each cluster.

Clustering and Assigning Labels

CLUSTER: Assign rows to k clusters. Uses multivariate k-means clustering.
Usage: cluster rows input-columns k <input-data >output-data
Example: cluster 150 4 3 <iris.4 >iris.cls
Note: Output-data consists of true-or-false labels for each of the k clusters.
Input-data should be normalized when column values are measured on very different scales.
Pick best k by using mean within-cluster error as a guide.
Use NMERGE to combine output labels with the input-data.
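For readers unfamiliar with k-means, the Python sketch below shows the standard iteration (Lloyd's algorithm), the true-or-false labels, and the mean within-cluster error. It is not the CLUSTER implementation. The random initialization, the seed parameter, and the function names are assumptions made for the illustration.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(data, k, iterations=100, seed=0):
    """Assign each row to one of k clusters; return (assignment, centers)."""
    rng = random.Random(seed)
    centers = [list(row) for row in rng.sample(data, k)]  # assumed initialization
    assignment = None
    for _ in range(iterations):
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(row, centers[c]))
            for row in data
        ]
        if new_assignment == assignment:
            break  # no row changed cluster: converged
        assignment = new_assignment
        for c in range(k):
            members = [row for row, a in zip(data, assignment) if a == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers


def cluster_labels(assignment, k):
    """One true-or-false column per cluster, as in the output-data."""
    return [[1 if a == c else 0 for c in range(k)] for a in assignment]


def mean_within_cluster_error(data, assignment, centers):
    """Mean squared distance of each row to its cluster center."""
    total = sum(squared_distance(row, centers[a]) for row, a in zip(data, assignment))
    return total / len(data)


points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
assignment, centers = kmeans(points, 2)
print(cluster_labels(assignment, 2))
print(mean_within_cluster_error(points, assignment, centers))
```

Running the sketch for several values of k and comparing the mean within-cluster error is one way to follow the guideline for picking k.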

Feature Reduction and Selection

Independent Significance

SIGNIF: Feature selection by significance testing (with feature independence).
Usage: signif rows input-columns output-columns sig <input-data >feature-nums
Example: signif 150 4 3 2.5 <iris >iris.s25
Note: Sig is used in the significance test and defaults to 2 if not specified.
Applicable only to classification problems (output-columns>1).
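The test used by SIGNIF is not spelled out here. As an illustration only, the Python sketch below uses one standard test of this kind: for each feature, the difference between the mean within a class and the mean outside it is divided by its standard error, and the feature is kept if the ratio exceeds sig for any class. The test statistic, the class-versus-rest comparison, and the function names are all assumptions.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def significance(a, b):
    """Absolute difference of means divided by its standard error."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return float("inf") if mean(a) != mean(b) else 0.0
    return abs(mean(a) - mean(b)) / se


def select_features(inputs, class_ids, sig=2.0):
    """Return 1-based numbers of features significant for at least one class."""
    selected = []
    for j in range(len(inputs[0])):
        column = [row[j] for row in inputs]
        for c in sorted(set(class_ids)):
            inside = [v for v, y in zip(column, class_ids) if y == c]
            outside = [v for v, y in zip(column, class_ids) if y != c]
            if outside and significance(inside, outside) > sig:
                selected.append(j + 1)
                break
    return selected


inputs = [[1.0, 5.0], [1.2, 4.0], [0.9, 5.0], [3.0, 4.0], [3.1, 5.0], [2.9, 4.0]]
class_ids = [1, 1, 1, 2, 2, 2]
print(select_features(inputs, class_ids, sig=2.0))
```

The test treats each feature independently, so it cannot detect features that are useful only in combination.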

Joint Significance by Tree Selection

MSIG: Feature selection using a tree.
Usage: msig rows input-columns output-columns sig <input-data >feature-nums
Example: msig 150 4 3 2.5 <iris >iris.f25
Note: Sig controls the amount of pruning and defaults to 4 if not specified.
(script)

Value Reduction and Smoothing

Rounding

ROUND: Smooth values by roundoff to a maximum of maxv distinct values per input-column.
Usage: round rows input-columns output-columns maxv <input-data >output-data
Example: round 150 4 3 15 <iris >iris.smo
Note: Values in output-columns are preserved and not subject to smoothing.
Values are rounded to a maximum of two decimal places.

K-Means Clustering

KMEANS: Smooth values by k-means clustering to a maximum of maxv distinct values per input-column.
Usage: km rows input-columns output-columns maxv <input-data >output-data
Example: km 150 4 3 15 <iris >iris.km
Note: Values in output-columns are preserved and not subject to smoothing.

Class Entropy

ENT: Smooth values by entropy-based binning.
Usage: ent rows input-columns output-columns thresh <input-data >output-data
Example: ent 150 4 3 7 <iris >iris.ent
Note: Values in output-columns are preserved and not subject to smoothing.
Thresh>1 ==> thresh = maximum distinct values per input-column.
Thresh<1 ==> thresh = gain threshold for incremental bin addition.
Thresh defaults to 0.01 if unspecified.
Applicable only to classification problems (output-columns>1).

Case Reduction and Sampling

Random Samples

RANDOM: Random subsample from spreadsheet.
Usage: random rows columns percent seed-file <input-data >output-data
Example: random 150 7 67 seed.67 <iris >iris.trn
Note: If seed-file non-existent, clock used to generate a seed.
If seed-file unspecified, seed is not saved.
percent=100 ==> bootstrap sample.

NSPLIT: Obtain the complement of rows selected by RANDOM.
Usage: nsplit rows columns seed-file <input-data >output-data
Example: nsplit 150 7 seed.67 <iris >iris.tst
Note: Seed-file must have been generated by RANDOM.
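The relationship between RANDOM and NSPLIT can be sketched in Python: the same seed reproduces the same selection, so the complement can be computed later. This is not the EDM code. The function names are hypothetical, and truncating the sample size to a whole number of rows is an assumption.

```python
import random


def random_rows(rows, percent, seed):
    """Row indices (0-based) of a random subsample; percent=100 gives a bootstrap."""
    rng = random.Random(seed)
    if percent == 100:
        # Bootstrap sample: draw with replacement, same size as the data.
        return [rng.randrange(rows) for _ in range(rows)]
    count = rows * percent // 100  # assumption: truncate to whole rows
    return sorted(rng.sample(range(rows), count))


def complement_rows(rows, percent, seed):
    """Rows not selected by random_rows for the same seed."""
    chosen = set(random_rows(rows, percent, seed))
    return [row for row in range(rows) if row not in chosen]


train = random_rows(150, 67, seed=67)
test = complement_rows(150, 67, seed=67)
print(len(train), len(test))
```

With 150 rows and 67 percent, this split yields 100 training rows and 50 test rows, the sizes used for iris.trn and iris.tst in later examples.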

RANDOMWT: Random sampling based on specified case probabilities
Usage: randomwt rows columns percent flag missfile seed-file <input-data >output-data
Example: randomwt 150 7 67 3 iris.mis seed.67 <iris >iris.tr2
Note: If seed-file non-existent, clock used to generate a seed.
If seed-file unspecified, seed is not saved.
Flag = 1 for regression missfile; >1 for classification missfile.
Missfile contains cumulative misses for each row.
Non-existent missfile ==> all rows equally probable.

ARCWTS: Obtain error count for each row (case) in spreadsheet.
Usage: arcwts rows output-columns missfile <new-data
Example: arcwts 150 3 iris.mis <iris.res
Note: New-data must consist of triples: row#,actual,predicted.
Missfile contains errors for each row (updated with new-data).
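How a missfile might drive sampling for classification is sketched below in Python. The update step follows the description above. The weighting is an assumption: the sketch uses the arc-x4 scheme, in which the probability of a row is proportional to 1 + m^4 for m cumulative misses, and RANDOMWT may weight rows differently. The function names and 1-based row numbers are also assumptions.

```python
import random


def update_misses(misses, results):
    """Add one miss for every (row#, actual, predicted) triple that is an error.

    Assumption: row numbers in the triples are 1-based.
    """
    updated = list(misses)
    for row, actual, predicted in results:
        if actual != predicted:
            updated[row - 1] += 1
    return updated


def weighted_sample(misses, percent, seed):
    """Sample row indices (0-based) with replacement, favoring often-missed rows."""
    rng = random.Random(seed)
    weights = [1 + m ** 4 for m in misses]  # assumed arc-x4 weighting
    count = len(misses) * percent // 100
    return rng.choices(range(len(misses)), weights=weights, k=count)


misses = update_misses([0, 0, 0, 0], [(1, 1, 1), (2, 2, 3), (3, 1, 1), (4, 3, 3)])
print(misses)
print(weighted_sample(misses, 100, seed=3))
```

With all misses at zero the weights are equal, which matches the note that a non-existent missfile makes all rows equally probable.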

Voting and Averaging Solutions

AVEVOTE: Combine solutions by averaging (for regression) or voting (for classification).
Usage: avevote rows output-columns <vote-file >results
Example: avevote 150 3 <iris.vot >iris.cas
Note: Vote-file must consist of triples: row#,actual,predicted.
In voting, ties are resolved in favor of the lowest numbered goal.
Case-by-case output: row#,actual-answer,predicted-answer.
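The combination rule can be sketched in Python as follows. The function name is hypothetical, and the triples are assumed to be already parsed into tuples; the tie rule follows the note above.

```python
from collections import Counter, defaultdict


def combine(triples, output_columns):
    """Combine (row#, actual, predicted) triples into one answer per row.

    One output-column: average the predictions (regression).
    Several output-columns: majority vote, ties go to the lowest numbered goal.
    """
    actuals = {}
    predictions = defaultdict(list)
    for row, actual, predicted in triples:
        actuals[row] = actual
        predictions[row].append(predicted)

    results = []
    for row in sorted(predictions):
        answers = predictions[row]
        if output_columns == 1:
            combined = sum(answers) / len(answers)
        else:
            counts = Counter(answers)
            top = max(counts.values())
            combined = min(goal for goal, n in counts.items() if n == top)
        results.append((row, actuals[row], combined))
    return results


votes = [(1, 2, 2), (1, 2, 3), (1, 2, 3), (2, 1, 1), (2, 1, 2)]
print(combine(votes, 3))
```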

Bagging - Bootstrap Resampled and Combined Solutions

BAG: Combine (by averaging or voting) results by random subsampling (bagging).
Usage: bag trainrows trainfile testrows testfile input-columns output-columns percent iterations sig
Example: bag 100 iris.trn 50 iris.tst 4 3 40 10 1.5 >bag.log
Note: Script uses tree method; other methods can be substituted.
Output consists of performance of individual trees and combined results.
Samples with replacement.
Sig defaults to 0 if unspecified (sig=0 ==> no pruning).
Java class: a file may be specified for saving trees generated in each iteration.
(script)

Boosting - Adaptively Resampled and Combined Solutions

BOOST: Combine (by averaging or voting) results by adaptive subsampling (boosting).
Usage: boost trainrows trainfile testrows testfile input-columns output-columns percent iterations sig
Example: boost 100 iris.trn 50 iris.tst 4 3 40 10 1 >boost.log
Note: Script uses tree method; other methods can be substituted.
Output consists of performance of individual trees and combined results.
Samples with replacement.
Sig defaults to 1.5 if unspecified (specify sig=0 for no pruning).
Java class: a file may be specified for saving trees generated in each iteration.
(script)

Prediction Methods - Classification and Regression

Math Methods

Linear

LINEAR: Generate a linear solution.
Usage: linear rows input-columns output-columns <input-data >weights
Example: linear 150 4 3 <iris >iris.wts
Note: Least squares regression if output-columns=1; linear discriminant if output-columns>1.

TESTLINE: Apply a linear solution to data.
Usage: testline rows input-columns output-columns weightfile <input-data >results
Example: testline 150 4 3 iris.wts <iris >iris.res
Note: Least squares regression if output-columns=1; linear discriminant if output-columns>1.
Case-by-case output: row#,actual-answer,predicted-answer.

Neural Nets

NNET: Generate a neural net solution.
Usage: nnet rows input-columns output-columns weightfile hidden-units test-rows testfile <input-data
Example: nnet 150 4 3 irisnn.wts 3 <iris.nor >nnet.log
Note: Input data must be normalized to within -1 to 1.
Hidden-units defaults to a value that gives 2 cases per weight if unspecified.
Test-rows and testfile default to training data if not specified.
Weights saved incrementally after each iteration.
Proprietary optimization procedure for single hidden layer.
Trains well with many hidden units.
Computationally expensive; can run for days with big data.

TESTNET: Apply a neural net solution to data.
Usage: testnet rows input-columns output-columns weightfile <input-data >results
Example: testnet 150 4 3 irisnn.wts <iris.nor >iris.re1
Note: Case-by-case output: row#,actual-answer,predicted-answer.

Distance Methods - Nearest Neighbors

TESTNEAR: Apply a k-nearest neighbor solution to data.
Usage: testnear rows input-columns output-columns k #train trainfile <input-data >results
Example: testnear 50 4 3 1 100 iris.trn <iris.tst >iris.re2
Note: Case-by-case output: row#,actual-answer,predicted-answer.
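A k-nearest neighbor classifier producing the same case-by-case triples can be sketched in Python. This is an illustration, not the TESTNEAR code. It assumes squared Euclidean distance on the input-columns, true-or-false class columns at the end of each row, and ties in the vote resolved in favor of the lowest numbered class. The function names are hypothetical.

```python
from collections import Counter


def class_of(outputs):
    """1-based number of the class whose true-or-false column is set."""
    return outputs.index(max(outputs)) + 1


def nearest_neighbor_results(train, test, input_columns, k):
    """Return (row#, actual, predicted) for every test case."""
    results = []
    for number, case in enumerate(test, start=1):
        features = case[:input_columns]

        def distance(train_case):
            return sum((a - b) ** 2 for a, b in zip(features, train_case[:input_columns]))

        neighbors = sorted(train, key=distance)[:k]
        votes = Counter(class_of(t[input_columns:]) for t in neighbors)
        top = max(votes.values())
        predicted = min(c for c, n in votes.items() if n == top)  # assumed tie rule
        results.append((number, class_of(case[input_columns:]), predicted))
    return results


train = [[0, 0, 1, 0], [0, 1, 1, 0], [5, 5, 0, 1], [6, 5, 0, 1]]
test = [[0.2, 0.1, 1, 0], [5.5, 5, 1, 0]]
print(nearest_neighbor_results(train, test, 2, 1))
```

Because distances are computed on raw values, normalizing the input-columns first is advisable when they are measured on very different scales.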

Logic Methods

LOGIC: Display logic solution (tree or rules).
Usage: logic type output-columns solution >output
Example: logic r 3 iris.rul >rules.pp
Note: Type=t ==> tree solution; type=r ==> rule solution.

Decision Tree

DTREE: Generate a full decision tree solution.
Usage: dtree rows input-columns output-columns <input-data >cover-tree
Example: dtree 150 4 3 <iris >tree.np
Note: Output-columns=1 ==> regression; output-columns>1 ==> classification.
Tree format: node,feature,value,left-node,right-node,node-label,#cases,errors.
Use logic routine to display tree.

PRUNE: Prune a tree.
Usage: prune input-tree output-columns signif-level >pruned-tree
Example: prune tree.np 3 1.65 >tree.165

TESTTREE: Apply a tree solution to data.
Usage: testtree rows input-columns output-columns tree <input-data >results
Example: testtree 150 4 3 tree.165 <iris >iris.re3
Note: Case-by-case output: row#,actual-answer,predicted-answer.

SETREE: Obtain the min-error and x-SE solution trees (x is user-specified).
Usage: setree trainrows trainfile testrows testfile input-columns output-columns treefile se
Example: setree 100 iris.trn 50 iris.tst 4 3 tree.np 1 >setree.log
Note: Output consists of performance of individual pruned trees.
The full tree is saved in treefile.
Se=0 gives the min-error solution tree.
(script)
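The selection of an x-SE tree can be sketched in Python. The sketch assumes the usual form of this rule for classification: take the pruned tree with minimum test error, compute the standard error of that error rate as sqrt(e(1-e)/n) for n test cases, and pick the smallest tree whose error is within x standard errors of the minimum. The function names and the error rates in the example are hypothetical.

```python
from math import sqrt


def standard_error(error_rate, test_cases):
    """Standard error of an error rate measured on independent test cases."""
    return sqrt(error_rate * (1 - error_rate) / test_cases)


def select_tree(candidates, test_cases, se):
    """Pick the smallest tree within se standard errors of the minimum error.

    candidates is a list of (tree size, test error rate) pairs.
    """
    _, best_error = min(candidates, key=lambda c: (c[1], c[0]))
    limit = best_error + se * standard_error(best_error, test_cases)
    eligible = [c for c in candidates if c[1] <= limit]
    return min(eligible, key=lambda c: c[0])


# Hypothetical pruned trees: (number of nodes, error rate on 50 test cases).
pruned = [(1, 0.66), (3, 0.10), (5, 0.06), (9, 0.04)]
print(select_tree(pruned, 50, se=0))  # min-error tree
print(select_tree(pruned, 50, se=1))  # 1-SE tree
```

A larger se trades a small, statistically insignificant loss in accuracy for a simpler tree.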

Decision Rules

DRULE: Generate decision rules.
Usage: drule flag tree rows input-columns output-columns rulefile <input-data
Example: drule -O tree.165 150 4 3 iris.rul <iris >rul.log
Note: Converts a (pruned) tree to decision rules and stores rules in rulefile.
The optional flag -O turns on swap-based optimizations to reduce classification error.
Optimization is computationally expensive and should be applied to a tree size smaller than the minimum error tree.
Rules are evaluated in order. Answer of first satisfied rule is selected.
Use logic routine to display rules.
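Ordered evaluation of rules can be sketched in Python. The rule representation below is hypothetical and is not the rulefile format: each rule is a list of (feature number, comparison, value) conditions plus an answer, and a rule with no conditions acts as the default. The thresholds are made up for illustration.

```python
def satisfies(case, conditions):
    """True if the case meets every (feature#, comparison, value) condition."""
    for feature, comparison, value in conditions:
        x = case[feature - 1]  # assumption: features are numbered from 1
        if comparison == "<=":
            if not x <= value:
                return False
        elif comparison == ">":
            if not x > value:
                return False
        else:
            raise ValueError(f"unknown comparison {comparison!r}")
    return True


def apply_rules(case, rules):
    """Evaluate rules in order; the answer of the first satisfied rule wins."""
    for conditions, answer in rules:
        if satisfies(case, conditions):
            return answer
    return None  # no rule fired and no default rule was given


# Hypothetical rules for a three-class problem with four features.
rules = [
    ([(3, "<=", 2.5)], 1),
    ([(4, "<=", 1.7), (3, "<=", 4.9)], 2),
    ([], 3),  # default rule: always satisfied
]
print(apply_rules([5.1, 3.5, 1.4, 0.2], rules))
print(apply_rules([6.3, 3.3, 6.0, 2.5], rules))
```

Because of the ordering, a later rule only applies to cases not covered by earlier rules, so rules cannot be read in isolation.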

TESTRULE: Apply a rule-based solution to data.
Usage: testrule rows input-columns output-columns rulefile <input-data >results
Example: testrule 150 4 3 iris.rul <iris >iris.re4
Note: Case-by-case output: row#,actual-answer,predicted-answer.

Association Rules

ASSRULE: Generate association rules.
Usage: assrule treefile pmin covmin lenmax class rows input-columns output-columns <input-data >rules
Example: assrule tree.np 0.8 0.3 17 3 150 4 3 <iris >cls3.asr
Note: All rules generated for target-goal only.
Pmin = minimum predictive value of generated rules.
Covmin = minimum coverage of generated rules.
Lenmax = maximum length of generated rules.
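The two thresholds can be illustrated with a Python sketch. The definitions used here are assumptions, since they are not spelled out above: predictive value is taken as the fraction of cases satisfying the rule that belong to the target class, and coverage as the fraction of target-class cases that satisfy the rule. The function names are hypothetical.

```python
def rule_statistics(cases, target_flags, rule):
    """Return (predictive value, coverage) of a rule for the target class.

    rule is a function from a case to True or False;
    target_flags holds 1 for cases in the target class and 0 otherwise.
    """
    matched = [flag for case, flag in zip(cases, target_flags) if rule(case)]
    hits = sum(matched)
    in_class = sum(target_flags)
    predictive_value = hits / len(matched) if matched else 0.0
    coverage = hits / in_class if in_class else 0.0
    return predictive_value, coverage


def acceptable(cases, target_flags, rule, pmin, covmin):
    """True if the rule meets both minimum thresholds."""
    predictive_value, coverage = rule_statistics(cases, target_flags, rule)
    return predictive_value >= pmin and coverage >= covmin


cases = [[1.0], [2.0], [3.0], [4.0]]
target_flags = [1, 1, 0, 1]
print(rule_statistics(cases, target_flags, lambda case: case[0] <= 3.0))
```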

Predictive Performance Summary

SUMMARY: Generate (html) tables summarizing error for cases, classes, and intervals.
Usage: summary rows output-columns <results >tables
Example: summary 150 3 <iris.res >tables.html
Note: Results are the output of testtree, testrule, testline, testnet, testnear, avevote.
Output-columns are the number of goals in the original data file.
Tables must be viewed within a browser.
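The basic measures behind such a summary can be computed from the result triples as in the Python sketch below. It is an illustration only and does not reproduce the html tables. The function names are hypothetical, and classes are assumed to be numbered from 1.

```python
def error_rate(results):
    """Fraction of (row#, actual, predicted) triples that are misclassified."""
    return sum(1 for _, actual, predicted in results if actual != predicted) / len(results)


def mean_squared_error(results):
    """Mean squared difference between actual and predicted values (regression)."""
    return sum((actual - predicted) ** 2 for _, actual, predicted in results) / len(results)


def confusion_matrix(results, classes):
    """table[a-1][p-1] counts cases of actual class a predicted as class p."""
    table = [[0] * classes for _ in range(classes)]
    for _, actual, predicted in results:
        table[actual - 1][predicted - 1] += 1
    return table


results = [(1, 1, 1), (2, 2, 3), (3, 3, 3), (4, 1, 1)]
print(error_rate(results))
for row in confusion_matrix(results, 3):
    print(row)
```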

