(c) 1997-2002 Sholom M. Weiss and Nitin Indurkhya
The Enterprise Data-Miner (EDM) is a collection of standalone programs that perform most common data-mining tasks. All examples below use the iris dataset, a sample of 150 cases that is included in the package. A variety of other datasets is also provided.
|
Take advantage of the latest software technology
with a Java version of EDM.
|
Important Note:
When using the EDM programs, type the program name with no arguments, and the program will display a description of the correct syntax.
EDM is available for Unix, Windows 9x/NT/2000 and Java. The platforms for the Unix versions are listed in the web page quick summary. Functionally, EDM is the same across all platforms. A command-line interface is used: a program name is typed, followed by a list of arguments. The Java version also has a Graphical User Interface (GUI) and can be run entirely as an applet using Netscape or Internet Explorer. Complete documentation is available offline in the file edm.html.
Working in an MS-DOS window:
The datasets provided with EDM are downloaded separately from http://www.data-miner.com/download.html. Save the file as edmdata.zip and unzip it in a separate directory.
The EDM programs process data in a format we call standard spreadsheet form. The editing programs can help transform data, such as text or unordered codes, into standard spreadsheet form. To arrive at standard spreadsheet form, labels and text are removed, leaving a spreadsheet full of numbers. Each row corresponds to a case, and each column corresponds to some measurement taken for each case. Two types of values for data-fields are allowed:
The application datasets are from the UCI repository http://www.ics.uci.edu/~mlearn/MLRepository.html. The table below summarizes the characteristics of these data. The number of features counts numerical features plus the binary features obtained by decomposing categorical variables. Our principal objective is data mining, so we selected datasets with relatively large numbers of training cases and an independent set of test cases. A noise dataset is also included: a set of random numbers with a 0.3 prior for the smaller class.
| Name | Train | Test | Features | Classes |
| adult | 30162 | 15060 | 105 | 2 |
| blackj | 5000 | 10000 | 6 | 2 |
| coding | 5000 | 15000 | 60 | 2 |
| digit | 7291 | 2007 | 256 | 10 |
| dna | 2000 | 1186 | 180 | 3 |
| isolet | 6238 | 1559 | 617 | 26 |
| led | 5000 | 5000 | 24 | 10 |
| letter | 16000 | 4000 | 16 | 26 |
| move | 1483 | 1546 | 76 | 2 |
| noise | 5000 | 5000 | 20 | 2 |
| satellite | 4435 | 2000 | 36 | 6 |
| splice | 2175 | 1000 | 240 | 2 |
| wave | 5000 | 5000 | 40 | 3 |
| NEDIT: | Verify correctness of input-data. Delete or keep specified columns. |
| Usage: | nedit rows columns -flag column-file <input-data >output-data |
| Example: | nedit 150 7 -d cols.del <iris >iris.eg1 |
| Note: | flag=d for deleting columns; flag=k for keeping columns. |
| Column-file contains numbers of columns to be deleted or kept. | |
| If no column-file specified, rewrite input-data in standard form. | |
| Input-data is a stream of numbers not necessarily in standard form. |
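The NEDIT description above can be illustrated with a short Python sketch. This is not the EDM implementation: the function name `edit_columns` and the 1-based column numbering are assumptions made for illustration. It reads a whitespace-separated stream of numbers, checks it against the stated rows and columns, and then deletes or keeps the listed columns.

```python
def edit_columns(stream, rows, columns, flag=None, cols=()):
    """Reshape a stream of numbers into rows x columns, then delete ('d') or keep ('k') columns.

    Column numbers are assumed to be 1-based (a hypothetical convention).
    """
    values = stream.split()
    if len(values) != rows * columns:
        raise ValueError(f"expected {rows * columns} numbers, found {len(values)}")
    for v in values:
        float(v)  # raises ValueError if a field is not numeric
    table = [values[r * columns:(r + 1) * columns] for r in range(rows)]
    if flag is None:
        return table  # no column-file: just rewrite the data row by row
    chosen = set(cols)
    if flag == "d":
        keep = [c for c in range(1, columns + 1) if c not in chosen]
    elif flag == "k":
        keep = [c for c in range(1, columns + 1) if c in chosen]
    else:
        raise ValueError("flag must be 'd' or 'k'")
    return [[row[c - 1] for c in keep] for row in table]
```

For example, `edit_columns("1 2 3 4 5 6", 2, 3, "d", [2])` drops the middle column of a 2 x 3 table.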
| NMERGE: | Merge two spreadsheets by concatenating corresponding rows. |
| Usage: | nmerge spreadsheet-1 spreadsheet-2 >merged-spreadsheet |
| Example: | nmerge iris.4 iris.1 >iris.all |
| Note: | (script) |
| DECODE: | Decode specified columns into true-or-false features. |
| Usage: | decode rows columns decode-file <input-data >output-data |
| Example: | decode 150 5 decode.col <iris.org >iris.new |
| Note: | Decode-file contains triples of column number, min value and max value. |
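One plausible reading of DECODE, sketched in Python below, is that a column of integer codes between the min and max values is replaced by one true-or-false column per code. That interpretation, the function name `decode_column`, and the 1-based column numbering are assumptions, not confirmed EDM behavior. Under this reading, a five-column file whose last column holds codes 1 to 3 would become a seven-column file.

```python
def decode_column(table, column, lo, hi):
    """Replace a 1-based `column` of integer codes in [lo, hi] with one 0/1 column per code."""
    out = []
    for row in table:
        code = int(row[column - 1])
        if not lo <= code <= hi:
            raise ValueError(f"code {code} outside {lo}..{hi}")
        # One true-or-false feature per possible code value.
        flags = [1 if code == v else 0 for v in range(lo, hi + 1)]
        out.append(row[:column - 1] + flags + row[column:])
    return out
```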
| NORM: | Normalize data to within -1 to +1. |
| Usage: | norm rows columns precision norm-file <input-data >output-data |
| Example: | norm 150 7 3 norm.val <iris >iris.nor |
| Note: | Norm-file contains the column divisors. |
| If norm-file non-existent, divisors generated from data. |
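A minimal Python sketch of normalization by column divisors follows. It assumes that precision means the number of decimal places kept and that, when no divisors are supplied, each divisor is the largest absolute value in its column; neither rule is stated in this summary, and `normalize` is a hypothetical name.

```python
def normalize(table, precision, divisors=None):
    """Divide each column by its divisor so that values fall within -1..+1."""
    ncols = len(table[0])
    if divisors is None:
        # Assumed rule: largest absolute value in the column (1.0 for an all-zero column).
        divisors = [max(abs(row[c]) for row in table) or 1.0 for c in range(ncols)]
    scaled = [[round(row[c] / divisors[c], precision) for c in range(ncols)] for row in table]
    return scaled, divisors
```

Returning the divisors alongside the scaled data mirrors the role of the norm-file: the same divisors can be reused to normalize new data consistently.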
| MKDICT: | Create a dictionary of most frequently occurring words in given text. |
| Usage: | mkdict number-of-words stopwords-file input-text >dictionary |
| Example: | mkdict 5 stops cases.txt >dict |
| Note: | Words are alphabetic strings (case-insensitive). |
| Words in stopwords-file (one per line) ignored for dictionary. | |
| Dictionary is written in lowercase, alphabetized, one word per line. | |
| (script) |
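The dictionary-building steps described for MKDICT can be sketched in Python as follows. The function name `make_dictionary` is hypothetical, and breaking frequency ties alphabetically is an assumption, since the tie rule is not documented here.

```python
import re
from collections import Counter

def make_dictionary(text, number_of_words, stopwords=()):
    """Return the most frequent non-stopword words, lowercased and alphabetized."""
    stops = {w.lower() for w in stopwords}
    # Words are alphabetic strings, compared case-insensitively.
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stops]
    counts = Counter(words)
    # Assumed tie rule: equal counts are ordered alphabetically.
    top = sorted(counts, key=lambda w: (-counts[w], w))[:number_of_words]
    return sorted(top)
```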
| TEXT: | Convert free text into spreadsheet with features as dictionary words. |
| Usage: | text -flag dictionary case-delimiter <free-text >output-data |
| Example: | text -c dict 10 <cases.txt >dict.ssf |
| Note: | Flag = c ==> counts for dictionary words in output, |
| = i ==> true-or-false features for presence of dictionary words. | |
| Case-delimiter must be specified in decimal ASCII (e.g. 10 for newline). | |
| Words are alphabetic strings (case-insensitive). | |
| Dictionary words are in lowercase letters and alphabetic order, one word per line. |
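As a rough illustration of the TEXT conversion, the Python sketch below splits free text into cases on a delimiter given in decimal ASCII and produces one row per case with one column per dictionary word. The name `text_to_spreadsheet` is hypothetical, and skipping empty cases (for instance after a trailing delimiter) is an assumption.

```python
import re

def text_to_spreadsheet(free_text, dictionary, delimiter=10, flag="c"):
    """flag='c' gives word counts; flag='i' gives 0/1 presence indicators."""
    rows = []
    for case in free_text.split(chr(delimiter)):
        if not case.strip():
            continue  # assumed: empty cases are skipped
        words = re.findall(r"[a-z]+", case.lower())
        counts = [words.count(w) for w in dictionary]
        rows.append(counts if flag == "c" else [1 if n else 0 for n in counts])
    return rows
```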
| SEGMENT: | Segment a specified column into a specified number of clusters. |
| Usage: | segment rows columns segment-column num-clusters <input-data >output-data |
| Example: | segment 150 7 1 3 <iris >iris.seg |
| Note: | Segment-column is replaced by true-or-false features corresponding to each cluster. |
| CLUSTER: | Assign rows to k clusters. Uses multivariate k-means clustering. |
| Usage: | cluster rows input-columns k <input-data >output-data |
| Example: | cluster 150 4 3 <iris.4 >iris.cls |
| Note: | Output-data consists of true-or-false labels for each of the k clusters. |
| Input-data should be normalized when column values are measured on very different scales. | |
| Pick best k by using mean within-cluster error as a guide. | |
| Use NMERGE to combine output labels with the input-data. |
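The following Python sketch shows plain k-means producing true-or-false cluster labels and a mean within-cluster error that can be compared across values of k. It is a textbook version, not the EDM code: initializing the centers from the first k rows and measuring error as mean squared distance to the assigned center are assumptions, and `kmeans_labels` is a hypothetical name.

```python
def kmeans_labels(rows, k, iterations=100):
    """Return (one 0/1 label row per case, mean squared distance to assigned center)."""
    centers = [list(r) for r in rows[:k]]  # assumed initialization: first k rows

    def nearest(row):
        dists = [sum((a - b) ** 2 for a, b in zip(row, c)) for c in centers]
        return dists.index(min(dists))

    assign = None
    for _ in range(iterations):
        new_assign = [nearest(r) for r in rows]
        if new_assign == assign:
            break  # assignments stopped changing
        assign = new_assign
        for j in range(k):
            members = [r for r, a in zip(rows, assign) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    error = sum(
        sum((x - c) ** 2 for x, c in zip(r, centers[j])) for r, j in zip(rows, assign)
    ) / len(rows)
    labels = [[1 if j == c else 0 for c in range(k)] for j in assign]
    return labels, error
```

Running the sketch for several values of k and comparing the returned error gives a simple guide for choosing k.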
| SIGNIF: | Feature selection by significance testing (with feature independence). |
| Usage: | signif rows input-columns output-columns sig <input-data >feature-nums |
| Example: | signif 150 4 3 2.5 <iris >iris.s25 |
| Note: | Sig used in the significance test and defaults to 2 if not specified. |
| Applicable only to classification problems (output-columns>1). |
| MSIG: | Feature selection using a tree. |
| Usage: | msig rows input-columns output-columns sig <input-data >feature-nums |
| Example: | msig 150 4 3 2.5 <iris >iris.f25 |
| Note: | Sig controls the amount of pruning and defaults to 4 if not specified. |
| (script) |
| ROUND: | Smooth values by roundoff to a maximum of maxv distinct values per input-column. |
| Usage: | round rows input-columns output-columns maxv <input-data >output-data |
| Example: | round 150 4 3 15 <iris >iris.smo |
| Note: | Values in output-columns are preserved and not subject to smoothing. |
| Rounding to a maximum of two decimal places. |
| KMEANS: | Smooth values by k-means clustering to a maximum of maxv distinct values per input-column. |
| Usage: | km rows input-columns output-columns maxv <input-data >output-data |
| Example: | km 150 4 3 15 <iris >iris.km |
| Note: | Values in output-columns are preserved and not subject to smoothing. |
| ENT: | Smooth values by entropy-based binning. |
| Usage: | ent rows input-columns output-columns thresh <input-data >output-data |
| Example: | ent 150 4 3 7 <iris >iris.ent |
| Note: | Values in output-columns are preserved and not subject to smoothing. |
| Thresh>1 ==> thresh = maximum distinct values per input-column. | |
| Thresh<1 ==> thresh = gain threshold for incremental bin addition. | |
| Thresh defaults to 0.01 if unspecified. | |
| Applicable only to classification problems (output-columns>1). |
| RANDOM: | Random subsample from spreadsheet. |
| Usage: | random rows columns percent seed-file <input-data >output-data |
| Example: | random 150 7 67 seed.67 <iris >iris.trn |
| Note: | If seed-file non-existent, clock used to generate a seed. |
| If seed-file unspecified, seed is not saved. | |
| percent=100 ==> bootstrap sample. |
| NSPLIT: | Obtain the complement of rows selected by RANDOM. |
| Usage: | nsplit rows columns seed-file <input-data >output-data |
| Example: | nsplit 150 7 seed.67 <iris >iris.tst |
| Note: | Seed-file must have been generated by RANDOM. |
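The relationship between RANDOM and NSPLIT, where a saved seed lets the complement of a sample be recovered, can be sketched in Python as below. The name `random_split`, the use of Python's own random generator, and truncating the sample size to a whole number are assumptions; the sketch samples without replacement and does not cover the percent=100 bootstrap case.

```python
import random

def random_split(rows, percent, seed):
    """Return (sample, complement); the same seed always selects the same rows."""
    rng = random.Random(seed)
    n = len(rows) * percent // 100  # assumed: sample size is truncated
    chosen = set(rng.sample(range(len(rows)), n))
    sample = [r for i, r in enumerate(rows) if i in chosen]
    complement = [r for i, r in enumerate(rows) if i not in chosen]
    return sample, complement
```

With 150 rows and percent=67, this yields 100 sampled rows and 50 complementary rows, in their original order.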
| RANDOMWT: | Random sampling based on specified case probabilities. |
| Usage: | randomwt rows columns percent flag missfile seed-file <input-data >output-data |
| Example: | randomwt 150 7 67 3 iris.mis seed.67 <iris >iris.tr2 |
| Note: | If seed-file non-existent, clock used to generate a seed. |
| If seed-file unspecified, seed is not saved. | |
| Flag = 1 for regression missfile; >1 for classification missfile. | |
| Missfile contains cumulative misses for each row. | |
| Non-existent missfile ==> all rows equally probable. |
| ARCWTS: | Obtain error count for each row (case) in spreadsheet. |
| Usage: | arcwts rows output-columns missfile <new-data |
| Example: | arcwts 150 3 iris.mis <iris.res |
| Note: | New-data must consist of triples: row#,actual,predicted. |
| Missfile contains errors for each row (updated with new-data). |
| AVEVOTE: | Combine solutions by averaging (for regression) or voting (for classification). |
| Usage: | avevote rows output-columns <vote-file >results |
| Example: | avevote 150 3 <iris.vot >iris.cas |
| Note: | Vote-file must consist of triples: row#,actual,predicted. |
| In voting, ties are resolved in favor of the lowest numbered goal. | |
| Case-by-case output: row#,actual-answer,predicted-answer. |
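The combination rule described for AVEVOTE can be sketched in Python as follows: predictions for the same row are averaged for regression, or put to a majority vote for classification with ties going to the lowest numbered goal. The function name `combine` and the in-memory triple format are illustrative assumptions.

```python
from collections import defaultdict

def combine(triples, output_columns):
    """triples: (row#, actual, predicted). Returns (row#, actual, combined) sorted by row#."""
    actual, preds = {}, defaultdict(list)
    for row, act, pred in triples:
        actual[row] = act
        preds[row].append(pred)
    results = []
    for row in sorted(preds):
        p = preds[row]
        if output_columns == 1:
            answer = sum(p) / len(p)  # regression: average the predictions
        else:
            top = max(p.count(g) for g in p)  # classification: highest vote count
            answer = min(g for g in p if p.count(g) == top)  # tie goes to lowest goal
        results.append((row, actual[row], answer))
    return results
```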
| BAG: | Combine (by averaging or voting) results by random subsampling (bagging). |
| Usage: | bag trainrows trainfile testrows testfile input-columns output-columns percent iterations sig |
| Example: | bag 100 iris.trn 50 iris.tst 4 3 40 10 1.5 >bag.log |
| Note: | Script uses tree method; other methods can be substituted. |
| Output consists of performance of individual trees and combined results. | |
| Samples with replacement. | |
| Sig defaults to 0 if unspecified (sig=0 ==> no pruning). | |
| Java class: a file may be specified for saving trees generated in each iteration | |
| (script) |
| BOOST: | Combine (by averaging or voting) results by adaptive subsampling (boosting). |
| Usage: | boost trainrows trainfile testrows testfile input-columns output-columns percent iterations sig |
| Example: | boost 100 iris.trn 50 iris.tst 4 3 40 10 1 >boost.log |
| Note: | Script uses tree method; other methods can be substituted. |
| Output consists of performance of individual trees and combined results. | |
| Samples with replacement. | |
| Sig defaults to 1.5 if unspecified (specify sig=0 for no pruning). | |
| Java class: a file may be specified for saving trees generated in each iteration | |
| (script) |
| LINEAR: | Generate a linear solution. |
| Usage: | linear rows input-columns output-columns <input-data >weights |
| Example: | linear 150 4 3 <iris >iris.wts |
| Note: | Least squares regression if output-columns=1; linear discriminant if output-columns>1. |
| TESTLINE: | Apply a linear solution to data. |
| Usage: | testline rows input-columns output-columns weightfile <input-data >results |
| Example: | testline 150 4 3 iris.wts <iris >iris.res |
| Note: | Least squares regression if output-columns=1; linear discriminant if output-columns>1. |
| Case-by-case output: row#,actual-answer,predicted-answer. |
| NNET: | Generate a neural net solution. |
| Usage: | nnet rows input-columns output-columns weightfile hidden-units test-rows testfile <input-data |
| Example: | nnet 150 4 3 irisnn.wts 3 <iris.nor >nnet.log |
| Note: | Input data must be normalized to within -1 to 1. |
| Hidden-units defaults to a value that gives 2 cases per weight if unspecified. | |
| Test-rows and testfile default to training data if not specified. | |
| Weights saved incrementally after each iteration. | |
| Proprietary optimization procedure for single hidden layer. | |
| Trains well with many hidden units. | |
| Computationally expensive; can run for days with big data. |
| TESTNET: | Apply a neural net solution to data. |
| Usage: | testnet rows input-columns output-columns weightfile <input-data >results |
| Example: | testnet 150 4 3 irisnn.wts <iris.nor >iris.re1 |
| Note: | Case-by-case output: row#,actual-answer,predicted-answer. |
| TESTNEAR: | Apply a k-nearest neighbor solution to data. |
| Usage: | testnear rows input-columns output-columns k #train trainfile <input-data >results |
| Example: | testnear 50 4 3 1 100 iris.trn <iris.tst >iris.re2 |
| Note: | Case-by-case output: row#,actual-answer,predicted-answer. |
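For readers unfamiliar with the method, a k-nearest neighbor prediction can be sketched in Python as below. Euclidean distance and resolving tied votes in favor of the lowest numbered goal are assumptions, as is the name `knn_predict`; the summary does not state which distance or tie rule TESTNEAR uses.

```python
def knn_predict(train, query, k=1):
    """train: list of (features, label). Returns the majority label of the k nearest rows."""
    by_distance = sorted(
        train, key=lambda t: sum((a - b) ** 2 for a, b in zip(t[0], query))
    )
    votes = [label for _, label in by_distance[:k]]
    best = max(votes.count(v) for v in votes)
    return min(v for v in votes if votes.count(v) == best)  # assumed tie rule
```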
| LOGIC: | Display logic solution (tree or rules). |
| Usage: | logic type output-columns solution >output |
| Example: | logic r 3 iris.rul >rules.pp |
| Note: | Type=t ==> tree solution; type=r ==> rule solution. |
| DTREE: | Generate a full decision tree solution. |
| Usage: | dtree rows input-columns output-columns <input-data >cover-tree |
| Example: | dtree 150 4 3 <iris >tree.np |
| Note: | Output-columns=1 ==> regression; output-columns>1 ==> classification. |
| Tree format: node,feature,value,left-node,right-node,node-label,#cases,errors. | |
| Use logic routine to display tree. |
| PRUNE: | Prune a tree. |
| Usage: | prune input-tree output-columns signif-level >pruned-tree |
| Example: | prune tree.np 3 1.65 >tree.165 |
| TESTTREE: | Apply a tree solution to data. |
| Usage: | testtree rows input-columns output-columns tree <input-data >results |
| Example: | testtree 150 4 3 tree.165 <iris >iris.re3 |
| Note: | Case-by-case output: row#,actual-answer,predicted-answer. |
| SETREE: | Obtain the min-error and x-SE solution trees (x is user-specified). |
| Usage: | setree trainrows trainfile testrows testfile input-columns output-columns treefile se |
| Example: | setree 100 iris.trn 50 iris.tst 4 3 tree.np 1 >setree.log |
| Note: | Output consists of performance of individual pruned trees. |
| The full tree is saved in treefile. | |
| Se=0 gives the min-error solution tree. | |
| (script) |
| DRULE: | Generate decision rules. |
| Usage: | drule flag tree rows input-columns output-columns rulefile <input-data |
| Example: | drule -O tree.165 150 4 3 iris.rul <iris >rul.log |
| Note: | Converts a (pruned) tree to decision rules and stores rules in rulefile. |
| The optional flag -O turns on swap-based optimizations to reduce classification error. | |
| Optimization is computationally expensive and should be applied to a tree size smaller than the minimum error tree. | |
| Rules are evaluated in order. Answer of first satisfied rule is selected. | |
| Use logic routine to display rules. |
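The ordered evaluation described above, in which the answer of the first satisfied rule is selected, can be sketched in Python. The rule representation used here (a list of feature, comparison, value conditions plus an answer) is hypothetical and is not the EDM rulefile format; the thresholds in the usage note are illustrative only.

```python
import operator

OPS = {"<=": operator.le, ">": operator.gt}

def apply_rules(rules, case, default=None):
    """rules: ordered list of (conditions, answer); conditions are (feature, op, value)."""
    for conditions, answer in rules:
        # A rule fires when all of its conditions hold; an empty list always fires.
        if all(OPS[op](case[feature], value) for feature, op, value in conditions):
            return answer
    return default
```

For instance, with `rules = [([(2, "<=", 2.45)], 1), ([(3, "<=", 1.75)], 2), ([], 3)]`, a case is assigned goal 1 if its third feature is at most 2.45, otherwise goal 2 if its fourth feature is at most 1.75, and goal 3 in all remaining cases.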
| TESTRULE: | Apply a rule-based solution to data. |
| Usage: | testrule rows input-columns output-columns rulefile <input-data >results |
| Example: | testrule 150 4 3 iris.rul <iris >iris.re4 |
| Note: | Case-by-case output: row#,actual-answer,predicted-answer. |
| ASSRULE: | Generate association rules. |
| Usage: | assrule treefile pmin covmin lenmax class rows input-columns output-columns <input-data >rules |
| Example: | assrule tree.np 0.8 0.3 17 3 150 4 3 <iris >cls3.asr |
| Note: | Rules are generated only for the target goal given by class. |
| Pmin = minimum predictive value of generated rules. | |
| Covmin = minimum coverage of generated rules. | |
| Lenmax = maximum length of generated rules. |
| SUMMARY: | Generate (html) tables summarizing error for cases, classes, and intervals. |
| Usage: | summary rows output-columns <results >tables |
| Example: | summary 150 3 <iris.res >tables.html |
| Note: | Results are the output of testtree, testrule, testline, testnet, testnear, avevote. |
| Output-columns are the number of goals in the original data file. | |
| Tables must be viewed within a browser. |
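The kind of per-class error figures that SUMMARY tabulates can be computed from the row#,actual,predicted triples with a few lines of Python. This sketch covers classification results only and returns plain numbers rather than html tables; the name `error_summary` is hypothetical.

```python
from collections import Counter

def error_summary(results):
    """results: (row#, actual, predicted) triples. Returns (overall error, error per class)."""
    cases, errors = Counter(), Counter()
    for _, actual, predicted in results:
        cases[actual] += 1
        if predicted != actual:
            errors[actual] += 1
    per_class = {c: errors[c] / cases[c] for c in sorted(cases)}
    overall = sum(errors.values()) / sum(cases.values())
    return overall, per_class
```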