  ============================================================================
  This is the UCI Repository Of Machine Learning Databases and Domain Theories
                             4 December 1995
              ftp.ics.uci.edu: pub/machine-learning-databases
	     http://www.ics.uci.edu/~mlearn/MLRepository.html
          Librarian: Patrick M. Murphy (ml-repository@ics.uci.edu)
                  111 databases and domain theories (36MB)
  ============================================================================

This directory contains data sets and domain theories (the latter have been
annotated as such in the following brief listing) that have been or can be
used to evaluate learning algorithms. Each data file (*.data) contains
individual records described in terms of attribute-value pairs.  The
corresponding *.info file contains voluminous documentation.  (Some files
_generate_ databases; they do not have *.data files.)

In addition to data sets and domain theories, the "utilities/" directory
contains utilities that you may find useful when using datasets in this
repository.

The contents of this repository can be viewed and remotely copied over
the web.  The address is http://www.ics.uci.edu/~mlearn/MLRepository.html.  
Alternatively, the contents of this repository can be remotely copied via 
ftp to ftp.ics.uci.edu.  Enter "anonymous" for user id, and e-mail address 
(user@host) for password.  These databases can be found by executing 
"cd pub/machine-learning-databases".

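The ftp steps above can be scripted.  The following is a minimal sketch
using Python's standard ftplib module; it is an illustration, not a tool
shipped with this repository.  The host and directory are the ones named
in this file, the function names and the e-mail address are hypothetical,
and the server may not be reachable from your site.

```python
from ftplib import FTP

HOST = "ftp.ics.uci.edu"                      # host named in this README
REPO_DIR = "pub/machine-learning-databases"   # directory named in this README


def list_repository(ftp, email="user@host"):
    """Log in anonymously and return the names in the repository directory.

    `ftp` is any object with the ftplib.FTP interface (login, cwd, nlst).
    The e-mail address is sent as the password, as described above.
    """
    ftp.login(user="anonymous", passwd=email)
    ftp.cwd(REPO_DIR)
    return ftp.nlst()


def fetch_file(ftp, remote_name, local_path):
    """Download one file (binary mode) from the current remote directory."""
    with open(local_path, "wb") as out:
        ftp.retrbinary("RETR " + remote_name, out.write)
```

A typical session would open the connection with "with FTP(HOST) as ftp:",
call list_repository(ftp, "you@your.site"), and then call fetch_file for
each file of interest.
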
Notes:
 1. We're always looking for additional databases, which can be
    written to the sub-directory named "/incoming". Please send yours, with 
    documentation.  Thanks -- See DOC-REQUIREMENTS for suggested documentation 
    procedures. Presently, most databases have the following format: 1 
    instance per line, no spaces, commas separate attribute values, and 
    missing values are denoted by "?".  Also, please notify the site librarian 
    (ml-repository@ics.uci.edu) after making a donation.

 2. Ivan Bratko requested that the databases he donated from the Ljubljana
    Oncology Institute (e.g., breast-cancer, lymphography, and primary-tumor)
    have restricted access. We are allowed to share them with academic
    institutions upon request. These databases (like several others) require
    that proper citations be made in published articles that use them.
    Citation requirements are in each database's corresponding *.doc file.
    To access any of these databases, send email to ml-repository@ics.uci.edu.
    To aid you in deciding if you want any of these databases, the 
    documentation files are available.

 3. An archive server may now be used to receive files in this repository
    via e-mail.  Installed on ics, it provides email access to files in
    our anonymous ftp/uucp area (~ftp).  If people have no other access to
    our archives, then they can send mail to:

	archive-server@ics.uci.edu

    Commands to the server may be given in the body.  Some commands are:

	help
	send <archive> <file>
	find <archive> <string>

    The help command replies with a useful help message.
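
The common data format described in Note 1 (one instance per line, commas
between attribute values, "?" for a missing value) can be read with a few
lines of Python.  This is a minimal sketch under that assumption; some
databases here use other formats, so check the documentation file first.
The function name and the sample values are made up for illustration.

```python
import csv
import io


def read_instances(lines, missing="?"):
    """Parse comma-separated instances, one per line.

    Returns a list of rows.  Each row is a list of strings, with None
    wherever the file used the missing-value marker.
    """
    rows = []
    for fields in csv.reader(lines):
        if not fields:              # skip blank lines
            continue
        rows.append([None if f.strip() == missing else f.strip()
                     for f in fields])
    return rows


# Made-up sample in the described format (not taken from a real database).
sample = io.StringIO("5.1,3.5,?,a\n4.9,?,1.4,b\n")
rows = read_instances(sample)
```

All values are kept as strings because attribute types differ between
databases; convert columns to numbers only after consulting the
documentation for the database in question.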

If you publish material based on databases obtained from this repository,
then, in your acknowledgements, please note the assistance you received by
using this repository.  Thanks -- this will help others to obtain the same
data sets and replicate your experiments.  We suggest the following pseudo-APA
reference format for referring to this repository (LaTeX'd):

  Murphy,~P.~M., \& Aha,~D.~W. (1994). {\it UCI Repository of machine
  learning databases} [http://www.ics.uci.edu/~mlearn/MLRepository.html]. 
  Irvine, CA: University of California, Department of Information and Computer 
  Science.

Patrick M. Murphy (Repository Librarian)
     
----------------------------------------------------------------------
Brief Overview of Databases and Domain Theories:

Quick Listing:
 1. annealing (David Sterling and Wray Buntine)
 2. Artificial Characters Database & DT (donated by Attilio Giordana)
 3-4. audiology (Ray Bareiss and Bruce Porter, used in Protos)
    1. Original Version
    2. Standardized-Attribute Version of the Original.
 5. auto-mpg (from CMU StatLib library)
 6. autos (Jeff Schlimmer)
 7. badges (Haym Hirsh)
 8. balance-scale (Tim Hume)
 9. balloons (Michael Pazzani)
 10. breast-cancer (Ljubljana Institute of Oncology, restricted access)
 11. breast-cancer-wisconsin (Wisconsin Breast Cancer D'base, Olvi Mangasarian)
   1. Original version
   2. Diagnostic data set
   3. Prognostic data set
 12. bridges (Yoram Reich)
 13-21. chess
   1. Partial generator of Quinlan's chess-end-game data (kr-vs-kn) (Schlimmer)
   2. Shapiro's endgame database (kr-vs-kp) (Rob Holte)
   3. king-rook-vs-king (Michael Bain, Arthur van Hoff)
   4-9. Six domain theories (Nick Flann)
 22. Bach Chorales (time-series) database (Darrell Conklin)
 23. Connect-4 Database (John Tromp)
 24-25. Credit Screening Database
   1. Japanese Credit Screening Data and domain theory (Chiharu Sano)
   2. Credit Card Application Approval Database (Ross Quinlan)
 26. Ein-Dor and Feldmesser's cpu-performance database (David Aha)
 27. Diabetes Data (Serdar Uckun, AIM-94)
 28. dgp-2 data generation program (Powell Benedict)
 29. Document Understanding (Donato Malerba)
 30. Nine small EBL domain theories and examples in sub-directory ebl
 31. Evlin Kinney's echocardiogram database (Steven Salzberg)
 32. flags (Richard Forsyth)
 33. function-finding (Cullen Schafer's 352 case studies)
 34. glass (Vina Spiehler)
 35. hayes-roth (from Hayes-Roth^2's paper)
 36-39. heart-disease (Robert Detrano)
 40. hepatitis (G. Gong)
 41. horse colic database (Mary McLeish & Matt Cecile)
 42. (Boston) Housing database (from CMU StatLib library)
 43. ICU data (Serdar Uckun, AIM-94)
 44. Image segmentation database (Carla Brodley)
 45. ionosphere information (Vince Sigillito) 
 46. iris (R.A. Fisher, 1936)
 47. isolet (Ron Cole and Mark Fanty's database donated by Tom Dietterich)
 48. kinship (J. Ross Quinlan)
 49. labor-negotiations (Stan Matwin)
 50-51. led-display-creator (from the CART book)
 52. lenses (Cendrowska's database donated by Benoit Julien)
 53. letter-recognition database (created and donated by David Slate)
 54. liver-disorders (BUPA Medical's database donated by Richard Forsyth)
 55. logic-theorist (Paul O'Rorke)
 56. lung cancer (Stefan Aeberhard)
 57. lymphography (Ljubljana Institute of Oncology, restricted access)
 58-59. mechanical-analysis (Francesco Bergadano)
  1. Original Mechanical Analysis Data Set
  2. PUMPS DATA SET
 60. mobile robots (donated by Klingspor, Morik and Rieger)
 61-64. molecular-biology 
     1. promoter sequences (Towell, Shavlik, & Noordewier, domain theory also)
     2. splice-junction sequences (Towell, Noordewier, & Shavlik, 
        domain theory also)
     3. protein secondary structure database (Qian and Sejnowski)
     4. protein secondary structure domain theory (Jude Shavlik & Rich Maclin)
 65. MONK's Problems (donated by Sebastian Thrun)
 66. Moral Reasoner Database (donated by James Wogulis)
 67. mushroom (Jeff Schlimmer)
 68. MUSK databases (2) (donated by Tom Dietterich)
 69. othello domain theory (Tom Fawcett)
 70. Page Blocks Classification (Donato Malerba)
 71. Pima Indians diabetes diagnoses (Vince Sigillito) 
 72. Postoperative Patient data (Jerzy W. Grzymala-Busse)
 73. Primary Tumor (Ljubljana Institute of Oncology, restricted access)
 74. Qualitative Structure Activity Relationships (QSARs) (Ross King)
 75. Quadruped Animals (John H. Gennari)
 76. Servo data (Ross Quinlan)
 77. shuttle-landing-control (Bojan Cestnik)
 78. solar flare (Gary Bradshaw)
 79-80. soybean (from Ryszard Michalski's groups)
 81. space shuttle databases (David Draper)
 82. spectrometer (Infra-Red Astronomy Satellite Project Database, John Stutz)
 83. Sponge Database (Iosune Uriz and Marta Domingo)
 84. Statlog Project databases (7) (from Ross King,...)
 85. Student Loan relational database (from Michael Pazzani)
 86. tic-tac-toe endgame database (Turing Institute, David W. Aha)
 87-97. thyroid-disease (Garavan Institute, J. Ross Quinlan; Stefan Aeberhard)
 98. trains database (David Aha & Eric Bloedorn)
 99-104. Undocumented databases: sub-directory undocumented
   1. Economic sanctions database (domain theory included, Mike Pazzani)
   2. Cloud cover images (Philippe Collard)
   3. DNA secondary structure (Qian and Sejnowski, donated by Vince Sigillito) 
   4. Nettalk data (Sejnowski and Rosenberg, taken from connectionist-bench)
   5. Sonar data (Gorman and Sejnowski, taken from connectionist-bench)
   6. Vowel data (Qian, Sejnowski and Turney, taken from connectionist-bench)
 105. university (Michael Lebowitz, donated by Steve Souders)
 106. voting-records (Jeff Schlimmer)
 107. water treatment plant data (donated by Javier Bejar and Ulises Cortes)
 108-109. Waveform domain (taken from CART book)
 110. Wine Recognition Database (donated by Stefan Aeberhard)
 111. Zoological database (Richard Forsyth)

Quick Summaries of Each Database:
1. Annealing data (unknown source)
   -- Documentation: On everything except database statistics
   -- Background information on this database: unknown
   -- Many missing attribute values

2. Artificial Characters Database & DT
   -- artificially generated using a first order theory (which 
      describes the structure of ten capital letters) and a random 
      choice theorem prover.
   -- Domain Theory included.

3-4. Audiology data
   1. Original Version (Baylor College)
      -- Documentation: On everything except database statistics
      -- Non-standardized attributes (differs between instances)
      -- All attributes are nominally-valued
   2. Standard Attribute Version of the original
      -- A standard set of attributes has been defined in terms of the
         original properties according to a well defined set of rules
         described in the documentation files.
      -- 70 nominally-valued attributes
      -- Some missing attributes

5. Auto-Mpg data (revised from CMU StatLib library)
   -- data concerns city-cycle fuel consumption
   -- Continuously valued class attribute (mpg)
   -- 398 instances, 5 numeric attributes.

6. Automobile data (1985 Ward's Automotive Yearbook)
   -- Documentation: On everything except statistics and class distribution
   -- Good mix of numeric and nominal-valued attributes
   -- More than 1 attribute can be used as a class attribute in this database

7. badges (Haym Hirsh)
   -- 294 instances, 2 classes.
   -- Instances are described using a sequence of characters (a name)
   -- Badge problem generated for attendees to figure out at MLC94

8. Balance Scale (Tim Hume)
   -- 625 instances, 4 numeric attributes
   -- 3 classes (tip right, tip left, balanced)
   -- no missing values

9. Balloons database (Michael Pazzani)
   -- Previously used in cognitive psychology experiment
   -- 16 instances, 2 classes, 4 attributes
   -- No missing values

10. Breast cancer database (Ljubljana Oncology Institute)
   -- Documentation: On everything except database statistics
   -- Well-used database
   -- 286 instances, 2 classes, 9 attributes + the class attribute

11. Wisconsin Breast cancer databases
   1. original dataset (donated by Olvi Mangasarian)
      -- Located in breast-cancer-wisconsin sub-directory, root
         filename: breast-cancer-wisconsin
      -- Currently contains 699 instances
      -- 2 classes (malignant and benign)
      -- 9 integer-valued attributes
   2. prognostic data set (donated by Nick Street)
      -- Located in breast-cancer-wisconsin sub-directory, root
         filename: wpbc
      -- 198 instances
      -- Two learning tasks: 2 class prediction, or time to
         recur/heal
      -- 30 numeric attributes
   3. diagnostic data set (donated by Nick Street)
      -- Located in breast-cancer-wisconsin sub-directory, root
         filename: wdbc
      -- 569 instances
      -- 2 classes (malignant and benign)
      -- 30 numeric attributes

12. Pittsburgh Bridges Database (donated by Yoram Reich)
    -- Topic: design knowledge
    -- 108 instances, 13 attributes (7 specifications, 5 design description, 
       and 1 identifier)
    -- 2 versions of the data: original and numeric-discretized

13-21. Chess
     1. king-rook-vs-king-knight
        -- Documentation: limited (nothing on class distribution, statistics)
        -- This concerns king-knight versus king-rook end games
        -- The database creator is coded in Common Lisp
     2. king-rook-vs-king-pawn
        -- Documentation: sufficient
        -- This concerns king-rook versus king-pawn end games
        -- Originally described by Alen Shapiro 
     3. king-rook-vs-king (donated/created by Michael Bain, Arthur van Hoff)
        -- 28056 instances, 6 nominal features
        -- 17 classes to determine optimal depth-of-win
     4-9. Six domain theories donated by Nick Flann 
        -- In the "domain-theories" sub-directory
        -- Coded in a dialect of Prolog
        -- They all generate legal moves of chess
        -- I haven't yet touched Nick's documentation on them (See README)

22. Bach Chorales (time-series) database (Darrell Conklin)
    -- Single-line melodies of 100 Bach chorales (originally 4 voices).
    -- Number of Instances: 100 Chorales, each with ~45 events
    -- Number of Attributes: 6 (nominal) per event
    
23. Connect-4 Opening Database (donated/created by John Tromp)
    -- contains all legal 8-ply positions in the game of connect-4 in
       which neither player has won yet, and in which the next move 
       is not forced.
    -- 67557 instances, 42 nominal attributes

24-25. Credit Screening Database
    1. Japanese Credit Screening Database and Domain Theory
       --  Positive instances are people who were granted credit.
       --  The theory was generated by talking to Japanese domain experts
    2. Credit Card Application Approval Database
       -- a good mix of attributes -- continuous, nominal with small numbers
          of values, and nominal with larger numbers of values. 
       -- 690 instances, 15 attributes some with missing values.

26. Computer hardware described in terms of its cycle time, memory size, etc.
   and classified in terms of their relative performance capabilities (CACM
   4/87)   
   -- Documentation: complete
   -- Contains integer-valued concept labels
   -- All attributes are integer-valued

27. AIM-94 Diabetes data
    -- Non-Uniform Data format
    -- Time dependencies

28. The Second Data Generation Program - DGP/2 
   -- Generates instances around peaks and allows for specification of the 
      mean and standard deviations in the normally distributed data.
   -- Generates application domains based on specific parameters: number of 
      features, and proportion of positive to negative examples.
   -- Allows for variations in the number of instances, the range of feature 
      values, the number of peaks, the percent of positive instances desired 
      and a radius around the peaks that these instances fall within.

29. Document Understanding Database (Donato Malerba)
   -- Five concepts, expressed as predicates, to be learned.
   -- multiple predicate learning problem
   -- see .info file for more information

30. Nine simple small EBL domain theories and examples in sub-directory ebl
   1. cup
   2. deductive.assumable (contains three domain theories)
   3. emotion
   4. ice
   5. pople
   6. safe-to-stack
   7. suicide

31. Echocardiogram database (Reed Institute, Miami)
   -- Documentation: sufficient
   -- 13 numeric-valued attributes
   -- Binary classification: patient either alive or dead after survival period

32. Flags database (Collins Gem Guide to Flags, 1986)
    -- 194 instances, mixed numeric- and nominal-valued attributes
    -- Information on countries, colors of flag components, etc.
    -- donated by Richard S. Forsyth, creator of PC/BEAGLE

33. 352 Studies in Function-Finding (donated by Cullen Schafer)
    -- 352 small "databases" (cases) of bivariate numeric data sets
    -- Collected mostly from investigations in physical science
    -- Intention: Evaluation of function-finding algorithms

34. Glass Identification database (USA Forensic Science Service)
    -- Documentation: completed
    -- 6 types of glass 
    -- Defined in terms of their oxide content (e.g., Na, Fe, K)
    -- All attributes are numeric-valued 

35. Hayes-Roth and Hayes-Roth's database
    -- Described in their 1977 paper
    -- Topic: human subjects study

36-39. Heart Disease databases (Sources listed below)
      -- Documentation: extensive, but statistics and missing attribute
         information not yet furnished (perhaps later)
      -- 4 databases: Cleveland, Hungary, Switzerland, and the VA Long Beach
      -- 13 of the 75 attributes were used for prediction in 2 separate 
         tests, each of which achieved approximately 75%-80% classification
         accuracy
      -- The chosen 13 attributes are all continuously valued
      -- Includes cost data donated by Peter Turney

40. Hepatitis database (G.Gong: CMU)
    -- Documentation: incomplete
    -- 155 instances with 20 attributes each; 2 classes
    -- Mostly Boolean or numeric-valued attribute types
    -- Includes cost data donated by Peter Turney

41. Horse Colic database (Mary McLeish & Matt Cecile)
    -- Well documented attributes
    -- 368 instances with 28 attributes (continuous, discrete, and nominal)
    -- 30% missing values

42. (Boston) Housing database (from CMU StatLib library)
    -- concerns housing prices in suburbs of Boston
    -- Continuously valued class attribute (MEDV)
    -- 506 instances, 12 continuous, 1 binary attributes 

43. AIM-94 ICU data (Serdar Uckun)
    -- Deals with ICU treatment of patients with Adult respiratory 
       distress syndrome (ARDS)
    -- Complex dataset (see documentation)

44. Image segmentation database (Carla Brodley: UMass)
    -- Documentation status: Skimpy
    -- Not previously used in the ml literature as of 8/1991
    -- Image data described by high-level numeric-valued attributes, 7 classes
  
45. Ionosphere database (V. Sigillito)
   -- Documentation Complete
   -- 2 classes, 351 instances, 34 numeric attributes, no missing values
   -- Classification of radar returns from the ionosphere

46. Iris Plant database (Fisher, 1936)
   -- Documentation: complete
   -- 3 classes, 4 numeric attributes, 150 instances 
   -- 1 class is linearly separable from the other 2, but the other 2 are
      not linearly separable from each other (simple database)

47. Isolet Spoken Letter Recognition database (Ron Cole and Mark Fanty)
    -- 6238 + 1559 instances, 26 classes (one for each letter)
    -- All attributes are real-valued scaled from -1.0 to 1.0.
    -- No missing values

48. Kinship database (relational, Hinton 1986 & Quinlan 1989)
    -- 24 individuals, 12 relations 
    -- 104 instances derivable 
    -- Case studies have been reported by both authors

49. Labor relations database (Collective Bargaining Review)
    -- Documentation: no statistics
    -- Please see the labor directory for more information

50-51. LED display domains (Classification and Regression Trees book)
    -- Documentation: sufficient, but missing statistical information
    -- All attributes are Boolean-valued
    -- Two versions: 7 and 24 attributes
    -- Optimal Bayes rate known for the 10% probability of noise problem
    -- Several ML researchers have used this domain for testing noise tolerance
    -- We provide here 2 C programs for generating sample databases

52. Lenses: Fitting contact lenses (donated by Benoit Julien)
    -- Small database with few attributes 
    -- attributes are either binary- or ternary-valued
    -- 3 classes: hard contact lenses, soft contact lenses, or neither

53. David Slate's letter recognition database (real)
    -- 20,000 instances (712565 bytes) (.Z available)
    -- 17 attributes: 1 class (letter category) and 16 numeric (integer)
    -- No missing attribute values

54. Liver-disorders
    -- BUPA Medical Research Ltd. database donated by Richard S. Forsyth
    -- 7 numeric-valued attributes
    -- 345 instances (male patients)
    -- Includes cost data donated by Peter Turney

55. Logic-theorist
    -- Paul O'Rorke's work, as described in Machine Learning

56. Lung Cancer database (Donated by Stefan Aeberhard)
    -- 32 instances, 57 Attributes (2 classes)
    -- No Attribute Definitions

57. Lymphography database (Ljubljana Oncology Institute)
    -- Documentation: incomplete
    -- CITATION REQUIREMENT: Please use (see the documentation file)
    -- 148 instances; 19 attributes; 4 classes; no missing data values

58-59. Mechanical analysis (Donated by members of the Universita di Torino)
   1.  -- Fault diagnosis problem of electromechanical devices
       -- ENIGMA system application described in proceedings of MLC-1990
       -- Each of the 209 instances is described by a different set of 
          components
   2.  -- PUMPS DATA SET
       -- Newer version of above dataset with domain theory and results

60. Mobile Robots (Donated by Klingspor, Morik and Rieger)
   -- Learning Concepts from Sensor Data of a Mobile Robot
   -- Relational
   -- Multiple levels of learning (from raw sensor data to high
      level concepts)

61-64. Molecular Biology directory
    1. Promoter gene sequences
       -- Donated by Jude Shavlik; See AAAI-90 Towell, Shavlik, & Noordewier
       -- E. Coli promoter gene sequences (DNA) with partial domain theory
       -- 106 instances, each predictor attribute takes on one of four values
       -- 50% positive instances
    2. Splice-junction gene sequences
       -- Donated by Geoffrey Towell, Noordewier, & Shavlik.
       -- categories "ei" and "ie" include every "split-gene"
          for primates in Genbank 64.1
       -- non-splice examples taken from sequences known not to include
          a splicing site
       -- 3190 instances with classes "ei" (25%), "ie" (25%) and 
          Neither (50%). 
       -- Domain theory included.
     3. Protein Secondary Structure Database 
       -- Originally created and used by Qian and Sejnowski
       -- From CMU connectionist bench repository
       -- Classifies secondary structure of certain globular proteins
       -- 3 classes: alpha-helix, beta-sheet and random-coil.
     4. Protein Secondary Structure Domain Theory 
       -- Donated and created by Jude Shavlik & Rich Maclin
       -- Imperfect domain theory for Qian and Sejnowski Protein
          Secondary Structure database (above)
       -- Closely implements the algorithm of Chou and Fasman

65. MONK's Problems (donated by Sebastian Thrun)
    -- A set of three artificial domains over the same attribute space.
    -- 6 nominally valued attributes, no missing values.
    -- 1 problem has class noise added.
    -- Used to test a wide range of induction algorithms.

66. Moral Reasoner database (donated by James Wogulis)
    -- Horn-clause model that qualitatively simulates moral reasoning.
    -- 202 instances and theory
    -- Theory includes negated literals.

67. Mushrooms in terms of their physical characteristics and classified
    as poisonous or edible (Audubon Society Field Guide)
    -- Documentation: complete, but missing statistical information
    -- All attributes are nominal-valued
    -- Large database: 8124 instances (2480 missing values for attribute #12)

68. MUSK databases (2) (donated by Tom Dietterich)
    -- Task: to classify whether a molecule is a musk
    -- 476 and 6,598 instances, 168 attributes
    -- Was used to explore "multiple instance problem"
    
69. Othello Domain Theory: used in research to generate features for an
    inductive learning system
    -- Written and donated by Tom Fawcett
    -- Coded in Prolog

70. Page Blocks Classification (Donato Malerba)
   -- The problem consists in classifying all the blocks of the page
      layout of a document that has been detected by a segmentation
      process. This is an essential step in document analysis.
   -- 5473 examples come from 54 distinct documents
   -- All attributes are numeric.

71. Pima Indians Diabetes Database (National Institute of Diabetes and
    Digestive and Kidney Diseases)
    -- Binary classes (tested positive or negative for diabetes)
    -- All 8 attributes are numeric-valued 
    -- 768 instances
    -- Includes cost data donated by Peter Turney

72. Postoperative Patient data (Jerzy W. Grzymala-Busse)
    -- 3 classes
    -- 90 instances
    -- 8 attributes, one numeric with missing values

73. Primary Tumor database (Ljubljana Oncology Institute)
    -- Documentation: incomplete
    -- CITATION REQUIREMENT: Please use (see the documentation file)
    -- 339 instances; 18 attributes; 22 classes; lots of missing data values

74. Qualitative Structure Activity Relationships (QSARs) (Ross King)
    -- Two datasets are given: pyrimidines and triazines.
    -- 3 representations: ILP, Propositional Machine Learning Discrimination,
       and Propositional Machine Learning Regression.

75. Quadruped Animals data generator (John H. Gennari)
    -- Structured data; each instance has 9 components, with 9 numeric-valued
       attributes per component
    -- 4 classes
    -- Previously used to evaluate unsupervised learning algorithms

76. Servo data (Ross Quinlan)
    -- numerically valued class attribute
    -- 4 nominal attributes; 167 instances
    -- covers an extremely non-linear phenomenon

77. Shuttle Landing Control database
    -- tiny, 15-instance database with 7 attributes per instance; 2 classes
    -- appears to be well-known in the decision-tree community

78. Solar Flare database (Gary Bradshaw)
    -- 1389 instances, 13 attributes (includes 3 class attributes)
    -- Each class attribute counts the number of solar flares of a 
       certain class that occur in a 24 hour period.
    -- Prediction attributes are nominal; no missing values

79-80. Soybean data (Michalski)
   -- Documentation: Only the statistics are missing
   -- (2 sizes)
   -- Michalski's famous soybean disease databases

81. Challenger USA Space shuttle O-Ring Databases (David Draper)
    -- 2 small 23-instance databases containing only positive integers
    -- fascinating topic: Analysis of launch temperature vs. O-ring stress
    -- task: predict the number of O-rings that experience thermal distress
       on a flight at 31 degrees F given data on the previous 23 shuttle 
       flights.

82. Low resolution spectrometer data (IRAS data -- NASA Ames Research Center)
    -- Documentation: no statistics nor class distribution given
    -- LARGE database...and this is only 531 of the instances
    -- 98 attributes per instance (all numeric)
    -- Contact NASA-Ames Research Center for more information

83. Sponge Database (donated by Javier Bejar and Ulises Cortes)
    -- Classification of atlantic-mediterranean marine sponges.
    -- 76 instances
    -- 45 nominal and numeric attributes (some missing values).

84. Statlog Project databases (from Ross King)
   -- Vehicle Silhouettes: 3D objects within a 2D image by 
      application of an ensemble of shape feature extractors
      to the 2D silhouettes of the objects.
   -- Landsat Satellite: multi-spectral values of pixels in 
      3x3 neighbourhoods in a satellite image, and the 
      classification associated with the central pixel in each 
      neighbourhood.
   -- Shuttle: The shuttle dataset contains 9 attributes all of 
      which are numerical. Approximately 80% of the data belongs 
      to class 1.
   -- Australian Credit Approval: This file concerns credit card 
      applications.  This database exists elsewhere in the repository 
      (Credit Screening Database) in a slightly different form.
   -- Heart Disease: This dataset is a heart disease database similar
      to a database already present in the repository (Heart Disease 
      databases) but in a slightly different form.
   -- Image Segmentation: This dataset is an image segmentation 
      database similar to a database already present in the repository 
      (Image segmentation database) but in a slightly different form.
   -- German Credit Database: This dataset classifies people described 
      by a set of attributes as good or bad credit risks.  Comes in 
      two formats (one all numeric). Also comes with a cost matrix.

85. Student Loan Relational database  & domain theory (from Michael Pazzani)
    -- Target concept: no_payment_due by person for student loan.
    -- 1000 instances of target concept.
    -- Domain Theory
    -- 10+ extensionally and intensionally defined relations.

86. Tic-Tac-Toe Endgame database (David W. Aha, Turing Institute)
    -- Documentation complete as of Summer 1991
    -- 958 instances, all attributes can take on 1 of 3 possible values
    -- Binary classification task (i.e., "win for x")
    -- A paradigmatic domain for constructive induction studies

87-97. Thyroid patient records classified into disjoint disease classes 
       (Garavan Institute)
       -- Documentation: as given by Ross Quinlan
       -- 6 databases from the Garavan Institute in Sydney, Australia
       -- Approximately the following for each database:
          -- 2800 training (data) instances and 972 test instances
          -- plenty of missing data
          -- 29 or so attributes, either Boolean or continuously-valued
       -- 2 additional databases, also from Ross Quinlan, are also here
          -- hypothyroid.data and sick-euthyroid.data
          -- Quinlan believes that these databases have been corrupted
          -- Their format is highly similar to the other databases
       -- 1 more database of 9172 instances that cover 20 classes, and
          a related domain theory
       -- Another thyroid database from Stefan Aeberhard
          -- 3 classes, 215 instances, 5 attributes
          -- no missing values
       -- A Thyroid database suited for training ANNs
          -- 3 classes
          -- 3772 training instances, 3428 testing instances
          -- Includes cost data donated by Peter Turney
          

98. Trains database (by David Aha & Eric Bloedorn)
    -- Original owners: R. Michalski & R. Stepp
    -- 10 instances
    -- 10 attributes + class (direction: east or west)
    -- 2 data formats (structured, one-instance-per-line)
    -- includes "east-west" competition data and results

99-104. Undocumented databases: see the sub-directory named undocumented
   1. Mike Pazzani's economic sanctions database
   2. Philippe Collard's database on cloud cover images
   3. Vince Sigillito's database on dna secondary structure
   4. Nettalk data (see connectionist-bench)
   5. Sonar data (see connectionist-bench)
   6. Vowel data (see connectionist-bench)

105. University data (Lebowitz)
    -- Documentation: scant; we've left it in its original (LISP-readable) form
    -- 285 instances, including some duplicates
    -- At least one attribute, academic-emphasis, can have multiple values
       per instance
    -- The user is encouraged to pursue the Lebowitz reference for more 
       information on the database

106. Congressional voting records classified into Republican or Democrat (1984
    United States Congressional Voting Records)
    -- Documentation: completed
    -- All attributes are Boolean valued; plenty of missing values; 2 classes
    -- Also, there is a 2nd, undocumented database containing 1986 voting 
       records here. (will be)

107. Water Treatment Plant Data (Javier Bejar and Ulises Cortes)
    -- 38 numeric attributes; 527 instances; missing values
    -- Multiple classes predict plant state
    -- "Ill-Structured Domain"

108-109. Waveform data generator (Classification and Regression Trees book)
       -- Documentation: no statistics
       -- CART book's waveform domains
       -- 21 and 40 continuous attributes respectively
       -- difficult concepts to learn, but known Bayes optimal classification
          rate of 86% accuracy

110. Wine Recognition database (donated by Stefan Aeberhard)
    -- Using chemical analysis determine the origin of wines
    -- 13 attributes (all continuous), 3 classes, no missing values
    -- 178 instances

111. Richard Forsyth's zoological database (artificial)
    -- 7 classes of animals 
    -- 17 attributes (besides name), 15 Boolean and 2 numeric-valued
    -- No missing attribute values



