http://tkustat.stat.tku.edu.tw

Classification and Regression Trees (CART)
Data: babies (d)
   

===========
Input data file
===========

R command> library(tkustat)          
R command> data(babies)          
R command> mydata <- babies 

===================
Correlation Matrix of all variables
===================
R command> cor(mydata)

                  bwt   gestation      parity          age       height
bwt        1.00000000  0.40754279 -0.04390817  0.026982911  0.203704177
gestation  0.40754279  1.00000000  0.08091603 -0.053424774  0.070469902
parity    -0.04390817  0.08091603  1.00000000 -0.351040648  0.043543487
age        0.02698291 -0.05342477 -0.35104065  1.000000000 -0.006452846
height     0.20370418  0.07046990  0.04354349 -0.006452846  1.000000000
weight     0.15592327  0.02365494 -0.09636209  0.147322111  0.435287428
smoke     -0.24679951 -0.06026684 -0.00959897 -0.067771942  0.017506595
               weight       smoke
bwt        0.15592327 -0.24679951
gestation  0.02365494 -0.06026684
parity    -0.09636209 -0.00959897
age        0.14732211 -0.06777194
height     0.43528743  0.01750660
weight     1.00000000 -0.06028140
smoke     -0.06028140  1.00000000
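
The bwt row of the matrix is the part that matters for the tree models fitted later on this page. As a rough sketch, the values below are copied by hand from that row (so no data file is needed) and ranked by absolute size; the names r_bwt and ranked are only illustrative.

```r
# Correlations of each predictor with bwt, copied from the bwt row of the
# matrix printed above. Nothing here is recomputed from the data.
r_bwt <- c(gestation = 0.40754279, parity = -0.04390817, age = 0.026982911,
           height = 0.203704177, weight = 0.15592327, smoke = -0.24679951)

# Rank the predictors by the strength (absolute value) of the association.
ranked <- r_bwt[order(abs(r_bwt), decreasing = TRUE)]
print(round(ranked, 3))

# Squared correlation: the share of bwt variance a one-variable
# straight-line fit would explain.
print(round(ranked^2, 3))
```

gestation and smoke come out on top, and these are the only two variables the fitted trees end up using.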

=========================
Paired Scatter Plot
=========================

R command> pairs(mydata)

==========================================
Classification and Regression Trees (CART)
==========================================
Dependent Variable: bwt

Independent Variables: gestation + parity + age + height + weight + smoke


=========================================
Method 1. Use tree function in tree package
=========================================
R command> library(tree)
R command> mydata.ltr <- tree(bwt ~  gestation + parity + age + height + weight + smoke, data=mydata)
R command> mydata.ltr

node), split, n, deviance, yval
      * denotes terminal node

 1) root 1174 394100 119.50  
   2) gestation < 272.5 303 104400 107.80  
     4) gestation < 239.5 24  10210  84.71 *
     5) gestation > 239.5 279  80230 109.80  
      10) smoke < 0.5 152  39900 115.00 *
      11) smoke > 0.5 127  31580 103.70 *
   3) gestation > 272.5 871 234500 123.50  
     6) gestation < 283.5 432  99990 119.40  
      12) smoke < 0.5 256  54920 122.20 *
      13) smoke > 0.5 176  39870 115.20 *
     7) gestation > 283.5 439 119800 127.60  
      14) smoke < 0.5 292  73660 129.80 *
      15) smoke > 0.5 147  41940 123.20 *
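
To read the printout, follow the splits from the root down to a node marked with *; the yval of that node is the predicted bwt. The sketch below hand-codes the seven terminal nodes as nested conditions. predict_bwt is a hypothetical helper written for illustration, not a function of the tree package, and the leaf values are the rounded yval figures shown above.

```r
# Hand-coded version of the 7-leaf regression tree printed above.
# predict_bwt is a hypothetical helper, not part of the tree package.
predict_bwt <- function(gestation, smoke) {
  if (gestation < 239.5) return(84.71)                      # node 4
  if (gestation < 272.5) {                                  # node 5
    return(if (smoke < 0.5) 115.00 else 103.70)             # nodes 10, 11
  }
  if (gestation < 283.5) {                                  # node 6
    return(if (smoke < 0.5) 122.20 else 115.20)             # nodes 12, 13
  }
  if (smoke < 0.5) 129.80 else 123.20                       # nodes 14, 15
}

# A smoker (smoke = 1) with gestation 280 falls in node 13.
print(predict_bwt(gestation = 280, smoke = 1))

# A non-smoker with gestation 290 falls in node 14.
print(predict_bwt(gestation = 290, smoke = 0))
```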

R command> summary(mydata.ltr)


Regression tree:
tree(formula = bwt ~ gestation + parity + age + height + weight + 
    smoke, data = mydata)
Variables actually used in tree construction:
[1] "gestation" "smoke"    
Number of terminal nodes:  7 
Residual mean deviance:  250.3 = 292100 / 1167 
Distribution of residuals:
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
-5.123e+01 -1.078e+01 -2.357e-01  1.816e-16  9.830e+00  5.177e+01 
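
The residual mean deviance line can be checked by hand: the numerator is the sum of the deviances of the seven terminal nodes, and the denominator is the number of observations minus the number of terminal nodes. The sketch below uses the rounded node deviances from the printout above, so the total agrees with 292100 only to the printed precision; the variable names are illustrative.

```r
# Deviances of the terminal nodes (4, 10, 11, 12, 13, 14, 15), as printed.
leaf_dev <- c(10210, 39900, 31580, 54920, 39870, 73660, 41940)

n        <- 1174                  # observations at the root
n_leaves <- length(leaf_dev)      # 7 terminal nodes

total_dev <- sum(leaf_dev)        # residual deviance of the whole tree
df        <- n - n_leaves         # residual degrees of freedom
rmd       <- total_dev / df       # residual mean deviance

print(c(total_dev = total_dev, df = df, rmd = round(rmd, 1)))
```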

________________________________
Tree Plot 1.(from tree package)
________________________________

R command> plot(mydata.ltr);  text(mydata.ltr)

==================================================
Method 2. Use rpart function in rpart package
==================================================
R command> library(rpart)


Attaching package `rpart':


	The following object(s) are masked from package:tree :

	 descendants node.match tree.depth 


R command> mydata.ltr <- rpart(bwt ~  gestation + parity + age + height + weight + smoke, data=mydata)
R command> mydata.ltr

n= 1174 

node), split, n, deviance, yval
      * denotes terminal node

 1) root 1174 394057.90 119.46250  
   2) gestation< 272.5 303 104389.70 107.84490  
     4) gestation< 239.5 24  10208.96  84.70833 *
     5) gestation>=239.5 279  80228.42 109.83510  
      10) smoke>=0.5 127  31578.22 103.70870 *
      11) smoke< 0.5 152  39900.68 114.95390 *
   3) gestation>=272.5 871 234545.70 123.50400  
     6) gestation< 283.5 432  99992.52 119.35190  
      12) smoke>=0.5 176  39874.89 115.17050 *
      13) smoke< 0.5 256  54924.86 122.22660 *
     7) gestation>=283.5 439 119776.20 127.59000  
      14) smoke>=0.5 147  41939.18 123.24490 *
      15) smoke< 0.5 292  73664.53 129.77740 *
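
In a regression tree the yval of a node is the mean of bwt over the observations in that node, so the value at a parent equals the size-weighted average of its two children. A quick check using the numbers printed above (node_mean is a hypothetical helper name):

```r
# Weighted mean of child node means, with the child sizes as weights.
node_mean <- function(n_child, y_child) sum(n_child * y_child) / sum(n_child)

# Node 2 from its children, nodes 4 and 5.
m2 <- node_mean(c(24, 279), c(84.70833, 109.83510))

# Root (node 1) from its children, nodes 2 and 3.
m1 <- node_mean(c(303, 871), c(107.84490, 123.50400))

print(round(c(node2 = m2, root = m1), 4))
```

Both results agree with the printed yval of node 2 and of the root up to rounding.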

R command> summary(mydata.ltr)

Call:
rpart(formula = bwt ~ gestation + parity + age + height + weight + 
    smoke, data = mydata)
  n= 1174 

          CP nsplit rel error    xerror       xstd
1 0.13988404      0 1.0000000 1.0013437 0.04559561
2 0.03749962      1 0.8601160 0.8835265 0.03843883
3 0.03540682      2 0.8226163 0.8701648 0.03801808
4 0.02220364      3 0.7872095 0.8253178 0.03565979
5 0.01317769      4 0.7650059 0.7872582 0.03401945
6 0.01058850      5 0.7518282 0.7874725 0.03461443
7 0.01000000      6 0.7412397 0.7827675 0.03458011
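
The columns of this table are linked: each rel error is 1 minus the sum of the CP values in the rows above it, and rel error times the root deviance gives the residual deviance of the tree with that many splits. The sketch below recomputes both from the printed figures. The last CP value, 0.01, is the default stopping threshold of rpart rather than the gain of a split, so it is not subtracted.

```r
# CP and rel error columns, copied from the table above.
cp <- c(0.13988404, 0.03749962, 0.03540682, 0.02220364,
        0.01317769, 0.01058850, 0.01000000)
rel_error_printed <- c(1.0000000, 0.8601160, 0.8226163, 0.7872095,
                       0.7650059, 0.7518282, 0.7412397)

# rel error in row k = 1 - (sum of CP in rows 1 .. k-1).
rel_error <- 1 - cumsum(c(0, cp[-length(cp)]))
print(max(abs(rel_error - rel_error_printed)))   # only rounding noise

# Residual deviance of the final 6-split tree; root deviance is from the
# node listing (394057.90).
root_dev  <- 394057.90
final_dev <- root_dev * rel_error[7]
print(round(final_dev))

# Share of the root deviance removed by the 6 splits.
print(round(1 - rel_error[7], 4))
```

The final deviance is about 292100, matching the numerator of the residual mean deviance reported for Method 1.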

Node number 1: 1174 observations,    complexity param=0.139884
  mean=119.4625, MSE=335.654 
  left son=2 (303 obs) right son=3 (871 obs)
  Primary splits:
      gestation < 272.5 to the left,  improve=0.139884000, (0 missing)
      smoke     < 0.5   to the right, improve=0.060910000, (0 missing)
      weight    < 115.5 to the left,  improve=0.031685720, (0 missing)
      height    < 64.5  to the left,  improve=0.027718710, (0 missing)
      age       < 26.5  to the left,  improve=0.002375203, (0 missing)
  Surrogate splits:
      height < 57.5  to the left,  agree=0.744, adj=0.007, (0 split)
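
The MSE and improve figures for node 1 are consistent with two simple ratios: MSE is the node deviance divided by the node size, and improve for the chosen split is the fraction of the node deviance removed by splitting. A sketch using the deviances from the node listing above (variable names are illustrative):

```r
root_dev  <- 394057.90              # deviance of node 1
n_root    <- 1174                   # observations in node 1
child_dev <- c(104389.70, 234545.70)  # deviances of nodes 2 and 3

# Mean squared error within the node.
mse <- root_dev / n_root
print(round(mse, 3))

# Fraction of the node deviance removed by the split gestation < 272.5.
improve <- 1 - sum(child_dev) / root_dev
print(round(improve, 6))
```

Because node 1 is the root, its improve value is also its complexity param (0.139884).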

Node number 2: 303 observations,    complexity param=0.03540682
  mean=107.8449, MSE=344.5205 
  left son=4 (24 obs) right son=5 (279 obs)
  Primary splits:
      gestation < 239.5 to the left,  improve=0.133656200, (0 missing)
      smoke     < 0.5   to the right, improve=0.084036650, (0 missing)
      weight    < 159.5 to the left,  improve=0.041531200, (0 missing)
      height    < 66.5  to the left,  improve=0.031843840, (0 missing)
      age       < 25.5  to the left,  improve=0.009166512, (0 missing)
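
Below the root, improve and complexity param differ by a scale factor: improve is relative to the deviance of the node being split, while the complexity param is the same deviance reduction relative to the root. The sketch below reproduces both for node 2 from the printed deviances, and the complexity param of node 3 for comparison; split_gain is a hypothetical helper.

```r
root_dev <- 394057.90   # deviance of node 1

# Deviance reduction of a split, relative to the node and to the root.
split_gain <- function(node_dev, child_dev) {
  drop <- node_dev - sum(child_dev)
  c(improve = drop / node_dev, cp = drop / root_dev)
}

# Node 2 splits into nodes 4 and 5.
g2 <- split_gain(104389.70, c(10208.96, 80228.42))

# Node 3 splits into nodes 6 and 7.
g3 <- split_gain(234545.70, c(99992.52, 119776.20))

print(round(rbind(node2 = g2, node3 = g3), 6))
```

Node 2 has the larger improve, but node 3 removes more deviance in absolute terms, which is why its complexity param (0.0375) is listed before that of node 2 (0.0354) in the CP table.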

Node number 3: 871 observations,    complexity param=0.03749962
  mean=123.504, MSE=269.2833 
  left son=6 (432 obs) right son=7 (439 obs)
  Primary splits:
      gestation < 283.5 to the left,  improve=0.063002730, (0 missing)
      smoke     < 0.5   to the right, improve=0.047507690, (0 missing)
      weight    < 118.5 to the left,  improve=0.029589520, (0 missing)
      height    < 63.5  to the left,  improve=0.028615770, (0 missing)
      parity    < 0.5   to the right, improve=0.005835875, (0 missing)
  Surrogate splits:
      height < 63.5  to the left,  agree=0.549, adj=0.090, (0 split)
      smoke  < 0.5   to the right, agree=0.537, adj=0.067, (0 split)
      parity < 0.5   to the left,  agree=0.532, adj=0.056, (0 split)
      weight < 120.5 to the left,  agree=0.526, adj=0.044, (0 split)
      age    < 22.5  to the right, agree=0.519, adj=0.030, (0 split)
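
A surrogate split stands in for the primary split when the primary variable is missing. Here agree is the fraction of observations the surrogate sends the same way as the primary split, and adj rescales it against the baseline of sending everything to the larger child. The sketch below reproduces the printed values for two height surrogates. The agreement counts 478 and 873 are not printed; they are inferred from the rounded agree values and should be read as assumptions. adj_agreement is a hypothetical helper.

```r
# agree = share of observations routed the same way as the primary split;
# adj   = improvement over the "always go with the majority" baseline.
adj_agreement <- function(n_agree, n_left, n_right) {
  n        <- n_left + n_right
  majority <- max(n_left, n_right)
  c(agree = n_agree / n, adj = (n_agree - majority) / (n - majority))
}

# Node 3 (432 left, 439 right), surrogate height < 63.5.
# 478 agreeing observations is an inferred count (0.549 * 871 is about 478).
s3 <- adj_agreement(478, 432, 439)

# Node 1 (303 left, 871 right), surrogate height < 57.5.
# 873 agreeing observations is an inferred count (0.744 * 1174 is about 873).
s1 <- adj_agreement(873, 303, 871)

print(round(rbind(node3 = s3, node1 = s1), 3))
```

The "(0 split)" at the end of each surrogate line means no observation was actually routed by that surrogate, in line with the "(0 missing)" counts on the primary splits.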

Node number 4: 24 observations
  mean=84.70833, MSE=425.3733 

Node number 5: 279 observations,    complexity param=0.02220364
  mean=109.8351, MSE=287.557 
  left son=10 (127 obs) right son=11 (152 obs)
  Primary splits:
      smoke     < 0.5   to the right, improve=0.10905760, (0 missing)
      height    < 66.5  to the left,  improve=0.04819964, (0 missing)
      weight    < 159.5 to the left,  improve=0.03956507, (0 missing)
      gestation < 266.5 to the left,  improve=0.03629326, (0 missing)
      age       < 25.5  to the left,  improve=0.01503832, (0 missing)
  Surrogate splits:
      age       < 21.5  to the left,  agree=0.595, adj=0.110, (0 split)
      weight    < 112.5 to the left,  agree=0.570, adj=0.055, (0 split)
      gestation < 245.5 to the left,  agree=0.556, adj=0.024, (0 split)
      height    < 57.5  to the left,  agree=0.548, adj=0.008, (0 split)

Node number 6: 432 observations,    complexity param=0.01317769
  mean=119.3519, MSE=231.4642 
  left son=12 (176 obs) right son=13 (256 obs)
  Primary splits:
      smoke     < 0.5   to the right, improve=0.05193161, (0 missing)
      weight    < 123.5 to the left,  improve=0.04342411, (0 missing)
      height    < 65.5  to the left,  improve=0.02616183, (0 missing)
      gestation < 278.5 to the left,  improve=0.01589250, (0 missing)
      parity    < 0.5   to the right, improve=0.01288037, (0 missing)
  Surrogate splits:
      weight < 107.5 to the left,  agree=0.606, adj=0.034, (0 split)
      age    < 18.5  to the left,  agree=0.602, adj=0.023, (0 split)
      height < 69.5  to the right, agree=0.595, adj=0.006, (0 split)

Node number 7: 439 observations,    complexity param=0.0105885
  mean=127.59, MSE=272.8387 
  left son=14 (147 obs) right son=15 (292 obs)
  Primary splits:
      smoke     < 0.5   to the right, improve=0.03483565, (0 missing)
      height    < 63.5  to the left,  improve=0.03110009, (0 missing)
      weight    < 108.5 to the left,  improve=0.02421865, (0 missing)
      age       < 29.5  to the left,  improve=0.01153499, (0 missing)
      gestation < 304.5 to the right, improve=0.01023850, (0 missing)
  Surrogate splits:
      height    < 69.5  to the right, agree=0.672, adj=0.020, (0 split)
      gestation < 323.5 to the right, agree=0.670, adj=0.014, (0 split)
      age       < 40.5  to the right, agree=0.667, adj=0.007, (0 split)

Node number 10: 127 observations
  mean=103.7087, MSE=248.6474 

Node number 11: 152 observations
  mean=114.9539, MSE=262.5045 

Node number 12: 176 observations
  mean=115.1705, MSE=226.5619 

Node number 13: 256 observations
  mean=122.2266, MSE=214.5502 

Node number 14: 147 observations
  mean=123.2449, MSE=285.3006 

Node number 15: 292 observations
  mean=129.7774, MSE=252.2758 


_________________________________
Tree Plot 2.(from rpart package)
_________________________________

R command> plot(mydata.ltr);  text(mydata.ltr)