TABLE 1.2

Reading achievement scores for seventh graders

Class        Number of students    Percent
2.0-2.9                       9       0.95
3.0-3.9                      28       2.96
4.0-4.9                      59       6.23
5.0-5.9                     165      17.42
6.0-6.9                     244      25.77
7.0-7.9                     206      21.75
8.0-8.9                     146      15.42
9.0-9.9                      60       6.34
10.0-10.9                    24       2.53
11.0-11.9                     5       0.53
12.0-12.9                     1       0.11
Total                       947     100.01

 

A Puzzle with A Cute Solution

 Suppose you have a table of data like the one given above. You have enough information to construct a histogram.

But what are the mean and standard deviation of the data?

How do you get around only having a table summary?

Answer: Treat the data in class 2.0-2.9 as 9 copies of the mid-class value 2.5.

Treat the data in class 3.0-3.9 as 28 copies of 3.5, and so on.

Your 'approximate' data list becomes:

  {2.5, …, 2.5, 3.5, …, 3.5, …, 11.5, …, 11.5, 12.5}

with 2.5 repeated 9 times, 3.5 repeated 28 times, and so on up to 11.5 repeated 5 times and a single 12.5.

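  Formulae:

$$\bar{x} \approx \frac{1}{n}\sum_i f_i m_i, \qquad s \approx \sqrt{\frac{1}{n-1}\sum_i f_i \,(m_i - \bar{x})^2}$$

f_i = number of students in class i

m_i = midpoint value of class i

n = 947 (the total number of students)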

 

Text (page 66) says:

(Pretty close! The approximation works best when the unobserved values in each class are close to the middle of the class.)
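The midpoint trick can be sketched in a few lines of Python. This is only an illustration: the function name `grouped_mean_sd` is made up here, and the divisor n - 1 in the variance is an assumption about which standard deviation formula the text uses.

```python
import math

# Class midpoints and student counts taken from Table 1.2.
midpoints = [2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5, 11.5, 12.5]
counts = [9, 28, 59, 165, 244, 206, 146, 60, 24, 5, 1]


def grouped_mean_sd(midpoints, counts):
    """Approximate mean and standard deviation from a frequency table.

    Every observation in a class is replaced by the class midpoint.
    """
    n = sum(counts)
    mean = sum(f * m for f, m in zip(counts, midpoints)) / n
    # Assumption: sample variance with divisor n - 1.
    variance = sum(f * (m - mean) ** 2 for f, m in zip(counts, midpoints)) / (n - 1)
    return n, mean, math.sqrt(variance)


n, mean, sd = grouped_mean_sd(midpoints, counts)
print(f"n = {n}, approximate mean = {mean:.2f}, approximate sd = {sd:.2f}")
```

Working from the counts directly gives the same answer as building the full list of 947 'approximate' values, without having to write the list out.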

 1.3 Mathematical Modelling and Density Curves

 

Suppose you were given a plot of gas consumption and distance.

 Suppose somebody asked you "To a reasonable degree of accuracy, what do the data show about the relationship between gas consumption and distance?"

 You might say… there is evidence of an increasing or positive relationship.

 You may go so far as to say:

 

 Note: you have to wait until Econ. 3210 to learn how to measure that straight line relationship.

 By summarizing the relationship as a straight line you have given up some detail (i.e. the relationship is not exactly a straight line) but have gotten, in return, a very accurate mathematical model (i.e. a straight line relation) that is easy to understand, explain and work with (e.g. forecast gas use for other values of distance.)

 In a sense, the mathematical model (the straight line) is like an 'idealized' representation of the relationship.

 In this chapter (1.3) we do the same kind of thing for distributions: we construct idealized forms called density curves and spend most of our time looking at a particular density curve called the NORMAL.

 

Density Curves

We can think of a density curve as a smooth idealized version of a relative frequency histogram.

 The density curve is an abstraction (soon we will get a test to see if the abstraction is reasonable). It is often easier to work with a density curve than a histogram or a stem plot. It shows more detail than something like a box plot or a five-number summary, but it is not without its faults: it tends to smooth over irregularities such as outliers.

 Examples:

  1. Figure 1.12: a relative frequency histogram (for the Gary student data) and a smooth density curve.
  2. Figure 1.13: density curves (like histograms) can be symmetric or skewed.
  3. Figure 1.14: the mean of the data is found where the density curve 'balances', just like in the case of the histogram (and the centre of gravity story).

 

  

 

 

 

  

Properties that Density Curves Inherit (from relative frequency histograms) 

  1. The area under the density curve = 1.
  2. x* = 20th percentile: the area to the left of x* is 0.2, which means that 20% of the x values are less than x*. What about the 50th percentile, the mean, and the pth percentile?
  3. Because density curves are 'idealized' we refer to the mean as μ (Greek letter mu) and the standard deviation as σ (sigma).
  4. Note that the density curve sits over the relevant range of the variable whose distribution we are trying to describe, just like a histogram (e.g. the Gary data).
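The area and percentile properties can be tried out on the relative frequency histogram for Table 1.2, which is the thing a density curve idealizes. The sketch below makes two assumptions that are not in the notes: each class "a.0-a.9" is treated as the interval from a to a + 1, and values are spread evenly within a class. The helper name `percentile` is made up for the illustration.

```python
# Lower class edges and counts from Table 1.2.
# Assumption: class "a.0-a.9" is treated as the interval [a, a + 1).
lower_edges = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0]
counts = [9, 28, 59, 165, 244, 206, 146, 60, 24, 5, 1]
width = 1.0

n = sum(counts)
# Bar heights chosen so that each bar's area equals its relative frequency.
heights = [f / (n * width) for f in counts]

# Property: the total area under the histogram is 1.
total_area = sum(h * width for h in heights)


def percentile(p):
    """Return x* such that a fraction p of the total area lies to its left."""
    area_so_far = 0.0
    for edge, h in zip(lower_edges, heights):
        bar_area = h * width
        if area_so_far + bar_area >= p:
            # Assumption: values are spread evenly inside the class,
            # so we interpolate linearly within the bar.
            return edge + (p - area_so_far) / h
        area_so_far += bar_area
    return lower_edges[-1] + width


x20 = percentile(0.20)
median = percentile(0.50)
print(f"total area = {total_area:.4f}")
print(f"20th percentile is about {x20:.2f}, median is about {median:.2f}")
```

The same reading applies to a smooth density curve: the pth percentile is the point with area p/100 to its left, and the median is the point that splits the area in half.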

 

 

Normal Distribution/Density Curve 

(Note: there are other density curves that have these properties, but the Normal is 'special' in many other ways.)

 

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \qquad (1)$$
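As a rough check, the Normal density can be coded up and tested against the density-curve properties. The sketch assumes that (1) is the usual Normal density formula, written out in the function below. The values of `mu` and `sigma` are hypothetical, chosen only for illustration.

```python
import math


def normal_density(x, mu, sigma):
    """Height of the Normal density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


# Hypothetical values for illustration only.
mu, sigma = 6.9, 1.6

# Midpoint-rule approximation of the area under the curve
# over mu plus or minus 8 sigma (the tails beyond that are negligible).
steps = 100_000
a, b = mu - 8 * sigma, mu + 8 * sigma
dx = (b - a) / steps
area = sum(normal_density(a + (i + 0.5) * dx, mu, sigma) for i in range(steps)) * dx

# The curve is highest at the mean and symmetric about it.
peak = normal_density(mu, mu, sigma)
left = normal_density(mu - 1.0, mu, sigma)
right = normal_density(mu + 1.0, mu, sigma)

print(f"area under the curve is about {area:.6f}")
print(f"height at mu = {peak:.4f}, one unit either side = {left:.4f}, {right:.4f}")
```

Swapping in a sample mean and standard deviation for `mu` and `sigma` gives the density curve for a particular data list.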

 

 

What is so neat about the Normal Density? 

  1. It is symmetric so the median equals the mean.
  2. If you have a data list that might (in an ideal sense) be described by the normal density, then all you need to do is calculate x̄ and s, swap them for μ and σ in (1) and you have the equation of your density curve.
  3. Just by visualizing the graph of the normal distribution/density function you can figure out the mean μ and the standard deviation σ. To see this, first look at Figure 1.15 and note the special relationship between the peak of the density and μ, and between the 'shoulders' of the density and σ.
  4. This is always true: Try out a blind test on Figure 1.12 and then compare it with what we already know to be true.

  5. ALL normal density curves (regardless of the size of μ or σ) follow what is known as the 68-95-99.7 rule. In particular:

68%: 68% of the observations (area/relative frequency) lie in an interval one standard deviation either side of the mean

95%: 95% of the observations (area) lie in an interval two standard deviations either side of the mean

99.7%: 99.7% of the observations (area) lie in an interval three standard deviations either side of the mean
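The three percentages can be checked numerically. The sketch below uses the standard fact that, for any Normal density, the area within k standard deviations of the mean equals erf(k/√2), where erf is the error function from Python's math module. The function name `area_within` is made up for the illustration.

```python
import math


def area_within(k):
    """Area under any Normal density within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {100 * area_within(k):.1f}%")
# prints 68.3%, 95.4% and 99.7%
```

The result does not depend on μ or σ, which is why the rule holds for ALL normal density curves. Note that 68 and 95 are rounded values of 68.3 and 95.4.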

 

Special Case Normal Density: Called the Standard Normal Density.