Interpretation of statistics is not black-and-white. If something is "statistically significant," this does not necessarily mean we should all switch over to the new method, or new product, or whatever. As the text explains on p. 62, "statistically significant" is not equivalent to important.
Notice that another issue discussed in this short section is sample size and, although the authors don't say it explicitly here, power. Increasing the sample size increases power, but this power can be misused, even in statistics.
Pay close attention to Case 1 and Case 2 -- it goes both ways. Both cases are connected to the power of statistical methods.
When you get to p. 65, the last paragraph there is chock full of things that are very important to understand.
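To see how sample size feeds into "statistically significant," here is a rough sketch in Python. It is not from the textbook: it assumes a two-sample z-test with equal group sizes and a known standard deviation, and the numbers (a difference of 0.5 on a scale where the standard deviation is 10) are made up for illustration. The function name two_sided_p_value is hypothetical.

```python
import math

def two_sided_p_value(diff, sd, n):
    """Two-sided p-value for a two-sample z-test.

    Assumptions (not from the textbook): both groups have size n and
    the same known standard deviation sd.
    """
    standard_error = sd * math.sqrt(2 / n)
    z = diff / standard_error
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

# Same tiny difference, two very different sample sizes (made-up numbers).
small_study = two_sided_p_value(diff=0.5, sd=10, n=50)
large_study = two_sided_p_value(diff=0.5, sd=10, n=50000)

print(f"n = 50 per group:    p = {small_study:.3f}")
print(f"n = 50000 per group: p = {large_study:.2e}")
```

The difference between the two groups is equally small in both runs. Only the sample size changed, yet the large study would be called "statistically significant" and the small one would not. Whether a difference of 0.5 matters is a separate question that the p-value cannot answer.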
Chapter 4 Summarizing Data Graphically
Continuing on yet some more in Section 4.4 ...(hey -- that rhymes!)
Recall that quantitative variables have numerical values that are numerically meaningful (not just used as labels).
4.4.3 Stem-and-Leaf Plots (pp 243-248)
This is basically like a frequency plot, but the number line is vertical, and instead of "x" marks (or dots), the marks are actual digits from the data. So, for example, instead of putting an x over the 32 to show 32, you might put "2" next to "3" to show 32.
You have to look over the particular set of values to decide which place value to use for the stem and then the next place value to the right would be the leaf place value. (For instance, you might have the hundreds place for stems, and then the tens place would be leaves. Or else you might have the ones place for stems, and then the tenths place would be leaves.)
Write the stems along the number line, even if there are no leaves, just like you make tick marks along a number line in a graph, even if you don't have any values to put there.
One cool thing about a stem-and-leaf plot (also called a "stemplot") is that you can go along your data list in order (you do NOT have to sort it first) and write down each leaf, just like when you plot points in a graph, except here you write down the digit instead of making a dot or an "x." You should end up with exactly the same number of leaves as your sample size -- one leaf per unit in the sample.
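The procedure just described can be sketched in a few lines of Python. This is only an illustration under assumptions that are not from the textbook: the data are made-up two-digit whole numbers, the tens place is the stem, the ones place is the leaf, and the function name stemplot is hypothetical.

```python
def stemplot(values):
    """Build a stem-and-leaf plot for non-negative whole numbers.

    Assumption: tens place = stem, ones place = leaf.
    """
    leaves = {}
    # Go through the data in the order given; no sorting needed.
    for v in values:
        stem, leaf = divmod(v, 10)
        leaves.setdefault(stem, []).append(leaf)

    lines = []
    # Write every stem from smallest to largest, even if it has no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        row = "".join(str(d) for d in leaves.get(stem, []))
        lines.append(f"{stem} | {row}".rstrip())
    return lines

data = [32, 45, 21, 38, 60, 27, 33, 41, 29, 35]   # made-up sample
plot = stemplot(data)
print("\n".join(plot))
```

Notice that the leaves on each stem appear in the order they were read from the list, that stem 5 still gets a row even though it has no leaves, and that the total number of leaves equals the sample size of 10.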
Pay attention to the Think About It (TAI) on p. 248 to make sure that you understand what is happening with each of these stemplots.
4.4.4 Histograms (pp 252-259)
The idea here is basically like frequency plots in many ways. First of all, since the variable is quantitative, a number line is appropriate. But usually histograms are used instead of frequency plots when the sample size is so large that the number of "x" marks for a frequency plot would just be absurd! So instead of an individual "x" mark (or dot) for each data value, the data values are grouped into "classes" (that's what they are called), for instance, 10 to 19, 20 to 29, 30 to 39, and so on. Then a bar is drawn (a lot like with a bar graph) for each "class" to show how many values (the "frequency") occur in that class.
Read the "Basic Steps" on page 253. Some additional comments:
Before dividing the "smallest to largest range" into classes, we sometimes round the MIN and the MAX to more convenient numbers -- round down from the smallest and round up from the largest. For example, if MIN=22 and MAX=47, we might consider the "smallest to largest range" as 20 to 50 for display purposes.
How many classes? Some books have specific guidelines, but there is no set rule. In general, not too few classes, and not so many classes that you hardly have any values in them.
At the top of p. 254, those two histograms appear to be identical! In fact they are displaying the exact same set of numbers, but one tells you how many people are in each class, whereas the other tells you the percentage out of 100%. Sometimes people make two-in-one histograms, with the vertical axis on one side to show the actual numbers of people (frequency) and on the other side, a vertical axis is there to show the percent out of the whole group for each class (relative frequency).
IMPORTANT: Normally, the classes are equal-width, so the bars drawn should all have the same width. It is the height of the bars that gives the information about the distribution. Also, the bars should touch unless there is a class that has no values at all in it, in which case that empty space actually gives information. (Recall that in a bar graph, there are spaces between the bars, but bar graphs are for categorical variables, and the order of the bars is arbitrary. Read pages 258-259 carefully.)
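Here is a small Python sketch of the counting behind a histogram, using the MIN=22 and MAX=47 example from above with the display range rounded to 20 to 50. The data values, the class width of 10, and the function name class_frequencies are assumptions made for illustration; they are not from the textbook.

```python
def class_frequencies(values, start, width, n_classes):
    """Count how many values fall in each equal-width class."""
    counts = [0] * n_classes
    for v in values:
        index = (v - start) // width   # which class this value belongs to
        counts[index] += 1
    return counts

# Made-up sample with MIN=22 and MAX=47.
data = [22, 25, 31, 34, 36, 38, 41, 47, 29, 33]
start, width, n_classes = 20, 10, 3   # classes 20-29, 30-39, 40-49

counts = class_frequencies(data, start, width, n_classes)        # frequency
percents = [100 * c / len(data) for c in counts]                 # relative frequency

for i, (c, p) in enumerate(zip(counts, percents)):
    low = start + i * width
    high = low + width - 1
    print(f"{low}-{high} | {'#' * c} {c} ({p:.0f}%)")
```

The row of "#" marks has the same shape whether you label it with the count or with the percent, which is why the frequency and relative frequency histograms look identical apart from the vertical scale.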
Chapter 5 Summarizing Data Numerically
Now we continue on in Chapter 5.
Last time we looked at measures of center, and now we move on to measures of spread.
5.3 Measuring Variation or Spread (pp 312-333)
As seen previously, measures of center tell us where the "center" is. But it can also help to know how spread out the values are.
5.3.1 Range (p. 313)
To get the range, just subtract: MAX – MIN = RANGE.
In everyday speech, we might say the "range" of the data is from 20 to 32, but technically the range is a statistic and is one value, which, in this case, would be 12 because 32 – 20 = 12.
5.3.2 Interquartile Range (pp 313-315)
This is the same idea as the range, except that it uses quartiles instead of MIN and MAX; so we have to learn about quartiles first!
If you remember "percentiles" it might help you to know:
25th percentile = First Quartile
50th percentile = Second Quartile
75th percentile = Third Quartile
Note: The median = Second Quartile which equals the "middle value" so hopefully that makes some sense.
The basic idea is that the median splits the data set into a "low half" and a "high half."
Then the median of the lower half IS the first quartile (Q1) (also called the "lower quartile"). And the median of the upper half IS the third quartile (Q3) (also called the "upper quartile").
To get the interquartile range, just subtract: Q3 – Q1 = IQR.
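The "median of each half" idea can be sketched in Python as follows. The data values are made up (chosen so that MIN=20 and MAX=32, like the range example earlier), and the function name quartiles_by_halves is hypothetical. One assumption to flag loudly: when the sample size is odd, this sketch leaves the overall median out of both halves, and your textbook may handle that case differently.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumption: with an odd sample size, the overall median is left out
    of both halves. Needs at least two values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]     # the "low half"
    upper = ordered[-half:]    # the "high half"
    return median(lower), median(upper)

data = [27, 20, 32, 25, 30, 22, 31, 28]   # made-up values, not sorted

data_range = max(data) - min(data)        # MAX - MIN
q1, q3 = quartiles_by_halves(data)
iqr = q3 - q1                             # Q3 - Q1

print(f"RANGE = {data_range}")
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```

Sorted, the data are 20, 22, 25, 27 | 28, 30, 31, 32, so Q1 is halfway between 22 and 25, and Q3 is halfway between 30 and 31.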
5.3.3 Five-Number Summary (starts on p. 315)
The five numbers we want for this summary are (in order from smallest to largest):
MIN (the minimum, the smallest value)
Q1 (the first quartile)
Q2 (the second quartile, which is also the median)
Q3 (the third quartile)
MAX (the maximum, the largest value)
Everyone should get exactly the same value for the median. But for Q1 and Q3 there actually are different formulas in different books, so you could end up with slightly different numbers depending on how you do it. But they should all be close! So it is OK if your Q1 and/or Q3 does not match someone else's or does not match what is in the back of the book. The whole idea, though, is that Q1 should be in the middle of the "lower half" and Q3 should be in the middle of the "upper half," so that about 25% of the values are less than or equal to Q1 and about 25% of the values are greater than or equal to Q3.
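To see the "different formulas, slightly different Q1 and Q3" point in action, here is a sketch that computes the five-number summary three ways for the same made-up data. The "halves" method is the one described in these notes; the other two are the "exclusive" and "inclusive" interpolation rules built into Python's statistics module (Python 3.8 or later), which are not necessarily the rules your textbook uses.

```python
from statistics import median, quantiles

data = [20, 22, 25, 27, 28, 30, 31, 32]   # made-up values, already sorted

# Method A: medians of the lower and upper halves (even sample size here).
half = len(data) // 2
halves = (median(data[:half]), median(data), median(data[half:]))

# Methods B and C: two interpolation rules from the statistics module.
exclusive = tuple(quantiles(data, n=4, method="exclusive"))
inclusive = tuple(quantiles(data, n=4, method="inclusive"))

methods = [("halves", halves), ("exclusive", exclusive), ("inclusive", inclusive)]
for name, (q1, q2, q3) in methods:
    print(f"{name:9s}  MIN={min(data)}  Q1={q1}  Q2={q2}  Q3={q3}  MAX={max(data)}")
```

All three methods agree on MIN, MAX, and the median (27.5), but they give three slightly different values for Q1 and for Q3. None of them is "wrong."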
We use the five-number summary to draw a boxplot (also called a "box-and-whiskers plot").
The five numbers are displayed in a horizontal boxplot in this way:
MIN is either a dot or a left-endpoint of a whisker
Q1 is the left edge of the box
the median is the line usually inside the box
Q3 is the right edge of the box
MAX is either a dot or a right-endpoint of a whisker
Boxplots can be drawn either horizontally or vertically. The above description is for one drawn horizontally, over a number line like we usually do a number line.
The boxplot is an excellent way for us to get a lot of information about a set of numbers visually very quickly! Examples 5.7 and 5.8 (pp 322-4) try to make the very important point that we cannot really infer the shape of the distribution just from the shape of the boxplot, but they could have used better examples, frankly.
The important point about the modified boxplot is that if there are extremely high or extremely low values that are far away from the group, then we do not want to draw the whisker all the way out from the box to that extreme value or values because that gives a misleading image. Look at any of the boxplots on pages 318 to 320. Look at those that have outliers shown as separate dots, and consider how they would look if you had (instead) continued the line from the box all the way out to that dot. People would get the idea that values were spread out all along the way. Showing the outlier as a separate dot shows us all that space in between where there are no values so we see the big gap.
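Here is a sketch of how the whisker ends and the separate outlier dots might be worked out. These notes do not state the rule for deciding what counts as an outlier, so the sketch assumes the common convention of flagging values more than 1.5 × IQR below Q1 or above Q3; check the textbook for the rule it actually uses. The data and the function name modified_boxplot_parts are made up, and quartiles are found by the "median of each half" method.

```python
from statistics import median

def modified_boxplot_parts(values, k=1.5):
    """Find whisker ends and outliers for a modified boxplot.

    Assumption: a value is an outlier if it is more than k * IQR
    below Q1 or above Q3 (k = 1.5 is a common convention).
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[-half:])
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr

    outliers = [v for v in ordered if v < low_fence or v > high_fence]
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    return {
        "fences": (low_fence, high_fence),
        "whiskers": (inside[0], inside[-1]),   # whiskers stop at the last non-outlier
        "outliers": outliers,                  # each one is drawn as a separate dot
    }

data = [20, 22, 25, 27, 28, 30, 31, 32, 60]   # made-up values with one extreme value
parts = modified_boxplot_parts(data)
print(parts)
```

With these numbers the right whisker stops at 32 and the value 60 is drawn as its own dot, so the reader can see the big gap between 32 and 60 instead of a long whisker suggesting values all along the way.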
5.3.4 Standard Deviation (pp 326-333)
The standard deviation is an extremely important measure of spread in many situations.
The standard deviation is based on the mean of the set of numbers.
The deviation of an individual value is how far it is from the mean. (If the value is above the mean, its deviation is positive; if the value is below the mean, its deviation is negative; if the value is equal to the mean, then its deviation is 0.)
The idea of the standard deviation is that it gives us something like the average (but not exactly) of the magnitudes of the deviations of all the numbers in the data set.
Important concepts to understand (for now) about standard deviation:
In a set of numbers, if all the values are exactly the same, then the standard deviation is zero (0) because there is no deviation at all.
If you have two sets of numbers, and one has a larger standard deviation than the other, then the one with the larger standard deviation has values that are more spread out than the other set of numbers.
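The two concepts above can be checked with a short Python sketch. These notes do not give the formula, so the sketch assumes the usual sample standard deviation (divide the sum of squared deviations by n – 1, then take the square root); compare it with the formula in your textbook. The three data sets and the function names are made up for illustration.

```python
import math

def deviations(values):
    """How far each value is from the mean (positive above, negative below)."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def sample_standard_deviation(values):
    """Assumed formula: sqrt( sum of squared deviations / (n - 1) )."""
    squared = [d ** 2 for d in deviations(values)]
    return math.sqrt(sum(squared) / (len(values) - 1))

all_same = [5, 5, 5, 5]       # no deviation at all
tight = [9, 10, 10, 11]       # mean 10, values close together
spread = [2, 8, 12, 18]       # mean 10, values far apart

for name, values in [("all_same", all_same), ("tight", tight), ("spread", spread)]:
    print(f"{name:8s} deviations = {deviations(values)}  "
          f"SD = {sample_standard_deviation(values):.3f}")
```

The set with identical values has a standard deviation of 0, and although "tight" and "spread" have the same mean, "spread" has the much larger standard deviation.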
Homework Assignment #4
Reading -- See the Reading Assignment in "Lessons" in myCR and read the corresponding parts of the textbook.
Reference Book -- Just a reminder that while you are reading, keep writing information in it.
Forum: Decision -- In the myCR Forum, post a description of an "either/or" type of decision you have faced yourself (or had to make) in your personal life (or you can make up a fictitious one). Describe what both the choices were, and then describe both types of errors that could have resulted, as well as some possible consequences of each error. Post a reply to at least two other people in the class also.
Data Project #1, part 2 -- After your idea is approved, collect data. Remember that the variable should be quantitative (continuous) from samples from two different populations. The variable of interest should be the same for both samples and we will compare the two later on. You should take all the measurements yourself. The two populations should be distinct and there should be no question about which population any sampling unit belongs in. See "Assignments" in myCR for more details. The write-up is due June 13.
NOTE: The sample sizes are to be at least 30 for each of your two samples.