Day 1
Descriptive Statistics Review:
Shape: Mound Shaped, Uniform, Bimodal, Symmetrical, and Skewed (left or right)
Center: Mean, Median and Mode.
Mean: The sum of the values divided by the number of values. For example, for the data set 4,5,6,8,12 the mean is (4+5+6+8+12)/5 = 35/5 = 7
Median: The middle of the data when the values are sorted. Since our data set above has an odd number of values, we take the middle value of 6. If we had an even number of data values, we would take the middle two numbers, add them together, and divide by two. For example, if your sample has n=30, then your median is the average of the 15th and 16th numbers.
Mode: The number that appears the most often within the sample.
What is the mode of the following data set: 5,6,7,7,8? 7 is the correct answer because it is the number that appears the most within the set.
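A quick way to check these by hand calculations is Python's built-in statistics module. This is only a sketch using the two example data sets from the notes; the variable names are made up for illustration.

```python
from statistics import mean, median, mode

data = [4, 5, 6, 8, 12]

mean_value = mean(data)      # (4 + 5 + 6 + 8 + 12) / 5
median_value = median(data)  # middle value of the sorted data (odd count)

# With an even count, median() averages the two middle values.
even_median = median([4, 5, 6, 8])  # (5 + 6) / 2

# The value that appears most often.
mode_value = mode([5, 6, 7, 7, 8])

print(mean_value, median_value, even_median, mode_value)
```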
Day 2
Spread: Range, Standard Deviation, and Variance.
Range: the difference between the maximum value of the sample and the minimum value of the sample.
Standard Deviation: measures how far the values in a set of data typically fall from the mean.
Variance: the square of the standard deviation.
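The measures of spread can be checked the same way. This sketch reuses the Day 1 data set (4,5,6,8,12). The notes do not say whether the sample or the population formula is intended, so both are computed here; that choice is an assumption, not something stated in the notes.

```python
from statistics import pstdev, pvariance, stdev, variance

data = [4, 5, 6, 8, 12]

# Range: maximum minus minimum.
data_range = max(data) - min(data)

# Sample versions divide the sum of squared deviations by n - 1.
sample_variance = variance(data)
sample_sd = stdev(data)

# Population versions divide by n instead.
population_variance = pvariance(data)
population_sd = pstdev(data)

print(data_range, sample_variance, sample_sd, population_variance, population_sd)
```

In both cases the standard deviation is the square root of the variance.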
Day 3
Data set: a data set consisting of observations on a single attribute is a univariate data set. A univariate data set is either categorical (i.e. qualitative) or numerical (i.e. quantitative).
Types of Data:
1. Categorical: If each observation is a category rather than a number. For example, a person's hometown would be a categorical response.
2. Numerical: If each observation is a number. For example, the GPA of each individual would be a numerical response.
In some studies, you will focus on two different attributes. For example, the GPA and the number of classes enrolled in might be recorded for each individual in a group. Each observation in the resulting data set would be a pair of numbers, such as (3.4,4). This is called a bivariate data set.
Day 4
Types of Samples:
1. Simple Random Sample: a sample chosen in such a way that every possible sample of n objects has an equal chance of being selected.
2. Stratified Random Sample: Divide the population into different strata. Select a simple random sample from each stratum. (for example: Divide the USA into different regions as strata and then use a simple random sample to choose states within each region.)
3. Systematic Sample: Sampling every kth item from an ordered list of the population (for example: selecting every 10th item on the list.)
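The three sampling methods above can be sketched with Python's random module. Everything here is hypothetical: the population is just the numbers 1 to 100, the two strata ("low" and "high") are made up, and the random start for the systematic sample is a common convention rather than something stated in the notes.

```python
import random

rng = random.Random(0)            # fixed seed so the sketch is repeatable
population = list(range(1, 101))  # hypothetical population of 100 labeled items

# 1. Simple random sample: every possible sample of size 10 is equally likely.
simple = rng.sample(population, 10)

# 2. Stratified random sample: split the population into strata,
#    then take a simple random sample from each stratum.
strata = {
    "low": [x for x in population if x <= 50],
    "high": [x for x in population if x > 50],
}
stratified = {name: rng.sample(members, 5) for name, members in strata.items()}

# 3. Systematic sample: pick a random starting point, then take every kth item.
k = 10
start = rng.randrange(k)
systematic = population[start::k]

print(simple)
print(stratified)
print(systematic)
```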
Day 5
Types of Bias within Sampling:
Bias: the tendency for a sample to differ from the population in a systematic way, for example because some part of the population is systematically favored.
1. Selection Bias: Tendency for samples to differ from the corresponding population as a result of systematic exclusion of some part of the population.
2. Measurement Bias: Tendency for samples to differ from the corresponding population because the method of observation tends to produce values that differ from the true value.
3. Nonresponse Bias: Tendency for samples to differ from the corresponding population because the data is not obtained from all individuals selected for inclusion in the sample.
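4. Response Bias: Tendency for samples to differ from the corresponding population because of how the question is asked, for example when the survey uses a leading question.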
Homework Problems for Sampling
Day 6
Displaying Categorical Data: (1) Frequency Distributions (2) Bar charts (3) Pie charts
1. Frequency Distribution for categorical data: a table that displays the possible categories along with the associated frequencies or relative frequencies.
The Frequency for a particular category is the number of times the category appears in the data set.
The Relative Frequency for a particular category is the fraction or proportion of the time that the category appears in the data set. It is calculated as:
relative frequency= frequency/# of observations in the data set
2. Bar Charts: use to display categorical data.
a. Draw a horizontal line, and write the category names or labels below the line at regularly spaced intervals.
b. Draw a vertical line, and label the scale using either frequency or relative frequency.
c. Place a rectangular bar above each category label. The height is determined by the category's frequency or relative frequency, and all bars should have the same width. With the same width, both the height and the area of the bar are proportional to the relative frequency.
3. Pie Charts: use with Categorical data with a relatively small number of possible categories. Pie charts are most useful for illustrating proportions of the whole data set for various categories.
Construct by:
1. Drawing a circle to represent the entire data set.
2. For each category, calculate the "slice" size: slice size = category relative frequency * 360 (since there are 360 degrees in a circle).
3. Draw a slice of appropriate size for each category. This can be tricky, so most pie charts are generated using a graphing calculator or a statistical software package.
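The frequency, relative frequency, and pie slice calculations above can be sketched in a few lines of Python. The responses below are made-up data (a hometown region for 10 students), used only to illustrate the arithmetic.

```python
from collections import Counter

# Hypothetical categorical data: hometown region of 10 students.
responses = ["North", "South", "North", "East", "North",
             "South", "East", "North", "West", "South"]

# Frequency: number of times each category appears.
frequency = Counter(responses)

# Relative frequency: frequency / number of observations in the data set.
n = len(responses)
relative_frequency = {c: count / n for c, count in frequency.items()}

# Pie slice size in degrees: relative frequency * 360.
slice_degrees = {c: rf * 360 for c, rf in relative_frequency.items()}

for category in frequency:
    print(category, frequency[category],
          relative_frequency[category], slice_degrees[category])
```

The relative frequencies add up to 1 and the slices add up to 360 degrees, which is a handy check on a hand-built table.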
Day 7
Probability
P(E) = number of outcomes favorable to E/number of outcomes in the sample space (when all outcomes are equally likely)
Basic properties of Probability:
1. For any event E, 0<=P(E)<=1
2. If S is the sample space for an experiment, P(S)=1
3. If two events E and F are disjoint, then P(E or F)=P(E) +P(F)
4. For any event E, P(E) + P (not E) =1 so
P(not E)=1 -P(E) and P(E)=1-P(not E)
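The basic properties can be checked on a small example. This sketch assumes one roll of a fair six-sided die (an example not in the notes), so every outcome is equally likely; the helper name prob is made up for illustration.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # one roll of a fair die (assumed example)

def prob(event):
    """Favorable outcomes divided by outcomes in the sample space."""
    return Fraction(len(event & sample_space), len(sample_space))

even = {2, 4, 6}
one = {1}

p_sample_space = prob(sample_space)  # property 2
p_even = prob(even)
p_even_or_one = prob(even | one)     # property 3: the two events are disjoint
p_not_even = 1 - p_even              # property 4: complement rule

print(p_sample_space, p_even, p_even_or_one, p_not_even)
```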
Counting Principle: If you can do one task in m ways and, for each of these, you can do another task in n ways, then the number of ways the two tasks can be done together is m*n.
Combinations (order doesn't matter) : Number of ways you can select a committee. For example: (5 choose 3) = (5*4*3)/(3*2*1) = 10 possible committees of 3 chosen from 5 people.
Permutations (order does matter) : Number of ways you can select and arrange items. For example: 5*4*3=60 possible ways to fill 3 distinct positions (such as president, vice president, and secretary) from 5 people.
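These counts can be verified in Python (3.8 or later for math.comb and math.perm). This is a sketch only; the five people labeled A to E and the shirts and pants example are hypothetical.

```python
import math
from itertools import combinations, permutations, product

people = ["A", "B", "C", "D", "E"]  # hypothetical group of 5

# Counting principle: m ways for one task times n ways for the other.
shirts, pants = ["red", "blue", "green"], ["jeans", "khakis"]
n_outfits = len(list(product(shirts, pants)))  # 3 * 2

# Combinations: choose 3 of 5 when order is ignored.
n_committees = math.comb(5, 3)
committees = list(combinations(people, 3))

# Permutations: choose 3 of 5 when order counts.
n_ordered = math.perm(5, 3)
ordered = list(permutations(people, 3))

print(n_outfits, n_committees, n_ordered)
```

Each committee of 3 can be arranged in 3*2*1 = 6 orders, which is why the permutation count is 6 times the combination count.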
Day 8
Probability of Mutually Exclusive Events: If two events, A and B, are mutually exclusive, then the probability that either A or B occurs is the sum of their probabilities.
P(A or B)= P(A) +P(B)
General Addition Rule: For any two events A and B, P(A or B)= P(A) +P(B)-P(A and B). When A and B are mutually exclusive, P(A and B)=0, which gives the formula above.
General Multiplication Rule: For any two events E and F, P(E and F)=P(E|F)P(F)
This comes from rearranging the conditional probability formula P(E|F)= P(E and F)/P(F).
Conditional Probability
P(A|B), read "the probability of A given B," =P(A and B) divided by P(B).
Example: P(A)= .7 P(B)=.6
P(A and B)=.54
P(A|B)=.54/.60=.90
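The example values can be plugged in directly. This sketch uses P(A)=.7, P(B)=.6, and P(A and B)=.54 from the notes and computes both conditional probabilities, since it is easy to divide by the wrong one: the probability of A given B divides by P(B), while the probability of B given A divides by P(A).

```python
p_a = 0.7
p_b = 0.6
p_a_and_b = 0.54

# Probability of A given B: divide by P(B).
p_a_given_b = p_a_and_b / p_b

# Probability of B given A: divide by P(A).
p_b_given_a = p_a_and_b / p_a

# General addition rule for "A or B".
p_a_or_b = p_a + p_b - p_a_and_b

print(round(p_a_given_b, 3), round(p_b_given_a, 3), round(p_a_or_b, 3))
```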
Homework Problems for Probability
Day 9
A Confidence Interval for a population characteristic is an interval of plausible values for the characteristic. It is constructed so that, with a chosen degree of confidence, the value of the characteristic will be captured inside the interval.
The Confidence Level associated with a confidence interval estimate is the success rate of the method used to construct the interval.
Large Sample Confidence Interval for Pi
The general formula for a confidence interval for a population proportion pi when:
1. p is the sample proportion from a random sample, and
2. the sample size n is large (np>10 and n(1-p)>10)
is
p + or - (z critical value) * (square root of (p(1-p)/n))
The desired confidence level determines which z critical value is used. The three most commonly used confidence levels are 90%, 95%, and 99%, which use z critical values of 1.645, 1.96, and 2.58, respectively.
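The large-sample interval for a proportion can be sketched as a small function. The function name proportion_ci and the sample numbers (120 successes out of 200) are made up for illustration; the formula and the np>10, n(1-p)>10 check come from the notes.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Large-sample z interval for a population proportion (default z is for 95%)."""
    p = successes / n
    # Large-sample condition from the notes.
    if not (n * p > 10 and n * (1 - p) > 10):
        raise ValueError("sample is too small for the large-sample interval")
    margin = z * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin

low, high = proportion_ci(successes=120, n=200)
print(round(low, 4), round(high, 4))
```

Passing z=1.645 or z=2.58 instead gives the 90% or 99% interval; a higher confidence level gives a wider interval.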
One sample Z Confidence Interval for Mu
The general formula for a confidence interval for a population mean mu when:
1. x is the sample mean from a random sample
2. the sample size n is large (generally n>30), and
3. sigma, the population standard deviation, is known
is
x plus or minus (z critical value) * (sigma/square root of n)
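The same kind of sketch works for the one-sample z interval for a mean. The function name mean_ci and the numbers (sample mean 1000, sigma 150, n=36) are hypothetical values chosen so the arithmetic is easy to follow.

```python
import math

def mean_ci(sample_mean, sigma, n, z=1.96):
    """z interval for a population mean when sigma is known (default z is for 95%)."""
    margin = z * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

# margin = 1.96 * 150 / sqrt(36) = 1.96 * 25
low, high = mean_ci(sample_mean=1000, sigma=150, n=36)
print(round(low, 2), round(high, 2))
```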
Day 10
Hypothesis Testing
A test of hypotheses is a method for using sample data to decide between two competing claims (hypotheses) about a population characteristic.
Examples of hypotheses:
mu=1000, where mu is the mean number of characters in an e-mail message.
pi is less than .01, where pi is the proportion of e-mail messages that are undeliverable.
One hypothesis might be mu=1000 and the other mu is not equal to 1000.
The Null Hypothesis, denoted Hnot (H0, read "H-naught"), is a claim about a population characteristic that is initially assumed to be true.
The Alternative Hypothesis, denoted Ha, is the competing claim.
In doing a test of Hnot versus Ha, the hypothesis Hnot will be rejected in favor of Ha, only if sample evidence strongly suggests that Hnot is false. If the sample does not contain such evidence, Hnot will not be rejected. The two possible conclusions are then Reject Hnot or fail to reject Hnot.
The form of a null hypothesis is:
Hnot: population characteristic=hypothesized value where the hypothesized value is a specific number determined by the problem context.
The alternative hypothesis will have one of the following three forms:
1. Ha: population characteristic > hypothesized value
2. Ha: population characteristic < hypothesized value
3. Ha: population characteristic not equal to hypothesized value
Errors in Hypothesis Testing:
Type I error: the error of rejecting Hnot when Hnot is true.
Type II error: the error of failing to reject Hnot when Hnot is false.
The probability of a type I error is denoted by Alpha and is called the level of significance of the test. Thus, a test with alpha =.01 is said to have a level of significance of .01 or to be a level .01 test.
The probability of a type II error is denoted by Beta.