SAMPLING FROM A POPULATION OR PROCESS CHARACTERIZED
BY A NORMALLY DISTRIBUTED RANDOM VARIABLE
Assume a process characterized by a normally distributed random variable X with
a given mean E(X) and a given standard deviation
$\sigma_X$. Now we are going to take random samples from
this process where each sample consists of n observations. From these n
observations we calculate the mean of the n observations by simply
adding the n observations and then dividing by n. Denote the observations from
this process by X1, X2, ..., Xn. The mean of this sample of n observations
is then the sum of the Xi from i = 1 to n divided by n. Note that since each
observation Xi is a random variable from the process, the sample of n
observations from the process is obtained by chance. And since the mean of
the n observations is obtained from the Xi, then the mean itself must have
been obtained by chance. That is, the mean of the sample must itself be a
random variable X-bar. It is important to keep in mind that X-bar is
obtained from the Xi drawn from the process for the n observations in the
sample and that X-bar is therefore itself a random variable.
What is the mean or expected value of X-bar? Recall the formula for X-bar:

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{1}{n}\left(X_1 + X_2 + \cdots + X_n\right)$$

It is important to note that since the
Xi have the same probability distribution as the process random variable
X, each Xi has the same mean and standard deviation as the process random
variable X: E(X) and $\sigma_X$ respectively. Thus if we
take the expected value of X-bar above we obtain the following:

$$E(\bar{X}) = \frac{1}{n}\left[E(X_1) + E(X_2) + \cdots + E(X_n)\right] = \frac{1}{n}\left[E(X) + E(X) + \cdots + E(X)\right] = \frac{1}{n}\left[n\,E(X)\right] = E(X)$$

What the above derivation shows is that the expected value of
the sample mean X-bar is equal to the mean E(X) of the process or
population from which the samples are randomly drawn. That is, if we take
millions of samples of size n (that is, samples with n observations) from
the process, the X-bar from the samples will bounce around randomly but
they will bounce around the value E(X).
What is the variance and standard deviation of X-bar? Again note that

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{1}{n}\left(X_1 + X_2 + \cdots + X_n\right)$$

Now we calculate VAR(X-bar) just as we
calculated E(X-bar) above. By a similar derivation to the one above, and
using the fact that the observations in a random sample are independent, we can
show that VAR(X-bar) = VAR(X)/n. That is, the variance of the sample mean
is equal to the variance of the process random variable divided by the
number of observations in the sample. The standard deviation of X-bar,
$\sigma_{\bar{X}}$, is thus equal to the standard deviation of
X divided by the square root of n. The next issue is: what is the nature
of the probability distribution of X-bar? Now recall that the probability
distribution of the process X is normal. It can be shown that since the
probability distribution of the process X is
normal, then the probability distribution of the X-bar -- called the
sampling distribution of X-bar -- is also normal. This is of critical
importance for we can now bring together our major results.
If the distribution of some process random variable
X is normal, then the distribution of the sample mean X-bar
of n observations from the process is also normal with
a mean equal to the mean of X and a standard deviation
equal to the standard deviation of X divided by the
square root of the number of observations in the sample.
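The result above can be checked numerically. The following is a minimal simulation sketch, not part of the original notes. It assumes a process with mean 50 and standard deviation 16 and samples of 64 observations (the values used in the problems in these notes); the function name `simulate_sample_means`, the number of samples, and the seed are illustrative assumptions.

```python
import random
import statistics


def simulate_sample_means(mu, sigma, n, num_samples, seed=0):
    """Draw num_samples random samples of size n from N(mu, sigma^2)
    and return the list of their sample means (one X-bar per sample)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = []
    for _ in range(num_samples):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(sum(sample) / n)  # X-bar = (X1 + ... + Xn) / n
    return means


# Assumed illustration values: E(X) = 50, sigma = 16, n = 64.
means = simulate_sample_means(mu=50.0, sigma=16.0, n=64, num_samples=5000)

mean_of_xbar = statistics.fmean(means)  # should be close to E(X) = 50
sd_of_xbar = statistics.pstdev(means)   # should be close to 16 / sqrt(64) = 2

print(f"average of the X-bars: {mean_of_xbar:.3f}")
print(f"standard deviation of the X-bars: {sd_of_xbar:.3f}")
```

The simulated X-bars should bounce around 50 with a spread close to 2, although the exact figures depend on the seed and the number of samples drawn.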
The theory that we have just developed allows us to use the normal table
to address TRICK I and TRICK II problems. I am going to ask you to do the
following problems:
1. TRICK I: Find P(X-bar>=52) given that a process random variable
X~N(50,256) and a random sample of 64 observations drawn from the process.
2. TRICK II: Given that a process random variable X~N(50,256), a random
sample of 64 observations drawn from the process and
P(X-bar>=X-barc) = .159, find X-barc.
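As a way to check answers obtained from the normal table, the two problems can be sketched in Python. This sketch is an addition to the notes and rests on one assumption: that the 256 in X~N(50,256) is the variance, so the process standard deviation is 16. It uses `NormalDist` from Python's standard `statistics` module in place of the table.

```python
import math
from statistics import NormalDist

mu = 50.0         # process mean
variance = 256.0  # assumed to be the variance in X~N(50,256), so sigma = 16
n = 64            # observations per sample

std_error = math.sqrt(variance) / math.sqrt(n)  # 16 / 8 = 2
xbar_dist = NormalDist(mu, std_error)           # sampling distribution of X-bar

# TRICK I: probability that X-bar is at least 52.
p_trick1 = 1.0 - xbar_dist.cdf(52.0)

# TRICK II: the value X-barc with P(X-bar >= X-barc) = .159,
# i.e. the point with cumulative probability 1 - .159 below it.
xbar_c = xbar_dist.inv_cdf(1.0 - 0.159)

print(f"TRICK I:  P(X-bar >= 52) = {p_trick1:.4f}")
print(f"TRICK II: X-barc = {xbar_c:.2f}")
```

Under this assumption the Z value in TRICK I is (52 - 50)/2 = 1, so both answers can be confirmed against the normal table.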
EXAMPLES OF HYPOTHESIS TESTING USING RANDOM SAMPLES DRAWN FROM A NORMALLY
DISTRIBUTED PROCESS RANDOM VARIABLE
Assume that a population is normally
distributed with an asserted mean of 50 and a standard deviation of 16.
EXAMPLE I: A researcher wishes to test the hypothesis that the mean of the
population is the asserted mean of 50 against the alternative hypothesis
that the mean is less than 50. She proposes to draw a random sample of 64
observations out of the population.
SOLUTION: The null and alternative hypotheses are as follows:
Ho: ux = 50
Ha: ux < 50
Noting the statement of the alternative hypothesis, this is a left-tailed
test. The researcher chooses an alpha equal to .05. Note that the .05 is a
probability that is the area in the left tail of the sampling distribution
of X-bar. The important question is to find the value of X-barc associated
with .05. We are interested in the following trick II problem:
Pr(X-bar<=X-barc) = .05. Find X-barc. To solve this we note that the value
of Zc associated with .05 in the left tail is -1.645, which can be obtained
with suitable use of the Z table in the text. Now we solve the following
equation for X-barc: -1.645 = (X-barc - 50)/2, where 2 is the standard
deviation of X-bar (16 divided by the square root of 64). Solving for
X-barc, we obtain 46.71. Thus we set up the following decision rule in the
following two equivalent ways:
1. Reject Ho in favor of Ha if X-bar*<=46.71 (this interval is called a
rejection or critical region).
2. Reject Ho in favor of Ha if Z*<=-1.645 (this interval is called a
rejection or critical region).
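The calculation in Example I can be sketched in Python as a cross-check. This is an illustrative addition rather than the method of the notes: it replaces the Z table with `NormalDist().inv_cdf` from the standard `statistics` module, and the helper name `reject_h0` and the observed sample means passed to it are hypothetical.

```python
import math
from statistics import NormalDist

mu_0 = 50.0   # mean asserted by Ho
sigma = 16.0  # population standard deviation
n = 64        # sample size
alpha = 0.05  # area in the left tail

std_error = sigma / math.sqrt(n)   # 16 / 8 = 2
z_c = NormalDist().inv_cdf(alpha)  # left-tail Z critical value, about -1.645
xbar_c = mu_0 + z_c * std_error    # solve z_c = (xbar_c - mu_0) / std_error


def reject_h0(xbar_observed):
    """Decision rule for the left-tailed test: reject Ho if X-bar* <= X-barc."""
    return xbar_observed <= xbar_c


print(f"Zc = {z_c:.3f}, X-barc = {xbar_c:.2f}")
print(reject_h0(46.0))  # hypothetical observed sample mean
```

Note that the critical value depends only on the asserted mean, the standard deviation, n, and alpha, so it can be computed before any data are collected.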
EXAMPLE II: A researcher wishes to test the hypothesis that the mean of
the population is the asserted mean of 50 against the alternative
hypothesis that the mean is greater than 50. She proposes to draw a random
sample of 64 observations out of the population.
SOLUTION: The null and alternative hypotheses are as follows:
Ho: ux = 50
Ha: ux > 50
Noting the statement of the alternative hypothesis, this is a right-tailed
test. The researcher chooses an alpha equal to .05. Note that the .05 is a
probability that is the area in the right tail of the sampling
distribution of X-bar. The important question is to find the value of
X-barc associated with .05. We are interested in the following trick II
problem: Pr(X-bar>=X-barc) = .05. Find X-barc. To solve this we note that
the value of Zc associated with .05 in the right tail is 1.645, which can
be obtained from the Z table in the text. Now we solve the following
equation for X-barc: 1.645 = (X-barc - 50)/2, where 2 is the standard
deviation of X-bar (16 divided by the square root of 64). Solving for
X-barc, we obtain 53.29. Thus we set up the following decision rule in the
following two equivalent ways:
1. Reject Ho in favor of Ha if X-bar*>=53.29 (this interval is called a
rejection or critical region).
2. Reject Ho in favor of Ha if Z*>=1.645 (this interval is called a
rejection or critical region).
EXAMPLE III: A researcher wishes to test the hypothesis that the mean of
the population is the asserted mean of 50 against the alternative
hypothesis that the mean is not equal to 50. She proposes to draw a random
sample of 64 observations out of the population.
SOLUTION: The null and alternative hypotheses are as follows:
Ho: ux = 50
Ha: ux not equal to 50
Noting the statement of the alternative hypothesis, this is a two-tailed
test. The researcher chooses an alpha equal to .05. Note that the .05 is a
probability that is the area that is evenly split between both tails of the
sampling distribution of X-bar. The important question is to find the
two critical values of X-bar -- one in the left tail and one in the
right tail -- associated with .05 split evenly between both tails. We are
interested in the following trick II problem: Pr(X-bar<=X-barc) = .025 for
the left tail and Pr(X-bar>=X-barc) = .025 for the right tail. Find the
X-barc's associated with the left and right tails. To solve this we note
that the value of Zc associated with .025 (one-half of .05) in the left
tail is -1.96, which can be obtained with suitable use of the Z table in
the text. Now we solve the following equation for X-barc:
-1.96 = (X-barc - 50)/2, where 2 is the standard deviation of X-bar (16
divided by the square root of 64). Solving for X-barc, we obtain 46.08 for
the X-barc in the left tail.
To obtain the X-barc associated with the right tail we solve the following
equation for X-barc: +1.96 = (X-barc - 50)/2. Solving for X-barc, we
obtain 53.92 for the X-barc in the right tail. Thus we set up the
following decision rule in the following two equivalent ways:
1. Reject Ho in favor of Ha if X-bar*<=46.08 or X-bar*>=53.92 (these two
intervals together are called the rejection or critical region).
2. Reject Ho in favor of Ha if Z*<=-1.96 or Z*>=+1.96 (these two
intervals together are called the rejection or critical region).
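The three examples differ only in where alpha is placed, which a single helper can capture. The following is a sketch under the assumptions of the examples (known standard deviation 16, n = 64, alpha = .05); the function names `critical_values` and `reject_h0` and the `tail` labels are hypothetical, introduced only for illustration.

```python
import math
from statistics import NormalDist


def critical_values(mu_0, sigma, n, alpha, tail):
    """Return (lower, upper) X-bar critical values for a Z test.

    tail is "left", "right", or "two"; a side with no critical
    value is returned as None.
    """
    std_error = sigma / math.sqrt(n)
    z = NormalDist()  # standard normal distribution
    if tail == "left":
        return mu_0 + z.inv_cdf(alpha) * std_error, None
    if tail == "right":
        return None, mu_0 + z.inv_cdf(1.0 - alpha) * std_error
    if tail == "two":
        z_c = z.inv_cdf(1.0 - alpha / 2.0)  # alpha split evenly between tails
        return mu_0 - z_c * std_error, mu_0 + z_c * std_error
    raise ValueError("tail must be 'left', 'right', or 'two'")


def reject_h0(xbar_observed, lower, upper):
    """Reject Ho if the observed mean falls in the rejection region."""
    in_left = lower is not None and xbar_observed <= lower
    in_right = upper is not None and xbar_observed >= upper
    return in_left or in_right


for tail in ("left", "right", "two"):
    lower, upper = critical_values(50.0, 16.0, 64, 0.05, tail)
    print(tail, lower, upper)
```

Rounded to two decimals, the printed critical values should agree with the 46.71, 53.29, and 46.08 / 53.92 obtained by hand in the three examples.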
Important Remarks:
1. Note that Ho and Ha are not numbers. They refer to
statements about the values of the mean of the underlying process or
population.
2. Note that the null hypothesis (Ho) asserts that the mean
of the underlying population is an exact number. The alternative
hypothesis (Ha) asserts that the mean lies in an interval of values.
3. Note that the researcher does not know the mean of the underlying
population. If she did, she would not have to sample and run the
hypothesis test. She would just know. But she doesn't and she takes a
sample and tests statements about some characteristic of the population
(in this case, the population mean). We can say that she makes inferences
about the population mean from the sample. This is why we call hypothesis
testing part of something called statistical inference -- making
inferences about the population from a sample drawn from it.
4. A word about characteristics of a population. Note that the
characteristic that is the subject of the hypothesis test is only one
characteristic. The researcher could also be interested in the population
variance or standard deviation. Characteristics of a population are called
parameters. Notice that they remain invariant across samples drawn from a
population. For example, a population mean doesn't change when another
sample is drawn although the sample mean will, in general, change. The
population mean is a fixed characteristic of a population (fixed across
samples), whereas a sample mean is a random variable that varies across
samples. A sample mean is also called a sample statistic. We will soon see
that the regression process has parameters -- the main ones of interest
are the B's in the true regression model just as the parameter of interest
right now is the population or process mean.
5. Note that the decision rule above was constructed prior to actually
taking the sample from the population and analyzing the sample data.
Although it may not be obvious now, it is important to construct the
decision rule prior to taking the actual sample.
6. It is important to note that the hypothesis test is run against the
assertion of the null hypothesis. This is why the sampling distribution of
X-bar is assumed to be centered around the asserted mean of 50 in
constructing the decision rule for the hypothesis test. The idea is to
assess the compatibility of the sample results with the asserted mean in
the null hypothesis. If the observed sample mean or Z falls in the
rejection region, the researcher concludes that the sample results are
inconsistent with the null hypothesis and rejects Ho in favor of Ha. If
the researcher does not reject Ho, we say that the observed mean or Z
falls into the acceptance region.
7. Remember that by employing the decision rule, the researcher could
be making a mistake. The population mean could really be 50 and she will
reject the assertion that it is 50. This is called a Type I error. For the
significance level chosen (alpha), the probability of making a Type I error is
.05. This means that if the population mean is really 50, the
researcher will incorrectly reject the null hypothesis that it is 50 five
percent of the time. What does the phrase "five percent of the time" mean?
It means that using this decision rule, if the researcher would repeatedly
draw samples of size 64 out of the population, given that the mean was
really 50, the researcher would incorrectly reject the assertion that it
was 50 in five percent of all the samples. But this is just another way of
saying that the probability of making a Type I error is five percent. You
might find this puzzling. Why should the researcher accept any risk of
making a Type I error? The probability of making such an error could be
reduced by lowering alpha -- say, lowering it to .01. Then if 50 is
really the mean, the researcher would reduce the probability of making a
Type I error to just .01. The problem with lowering the Type I error
probability is that it increases the probability of making another type of
error -- strangely this is called a Type II error. This is the case where
the Ho is actually not true and the researcher incorrectly concludes (by
applying her decision rule) that Ho is in fact true. The probability of
making a Type II error is called Beta. The problem is, for a given sample
size, that if a researcher reduces alpha, she increases beta. Thus, for a
given sample size, there is a tradeoff between the two probabilities. So
the researcher has the following choices:
a. For a given sample size n, she can reduce alpha and thus accept an
increase in beta.
b. For a given sample size, she can increase alpha and thus obtain the
benefit of a lower beta.
c. She can increase the sample size n and reduce
both alpha and beta. But an increase in the sample size will increase the
costs to the researcher.
8. Why sample? Three basic reasons:
a. Observing the entire population is expensive.
b. When sampling, sometimes the observation unit sampled must
be destroyed. For example, if a manufacturer is testing the breaking
strength of a steel rod, she must break the steel rod. This can be
expensive.
c. Accurate results can be obtained with a suitably drawn
sample.
9. Remember the basic sampling theory underlying hypothesis testing:
a. The underlying population is assumed to be normal.
b. The sampling distribution of the mean is normal, centered around the
population mean, and has a standard deviation equal to the standard
deviation of the underlying population divided by the square root of the
sample size n.
10. The standard deviation of X-bar is also called the standard error of
X-bar. Notice that the standard error of X-bar is smaller than the
standard deviation of the underlying population (for any sample size n
greater than 1) and the standard error
becomes smaller the larger the sample size n.
11. X-bar is called an estimator. It is a formula which calculates a
number from a sample which estimates a population parameter (in this case
the mean of an underlying population). X-bar is called a point estimator
since it is one number. This is important because there are estimators
which are intervals -- not just one number.
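The tradeoff between alpha and beta described in remark 7 can be illustrated numerically. The sketch below is an addition to the notes and uses the left-tailed test of Example I. Beta can only be computed for a specific alternative, so it assumes a hypothetical true mean of 46. The alpha values, the larger sample size of 256, and the function name `type_ii_error` are likewise illustrative assumptions.

```python
import math
from statistics import NormalDist


def type_ii_error(mu_0, mu_true, sigma, n, alpha):
    """Beta for the left-tailed test Ho: mean = mu_0 vs Ha: mean < mu_0,
    when the population mean is actually mu_true (below mu_0)."""
    std_error = sigma / math.sqrt(n)
    # Critical value built under Ho, exactly as in the decision rule.
    xbar_c = mu_0 + NormalDist().inv_cdf(alpha) * std_error
    # Ho is NOT rejected when X-bar > xbar_c; evaluate that under mu_true.
    return 1.0 - NormalDist(mu_true, std_error).cdf(xbar_c)


mu_true = 46.0  # hypothetical true mean, chosen only for illustration

beta_base = type_ii_error(50.0, mu_true, 16.0, 64, alpha=0.05)  # starting point
beta_a = type_ii_error(50.0, mu_true, 16.0, 64, alpha=0.01)     # choice a: lower alpha
beta_b = type_ii_error(50.0, mu_true, 16.0, 64, alpha=0.10)     # choice b: higher alpha
beta_c = type_ii_error(50.0, mu_true, 16.0, 256, alpha=0.01)    # choice c: larger n

for label, beta in (("base", beta_base), ("a", beta_a), ("b", beta_b), ("c", beta_c)):
    print(f"{label}: beta = {beta:.3f}")
```

With the sample size held at 64, lowering alpha raises beta and raising alpha lowers it; only the larger sample allows both to fall below their starting values.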