Statistical Data Analysis
Series: sci November 10, 2018
Overview of Types of Variables
- All Variables
  - Numerical
    - Continuous
    - Discrete
  - Categorical / Nominal
    - Regular Categorical
    - Ordinal
Associated vs Independent Variables
When two variables show some connection with one another, they are called associated (or dependent) variables.
If two variables are not associated, then they are independent.
Explanatory and Response Variables
An explanatory variable is the variable in a pair of variables that we suspect affects the other. Labeling a variable as explanatory does not guarantee that the relationship between the two is actually causal; it is merely a label so we can keep track of which variable we suspect affects the other.
In a pair of variables, where one is an explanatory variable, the other is the response variable.
One could say that an explanatory variable might affect a response variable.
Observational Studies and Experiments
An observational study occurs when researchers observe subjects, as opposed to imposing a treatment on their subjects.
In an experiment, researchers randomly assign subjects to various treatments in order to establish causal connections between the explanatory and response variables.
Sampling and Sources of Bias
A census is when we sample the entire population.
It is difficult to take a census.
Sampling Bias
There are multiple reasons for sampling bias:
Non-response occurs when only a small fraction of the sampled individuals respond to a survey; the sample may then no longer be representative of the population.
Voluntary response occurs when the sample consists of people who volunteer to respond because they have strong opinions, so the responses are not representative of the population.
Convenience samples are samples with a higher proportion of people who are more easily accessible than the complete population.
It is possible to have a large sample, but for that sample to have a bias that leads to significant issues with the conclusions we can draw from the sample.
Almost all statistical methods are based on the notion of implied randomness.
Common Sampling Techniques
Simple Random Samples randomly select cases from the entire population, where there is no implied connection between the points selected.
Stratified Samples are samples made up of random samples from non-overlapping subgroups. Each subgroup is called a stratum (plural, strata).
Cluster Samples are samples where the researcher divides the population into groups called clusters, created so that each cluster resembles the overall population. Once the clusters are created, we take a simple random sample from within each cluster. A short sketch of sampling in code follows below.
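As a minimal sketch (with a made-up population of IDs), simple random and stratified samples can be drawn with the Random standard library:
# Simple random sample: 50 cases drawn from the whole population.
using Random
population = collect(1:1000)          # hypothetical population of 1000 IDs
srs = shuffle(population)[1:50]
# Stratified sample: a random sample from each non-overlapping subgroup (stratum).
strata = Dict("urban" => collect(1:600), "rural" => collect(601:1000))
stratified = vcat([shuffle(ids)[1:25] for (name, ids) in strata]...)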
Experimental Design
Principles of Experimental Design
Researchers attempt to control for differences between treatment groups.
Researchers randomize patients into treatment groups to account for variables that cannot be controlled.
The more cases we observe, the more we can estimate the effect of the explanatory variable on the response. In a single study, we replicate by collecting a sufficiently large sample. We can also replicate an entire study to verify earlier findings.
When researchers are aware of other variables (other than the explanatory variable) that may influence the result, they may first group individuals based on this variable into blocks and then randomize cases within each block. Blocking is like stratifying.
Experimental Terminology
Placebos are fake treatments administered to the control group in medical studies.
Blinding is when experimental units do not know if they are in the control or treatment groups.
Double-blinding is when both the experimental units and the researchers do not know who is in the control group and who is in the treatment group.
Random Assignment vs Random Sampling
|  | Random Assignment | No Random Assignment |
|---|---|---|
| Random Sampling | Causal and Generalizable (Ideal Experiment) | Not Causal but Generalizable (Most Observational Studies) |
| No Random Sampling | Causal but not Generalizable (Most Experiments) | Neither Causal nor Generalizable (Bad Observational Studies) |
Variance
Variance is roughly the average squared deviation from the mean.
We use the squared deviation to get rid of negatives, so that observations equally distant from the mean are weighted equally, and to weigh larger deviations more heavily.
Standard Deviation
Standard Deviation is the square root of the variance.
It has the same units as the data.
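As a minimal sketch (with made-up data), both quantities come directly from the Statistics standard library:
# Sample variance and standard deviation of a small made-up dataset.
using Statistics
x = [4.0, 8.0, 6.0, 5.0, 3.0, 7.0]
var(x)   # sample variance: average squared deviation from the mean (dividing by n-1)
std(x)   # standard deviation: square root of the variance, in the data's units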
Interquartile Range
The 25th percentile is also called the first quartile or Q1.
The 50th percentile is also called the median.
The 75th percentile is also called the third quartile or Q3.
The range between Q1 and Q3 is called the Interquartile Range or IQR.
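As a minimal sketch (with made-up data), the quartiles and IQR can be computed with Statistics.quantile:
# Quartiles and interquartile range of a small made-up dataset.
using Statistics
x = [10, 12, 14, 15, 18, 21, 25, 30, 45]
q1 = quantile(x, 0.25)   # first quartile (25th percentile)
q3 = quantile(x, 0.75)   # third quartile (75th percentile)
iqr = q3 - q1            # interquartile range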
Mean vs Median
If the distribution is skewed, we can determine whether it is right-skewed or left-skewed based on the difference between the mean and the median.
Right-skewed is when the mean > the median.
Left-skewed is when the mean < the median.
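As a minimal sketch (with made-up data), a single large value pulls the mean above the median, indicating a right skew:
# A right-skewed sample: the mean exceeds the median.
using Statistics
incomes = [30, 32, 35, 38, 40, 42, 45, 250]
mean(incomes)     # pulled upward by the large value
median(incomes)   # resistant to the skew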
Null vs Alternate Hypothesis
To determine that there is something interesting in your data, you must first provide evidence against the claim that nothing is going on in your data.
The hypothesis that nothing interesting is happening is the Null Hypothesis. This represents the status quo.
Once you have rejected the Null Hypothesis, you may examine the Alternative Hypothesis.
The Alternative Hypothesis represents the research question we're testing for.
Normal Distribution
A normal distribution is a unimodal and symmetric, bell shaped curve.
We write this as N(μ, σ), where:
- μ: mean
- σ: standard deviation
# A Graph of a normal distribution of mean 3 and standard deviation 5
using Plots
using StatPlots
using Distributions
gr()
plot(Normal(3,5), title="Normal Distribution N(3,5)", lw=3,)
Lots of things have nearly normal distributions. SAT scores are distributed nearly normally with mean 1500 and standard deviation of 300.
# A graph of the SAT distribution
μ=1500
σ=300
plot(Normal(μ,σ), title="Normal Distribution N($(μ),$(σ))", lw=3, label="SAT Distribution")
vline!([1200,1800], label="1 Standard Deviation")
vline!([900, 2100], label="2 Standard Deviations")
vline!([600, 2400], label="3 Standard Deviations")
Z Scores
Z scores are a way of standardizing scores across normal distributions: the Z score of an observation is the number of standard deviations it falls above or below the mean, $Z = (x - \mu)/\sigma$.
We could use them, for example, to compare an ACT and an SAT score, both of which have nearly normal distributions.
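As a quick sketch, we can compute and compare the two Z scores directly; the ACT parameters below are assumed values for illustration, not taken from these notes:
# Compare an SAT score and an ACT score by converting each to a Z score.
# SAT: μ=1500, σ=300 (as above); ACT: μ=21, σ=5 (assumed for illustration).
sat_z = (1800 - 1500) / 300   # Z = (x - μ) / σ
act_z = (24 - 21) / 5
println("SAT Z = $(sat_z), ACT Z = $(act_z)")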
Percentiles
A percentile is the percentage of observations that fall below a given datapoint.
Graphically, percentile is the area below the probability distribution curve to the left of that observation.
# A graph of the SAT distribution
# Truncate num to the given number of significant digits (helper for labeling the plot).
function signifChop(num, digits)
    if num == 0.0
        return num
    else
        e = ceil(log10(abs(num)))
        scale = 10^(digits - e)
        return trunc(num * scale) / scale
    end
end
μ=1500
σ=300
using Rmath
percentile = signifChop(pnorm(1800,μ,σ)*100,3)
plot(Normal(μ,σ), title="Normal Distribution N($(μ),$(σ)) to $(percentile) percentile", lw=3, label="SAT Distribution", fillrange=0)
xlims!((0,1800))
# We can calculate percentiles with R/Julia:
pnorm(1800,μ,σ)
0.8413447460685429
Central Limit Theorem
In some situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution even if the original variables themselves are not normally distributed.
This means we can use statistical methods on normal distributions to analyze data that does not begin as a normal distribution.
Conditions:
- Independence: Sampled observations must be independent.
- Random sample/assignment.
- If sampling without replacement, then n must be less than 10% of the population.
- Sample size/skew: Either the population distribution is normal, or the sample size must be large (usually n > 30).
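As a minimal sketch, we can see the theorem at work by simulating sample means from a clearly non-normal (exponential) population; the distribution and sample size below are arbitrary choices for illustration:
# Sample means from a right-skewed population look nearly normal when n is large.
using Distributions, Statistics, Plots
pop = Exponential(2)                            # a right-skewed population
n = 40
means = [mean(rand(pop, n)) for _ in 1:10_000]  # 10,000 simulated sample means
histogram(means, bins=:scott, label="sample means")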
Margin of Error
The margin of error is the amount added to and subtracted from the point estimate in a confidence interval: $ME = z^\star \times SE$.
Confidence Interval for difference between two means
All confidence intervals keep the same form: point estimate $\pm\; z^\star \times SE$.
Our point estimate becomes: $\bar{x}_1 - \bar{x}_2$.
And our standard error changes slightly: $SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$.
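As a minimal sketch (with made-up summary statistics), the interval can be computed directly from these formulas:
# 95% confidence interval for the difference of two means (made-up summaries).
xbar1, s1, n1 = 7.2, 1.5, 50
xbar2, s2, n2 = 6.6, 1.8, 45
se = sqrt(s1^2/n1 + s2^2/n2)       # standard error of xbar1 - xbar2
z_star = 1.96                      # critical value for 95% confidence
point_estimate = xbar1 - xbar2
(point_estimate - z_star*se, point_estimate + z_star*se)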
p-value
A p-value is the probability of observing data at least as favorable to $H_A$ as our current data set, if in fact $H_0$ were true.
If a p-value is low (usually lower than 5%) then we are able to reject $H_0$.
Calculating a p-value
The p-value can be calculated from the percentile of the normal distribution given $\bar{x}$, $\sigma$, and $\mu$; for a greater-than alternative it is one minus that percentile.
We can also represent this in terms of Z: $Z = \frac{\bar{x} - \mu}{\sigma}$, with the p-value being the tail area beyond $Z$.
We can also implement this in Julia/R:
# Note the '1-' to account for the > rather than <.
1-pnorm(9.7,8,0.5)
0.0003369292656768552
Simulating for a p-value
We can compare our results to a simulation to calculate a p-value:
# Randomly assign each of success_count 'A' outcomes and fail_count 'B' outcomes
# to one of two groups, simulating random assignment under the null hypothesis.
function simulate(success_count, fail_count)
    g1 = []
    g2 = []
    for i in 1:success_count
        if rand(1:2) == 1
            push!(g1, 'A')
        else
            push!(g2, 'A')
        end
    end
    for i in 1:fail_count
        if rand(1:2) == 1
            push!(g1, 'B')
        else
            push!(g2, 'B')
        end
    end
    return (g1, g2)
end
# Repeat the simulation and record the difference in the proportion of 'A'
# outcomes between the two groups.
differences = []
simulation_count = 10000
for i in 1:simulation_count
    g1, g2 = simulate(35, 13)
    append!(differences, (length(findall(g1 .== 'A'))/length(g1)) - (length(findall(g2 .== 'A'))/length(g2)))
end
gr()
histogram(differences, bins=:scott, labels=["difference"])
plot!(title = "Frequency of Difference over $(simulation_count) simulations")
# Calculate P value
length(findall(x -> (x >= 0.3) || (x <= -0.3),differences))/length(differences)
0.0229
t-distribution
When the sample size is too small to use the Central Limit Theorem, we need to use t-distributions.
They are always centered at zero.
They have a single parameter, degrees of freedom (df).
plot(Normal(0,1), label="Normal(0,1)")
plot!(TDist(2), label="TDist(2)")
plot!(TDist(5), label="TDist(5)")
plot!(TDist(10), label="TDist(10)")
xlims!(-4,4)
The p value is still calculated as the area under the t-distribution.
using Rmath
2*pt(4.94,9, false)
0.0008022393577614288
Sources
- Slides from Mine Çetinkaya-Rundel of OpenIntro adapted under a CC BY-SA license by Manjeet Rege.