Statistical Data Analysis

Series: sci November 10, 2018

Overview of Types of Variables

  • All Variables
    • Numerical
      • Continuous
      • Discrete
    • Categorical / Nominal
      • Regular Categorical
      • Ordinal

Associated vs Independent Variables

When two variables show some connection with one another, they are called associated (or dependent) variables.

If two variables are not associated, then they are independent.

Explanatory and Response Variables

An explanatory variable is the variable in a pair that we suspect affects the other. Labeling a variable as explanatory does not guarantee that the relationship between the two is actually causal. It is merely a label so we can keep track of which variable we suspect affects the other.

In a pair of variables, where one is an explanatory variable, the other is the response variable.

One could say that an explanatory variable might affect a response variable.

Observational Studies and Experiments

An observational study occurs when researchers observe subjects, as opposed to imposing a treatment on their subjects.

In an experiment, researchers randomly assign subjects to various treatments in order to establish causal connections between the explanatory and response variables.

Sampling and Sources of Bias

A census is when we sample the entire population.

It is difficult to take a census.

Sampling Bias

There are multiple reasons for sampling bias:

Non-response bias can occur when only a small fraction of the sampled people respond to a survey, so that the respondents may no longer be representative of the population.

Voluntary response bias occurs when the sample consists of people who volunteer to respond because they have strong opinions, so the responses are not representative of the population.

Convenience samples over-represent people who are easily accessible relative to the population as a whole.

It is possible to have a large sample, but for that sample to have a bias that leads to significant issues with the conclusions we can draw from the sample.

Almost all statistical methods are based on the notion of implied randomness.

Common Sampling Techniques

Simple Random Samples randomly select cases from the entire population, where there is no implied connection between the points selected.

Stratified Samples are samples made up of random samples from non-overlapping subgroups. Each subgroup is called a stratum (plural, strata).

Cluster Samples are samples where the researcher divides the population into groups called clusters. The clusters are formed so that each one has a makeup similar to the population as a whole. Once the clusters are created, we take a random sample of clusters and then a simple random sample from within each selected cluster.
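The three techniques can be compared on a toy population. The following is a minimal sketch, assuming a hypothetical population of 100 people split evenly across 4 regions; the population, the region labels, and the sample sizes are invented for illustration. For the cluster sample it follows the common convention of first choosing clusters at random and then sampling only within the chosen ones.

```julia
using Random

rng = MersenneTwister(42)   # fixed seed so the sketch is reproducible

# Hypothetical population: 100 people, 25 in each of 4 regions.
population = 1:100
region = [(i - 1) ÷ 25 + 1 for i in population]

# Simple random sample: every case has the same chance of being selected.
srs = shuffle(rng, collect(population))[1:12]

# Stratified sample: a simple random sample is taken inside every stratum.
stratified = Int[]
for r in 1:4
    members = [i for i in population if region[i] == r]
    append!(stratified, shuffle(rng, members)[1:3])
end

# Cluster sample: pick 2 of the 4 clusters at random, then sample within those.
chosen_clusters = shuffle(rng, 1:4)[1:2]
cluster = Int[]
for r in chosen_clusters
    members = [i for i in population if region[i] == r]
    append!(cluster, shuffle(rng, members)[1:6])
end
```

All three samples contain 12 people, but only the stratified sample is guaranteed to include every region.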

Experimental Design

Principles of Experimental Design

Researchers attempt to control for differences between treatment groups.

Researchers randomize patients into treatment groups to account for variables that cannot be controlled.

The more cases we observe, the more accurately we can estimate the effect of the explanatory variable on the response. In a single study, we replicate by collecting a sufficiently large sample. We can also replicate an entire study to verify earlier findings.

When researchers are aware of other variables (other than the explanatory variable) that may influence the result, they may first group individuals based on this variable into blocks and then randomize cases within each block. Blocking is like stratifying.
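Blocking followed by randomization can be sketched in a few lines. This is an illustrative example only: the two blocks ("low" and "high" risk) and the subject IDs are hypothetical.

```julia
using Random

rng = MersenneTwister(7)   # fixed seed so the sketch is reproducible

# Hypothetical subjects grouped into blocks by a variable we want to account for.
blocks = Dict("low" => collect(1:10), "high" => collect(11:20))

treatment = Int[]
control = Int[]
for (name, ids) in blocks
    shuffled = shuffle(rng, ids)          # randomize within the block
    half = length(ids) ÷ 2
    append!(treatment, shuffled[1:half])  # first half of the block gets the treatment
    append!(control, shuffled[half+1:end])
end
```

Because the split happens inside each block, both treatment groups end up with the same number of low-risk and high-risk subjects.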

Experimental Terminology

Placebos are fake treatments administered to the control group in medical studies.

Blinding is when experimental units do not know if they are in the control or treatment groups.

Double-blinding is when both experimental units and the researchers do not know who’s in the control and who’s in the treatment group.

Random Assignment vs Random Sampling

|                    | Random Assignment | No Random Assignment |
|--------------------|-------------------|----------------------|
| Random Sampling    | Causal and Generalizable (Ideal Experiment) | Not Causal but Generalizable (Most Observational Studies) |
| No Random Sampling | Causal but not Generalizable (Most Experiments) | Neither Causal nor Generalizable (Bad Observational Studies) |


Variance

Variance is roughly the average squared deviation from the mean.

We use the squared deviation to get rid of negatives, so that observations equally distant from the mean are weighted equally, and to weigh larger deviations more heavily.

Standard Deviation

Standard Deviation is the square root of the variance.

It has the same units as the data.

Interquartile Range

The 25th percentile is also called the first quartile or Q1.

The 50th percentile is also called the median.

The 75th percentile is also called the third quartile or Q3.

The distance between Q1 and Q3, $Q_3 - Q_1$, is called the Interquartile Range or IQR.
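These measures of spread can be computed directly. The sketch below uses a small, made-up data set and Julia's `Statistics` standard library; it also shows why variance is only "roughly" the average squared deviation, since the sample variance divides by n - 1 rather than n.

```julia
using Statistics

data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 11.0]   # hypothetical observations

m = mean(data)

# Sample variance: sum of squared deviations divided by n - 1.
s2 = sum((data .- m) .^ 2) / (length(data) - 1)

# Standard deviation: square root of the variance, in the units of the data.
s = sqrt(s2)

# Quartiles and the interquartile range.
q1, med, q3 = quantile(data, [0.25, 0.50, 0.75])
iqr_value = q3 - q1
```

The hand-computed `s2` and `s` agree with the built-in `var(data)` and `std(data)`.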

Mean vs Median

If the distribution is skewed, we can determine whether it is right-skewed or left-skewed based on the difference between the mean and the median.

Right-skewed is when the mean > the median.

Left-skewed is when the mean < the median.
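A quick check with made-up numbers illustrates the rule. In this sketch a single large value creates a long right tail, and mirroring the data creates a long left tail.

```julia
using Statistics

# Hypothetical data: the value 20 pulls the mean to the right of the median.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 20]
right_check = mean(right_skewed) > median(right_skewed)

# Negating every value mirrors the distribution, giving a long left tail.
left_skewed = -right_skewed
left_check = mean(left_skewed) < median(left_skewed)
```

Both checks evaluate to `true`: the mean is dragged toward the tail, while the median stays near the bulk of the data.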

Null vs Alternative Hypothesis

To claim that something interesting is happening in your data, you must first show that the data would be unlikely if nothing were going on.

The hypothesis that nothing interesting is happening is the Null Hypothesis. This represents the status quo.

Once the data provide enough evidence to reject the Null Hypothesis, you may conclude in favor of the Alternative Hypothesis.

The Alternative Hypothesis represents the research question we’re testing for.

Normal Distribution

A normal distribution is a unimodal, symmetric, bell-shaped curve.

We write this as $N(\mu, \sigma)$, where:

$\mu$: mean

$\sigma$: standard deviation

# A graph of a normal distribution with mean 3 and standard deviation 5
using Plots
using StatsPlots      # plot recipes for distributions (the package was formerly named StatPlots)
using Distributions
plot(Normal(3, 5), title="Normal Distribution N(3,5)", lw=3)


Lots of things have nearly normal distributions. SAT scores are distributed nearly normally with mean 1500 and standard deviation of 300.

# A graph of the SAT distribution
μ, σ = 1500, 300   # SAT mean and standard deviation
plot(Normal(μ, σ), title="Normal Distribution N($(μ),$(σ))", lw=3, label="SAT Distribution")
vline!([1200, 1800], label="1 Standard Deviation")
vline!([900, 2100], label="2 Standard Deviations")
vline!([600, 2400], label="3 Standard Deviations")


Z Scores

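Z scores are a way of standardizing scores across normal distributions. The Z score of an observation $x$ is the number of standard deviations it lies above or below the mean: $Z = \frac{x - \mu}{\sigma}$.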

We could use them, for example, to compare an ACT and SAT score, both of which have nearly normal distributions.
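As a rough illustration of that comparison, the sketch below standardizes one SAT score and one ACT score. The SAT parameters come from the text above; the ACT mean of 21 and standard deviation of 5, as well as the two scores themselves, are assumptions made for this example.

```julia
# Z score: how many standard deviations an observation lies from the mean.
zscore(x, μ, σ) = (x - μ) / σ

sat_z = zscore(1800, 1500, 300)   # SAT: mean 1500, sd 300 (from the text)
act_z = zscore(24, 21, 5)         # ACT: mean 21, sd 5 (assumed for illustration)

# The larger Z score marks the better result relative to its own distribution.
sat_is_better = sat_z > act_z
```

Under these assumed parameters, an SAT score of 1800 sits further above its mean than an ACT score of 24 does.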


A percentile is the percentage of observations that fall below a given datapoint.

Graphically, percentile is the area below the probability distribution curve to the left of that observation.

# A graph of the SAT distribution, titled with the percentile of a score of 1800

# Truncate num to the given number of significant digits
function signifChop(num, digits)
    if num == 0.0
        return num
    else
        e = ceil(log10(abs(num)))
        scale = 10^(digits - e)
        return trunc(num * scale) / scale
    end
end

using Rmath
μ, σ = 1500, 300   # SAT mean and standard deviation
percentile = signifChop(pnorm(1800, μ, σ) * 100, 3)
plot(Normal(μ, σ), title="Normal Distribution N($(μ),$(σ)) to $(percentile) percentile", lw=3, label="SAT Distribution", fillrange=0)


# We can calculate percentiles with R/Julia:
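One possible way to do this in Julia is with the `cdf` and `quantile` functions from Distributions.jl, sketched below for the SAT distribution from the text. The score of 1800 and the 90th percentile are example inputs chosen for illustration.

```julia
using Distributions

sat = Normal(1500, 300)              # SAT distribution: mean 1500, sd 300

# Percentile of a score: area under the curve to the left of that score.
pct_1800 = cdf(sat, 1800) * 100

# The reverse lookup: the score that sits at a given percentile.
score_90 = quantile(sat, 0.90)
```

In R, the corresponding calls would be `pnorm(1800, 1500, 300)` and `qnorm(0.90, 1500, 300)`.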

Central Limit Theorem

In some situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution even if the original variables themselves are not normally distributed.

This means we can apply statistical methods built for normal distributions to data that are not themselves normally distributed.


  1. Independence: Sampled observations must be independent.
    • Random sample/assignment.
    • If sampling without replacement, then n must be less than 10% of the population.
  2. Sample and Skew: Either the population distribution is normal, or the sample size must be large (usually n>30).
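The theorem can be checked with a small simulation. The sketch below assumes a strongly right-skewed population (exponential with mean 1 and standard deviation 1) and a sample size of 50; both choices are arbitrary and only meant to satisfy the conditions listed above.

```julia
using Random, Statistics

rng = MersenneTwister(1)   # fixed seed so the sketch is reproducible

n = 50             # observations per sample (above the n > 30 rule of thumb)
trials = 10_000    # number of independent samples

# randexp draws from an exponential population with mean 1 and sd 1.
sample_means = [mean(randexp(rng, n)) for _ in 1:trials]

center = mean(sample_means)   # expected to be close to the population mean, 1
spread = std(sample_means)    # expected to be close to 1 / sqrt(n)
```

A histogram of `sample_means` looks approximately bell-shaped even though the individual observations are heavily skewed.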

Margin of Error

Confidence Interval for difference between two means

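All confidence intervals keep the same form:

$\text{point estimate} \pm ME$, where the margin of error is $ME = z^\star \times SE$

Our point estimate becomes:

$\bar{x}_1 - \bar{x}_2$

And our standard error changes slightly:

$SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$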
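A worked example may make the arithmetic concrete. The summary statistics below are hypothetical, and the sketch assumes samples large enough to use the normal critical value of 1.96 for a 95% interval.

```julia
# Hypothetical summary statistics for two independent groups.
mean1, sd1, n1 = 52.1, 45.1, 505
mean2, sd2, n2 = 27.1, 26.4, 667

z_star = 1.96   # critical value for a 95% interval (normal approximation)

point_estimate = mean1 - mean2
se = sqrt(sd1^2 / n1 + sd2^2 / n2)     # standard error of the difference
margin_of_error = z_star * se

lower = point_estimate - margin_of_error
upper = point_estimate + margin_of_error
```

With these made-up numbers the whole interval lies above zero, which would suggest a real difference between the two group means.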


A p-value is the probability of observing data at least as favorable to $H_A$ as our current data set, if in fact $H_0$ were true.

If a p-value is low (usually lower than 5%) then we are able to reject $H_0$.

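Calculating a p-value

The p-value can be calculated from the percentile of the normal distribution given $\bar{x}$, $\sigma$, and $\mu$. For a one-sided test in which larger sample means favor $H_A$:

$p = P(\bar{X} > \bar{x} \mid H_0) = 1 - P(\bar{X} < \bar{x} \mid H_0)$

We can also represent this in terms of Z:

$p = P(Z > z), \quad z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$

where $n$ is the sample size and $\sigma / \sqrt{n}$ is the standard error of the mean.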

We can also implement this in Julia/R:

# Note the '1-' to account for the > rather than <.
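A minimal sketch of such a calculation, using `cdf` and `ccdf` from Distributions.jl, is shown below. The null mean, sample mean, standard deviation, and sample size are all hypothetical values chosen for illustration, and the test is assumed to be one-sided (upper tail).

```julia
using Distributions

# Hypothetical one-sided test of H0: mean = 3 against HA: mean > 3.
mu0 = 3.0      # mean under the null hypothesis
xbar = 3.2     # observed sample mean
s = 1.74       # sample standard deviation
n = 206        # sample size

se = s / sqrt(n)   # standard error of the sample mean

# cdf gives the area to the left (<), so '1 -' turns it into the area to the right (>).
p_value = 1 - cdf(Normal(mu0, se), xbar)

# The same tail area expressed through the Z score and the standard normal.
z = (xbar - mu0) / se
p_from_z = ccdf(Normal(), z)
```

Both routes give the same p-value, since standardizing only rescales the horizontal axis.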

Simulating for a p-value

We can compare our results to a simulation to calculate a p-value:

# Randomly split the successes ('A') and failures ('B') between two groups
function simulate(success_count, fail_count)
    g1 = Char[]
    g2 = Char[]

    for i in 1:success_count
        if rand(1:2) == 1
            push!(g1, 'A')
        else
            push!(g2, 'A')
        end
    end
    for i in 1:fail_count
        if rand(1:2) == 1
            push!(g1, 'B')
        else
            push!(g2, 'B')
        end
    end

    return (g1, g2)
end

# Difference in the proportion of 'A' between the two groups, one value per simulation
differences = Float64[]
simulation_count = 10000
for i in 1:simulation_count
    g1, g2 = simulate(35, 13)
    push!(differences, count(c -> c == 'A', g1) / length(g1) - count(c -> c == 'A', g2) / length(g2))
end

histogram(differences, bins=:scott, label="difference")
plot!(title = "Frequency of Difference over $(simulation_count) simulations")


# Two-sided p-value: share of simulated differences at least as extreme as ±0.3
count(x -> abs(x) >= 0.3, differences) / length(differences)


When the sample size is too small to use the Central Limit Theorem, we need to use t-distributions.

They are always centered at zero.

They have a single parameter, degrees of freedom (df).

plot(Normal(0,1), label="Normal(0,1)")
plot!(TDist(2), label="TDist(2)")
plot!(TDist(5), label="TDist(5)")
plot!(TDist(10), label="TDist(10)")


The p-value is still calculated as the tail area under the t-distribution.

using Rmath
# Two-sided p-value for t = 4.94 with 9 degrees of freedom (false selects the upper tail)
2 * pt(4.94, 9, false)
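The same tail area can also be obtained without Rmath. The sketch below is one alternative, assuming Distributions.jl is available; it reuses the test statistic of 4.94 and the 9 degrees of freedom from the call above.

```julia
using Distributions

t_stat = 4.94   # test statistic from the example above
df = 9          # degrees of freedom from the example above

# ccdf gives the upper-tail area; doubling it makes the test two-sided.
p_two_sided = 2 * ccdf(TDist(df), t_stat)

# Equivalent form written with the lower-tail cdf.
p_check = 2 * (1 - cdf(TDist(df), t_stat))
```

`ccdf` is usually preferable to `1 - cdf` for small tail areas because it avoids subtracting two nearly equal numbers.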


  • Slides from Mine Çetinkaya-Rundel of OpenIntro adapted under a CC BY-SA license by Manjeet Rege.
