# Statistical Data Analysis

Series: sci November 10, 2018

## Overview of Types of Variables

- All Variables
  - Numerical
    - Continuous
    - Discrete
  - Categorical / Nominal
    - Regular Categorical
    - Ordinal

## Associated vs Independent Variables

When two variables show some connection with one another, they are called **associated** (or **dependent**) variables.

If two variables are not associated, then they are **independent**.

## Explanatory and Response Variables

An **explanatory variable** is the variable in a pair of variables which we suspect affects the other. Labeling a variable as explanatory of another variable does not guarantee the relationship between the two is actually causal. It is merely a label so we can keep track of which variable we *suspect* affects the other.

In a pair of variables, where one is an **explanatory variable**, the other is the **response variable**.

One could say that an **explanatory variable** *might affect* a **response variable**.

## Observational Studies and Experiments

An **observational study** occurs when researchers observe subjects, as opposed to imposing a treatment on their subjects.

In an **experiment**, researchers randomly assign subjects to various treatments in order to establish causal connections between the explanatory and response variables.

## Sampling and Sources of Bias

A **census** is when we sample the entire population.

In practice, it is usually difficult and expensive to take a census.

## Sampling Bias

There are multiple reasons for sampling bias:

**Non-response** occurs when only a small fraction of the sample responds to a survey, so the sample may no longer be representative of the population.

**Voluntary response** occurs when the sample consists of people who volunteer to respond because they have strong opinions, causing the responses to not be representative of the population.

**Convenience samples** are samples with a higher proportion of people who are more easily accessible than the complete population.

It is possible to have a large sample, but for that sample to have a bias that leads to significant issues with the conclusions we can draw from the sample.

Almost all statistical methods are based on the notion of implied randomness.

## Common Sampling Techniques

**Simple Random Samples** randomly select cases from the entire population, where there is no implied connection between the points selected.

**Stratified Samples** are samples made up of random samples from non-overlapping subgroups. Each subgroup is called a **stratum** (plural, **strata**).

**Cluster Samples** are samples where the researcher divides the population into groups called **clusters**. Clusters are created such that each one should resemble the overall population. Once the clusters are created, we take a simple random sample from within each cluster.
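The sampling schemes above can be sketched in Julia. The population and the "urban"/"rural" subgroup labels below are made up for illustration:

```julia
using Random

# Hypothetical population: 100 people, each tagged with a subgroup (stratum)
population = [(id=i, stratum=i <= 60 ? "urban" : "rural") for i in 1:100]

# Simple random sample: pick n cases uniformly from the whole population
simple_random_sample(pop, n) = shuffle(pop)[1:n]

# Stratified sample: take a simple random sample within each subgroup
function stratified_sample(pop, n_per_stratum)
    strata = unique(p.stratum for p in pop)
    reduce(vcat, [shuffle(filter(p -> p.stratum == s, pop))[1:n_per_stratum] for s in strata])
end

srs = simple_random_sample(population, 10)     # 10 cases, subgroups not guaranteed
strat = stratified_sample(population, 5)       # exactly 5 urban and 5 rural cases
```

Note that the stratified sample guarantees representation from every stratum, while the simple random sample does not.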

## Experimental Design

### Principles of Experimental Design

Researchers attempt to **control** for differences between treatment groups.

Researchers **randomize** patients into treatment groups to account for variables that cannot be controlled.

The more cases we observe, the more we can estimate the effect of the *explanatory variable* on the response. In a single study, we **replicate** by collecting a sufficiently large sample. We can also **replicate** an entire study to verify earlier findings.

When researchers are aware of other variables (other than the *explanatory variable*) that may influence the result, they may first group individuals based on this variable into **blocks** and then randomize cases within each block. Blocking is like *stratifying*.

### Experimental Terminology

**Placebos** are fake treatments administered to the control group in medical studies.

**Blinding** is when experimental units do not know if they are in the control or treatment groups.

**Double-blinding** is when both the experimental units and the researchers do not know who is in the control group and who is in the treatment group.

## Random Assignment vs Random Sampling

| X | Random Assignment | No Random Assignment |
|---|---|---|
| **Random Sampling** | Causal and generalizable (ideal experiment) | Not causal but generalizable (most observational studies) |
| **No Random Sampling** | Causal but not generalizable (most experiments) | Neither causal nor generalizable (bad observational studies) |

## Variance

**Variance** is roughly the average squared deviation from the mean.

We use the squared deviation to get rid of negatives, so that observations equally distant from the mean are weighted equally, and to weigh larger deviations more heavily.
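A quick sketch of this definition in Julia, using a small made-up sample. Note that the sample variance divides by $n-1$ rather than $n$, which is why it is only *roughly* the average squared deviation:

```julia
using Statistics

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # invented data
m = mean(x)

# Sum of squared deviations from the mean, divided by n-1
manual_var = sum((xi - m)^2 for xi in x) / (length(x) - 1)

manual_var ≈ var(x)       # Statistics.var uses the same n-1 denominator
std(x) ≈ sqrt(var(x))     # standard deviation is the square root of variance
```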

## Standard Deviation

**Standard Deviation** is the square root of the variance.

It has the same units as the data.

## Interquartile Range

The 25th percentile is also called the **first quartile** or **Q1**.

The 50th percentile is also called the **median**.

The 75th percentile is also called the **third quartile** or **Q3**.

The range between **Q1** and **Q3** is called the **Interquartile Range** or **IQR**.
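These quantities can be computed directly with the `Statistics` standard library (the data below are made up):

```julia
using Statistics

x = [1, 3, 5, 7, 9, 11, 13, 15, 17]  # invented data

q1 = quantile(x, 0.25)   # first quartile  -> 5.0
med = median(x)          # 50th percentile -> 9.0
q3 = quantile(x, 0.75)   # third quartile  -> 13.0
iqr = q3 - q1            # interquartile range -> 8.0
```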

## Mean vs Median

If the distribution is skewed, we can determine if it is right-skewed or left-skewed based on the difference between the mean and the median.

**Right-skewed** is when the mean > the median.

**Left-skewed** is when the mean < the median.
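A small made-up example: a long right tail of large values pulls the mean above the median:

```julia
using Statistics

# Right-skewed sample: one large value (50) drags the mean upward
right_skewed = [1, 2, 2, 3, 3, 3, 4, 50]

mean(right_skewed)    # 8.5
median(right_skewed)  # 3.0
mean(right_skewed) > median(right_skewed)  # true, so right-skewed
```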

## Null vs Alternate Hypothesis

To claim there is something interesting in your data, you must first rule out the possibility that nothing is going on.

The hypothesis that nothing interesting is happening is the **Null Hypothesis** ($H_0$). This represents the status quo.

Only once you have rejected the **Null Hypothesis** may you accept the **Alternative Hypothesis**.

The **Alternative Hypothesis** ($H_A$) represents the research question we're testing for.

## Normal Distribution

A normal distribution is a unimodal, symmetric, bell-shaped curve.

We write this as $N(\mu, \sigma)$, where:

- $\mu$: mean
- $\sigma$: standard deviation

```
# A graph of a normal distribution with mean 3 and standard deviation 5
using Plots
using StatPlots
using Distributions
gr()
plot(Normal(3, 5), title="Normal Distribution N(3,5)", lw=3)
```


Lots of things have nearly normal distributions. SAT scores are distributed nearly normally with mean 1500 and standard deviation of 300.

```
# A graph of the SAT distribution
μ = 1500
σ = 300
plot(Normal(μ, σ), title="Normal Distribution N($(μ),$(σ))", lw=3, label="SAT Distribution")
vline!([1200, 1800], label="1 Standard Deviation")
vline!([900, 2100], label="2 Standard Deviations")
vline!([600, 2400], label="3 Standard Deviations")
```

## Z Scores

Z scores are a way of standardizing scores across normal distributions.

We could use them, for example, to compare an ACT and SAT score, both of which have nearly normal distributions.
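A Z score is the number of standard deviations an observation lies above or below the mean: $Z = (x - \mu)/\sigma$. A sketch of the SAT/ACT comparison in Julia, using the SAT parameters from the text; the ACT mean of 21 and standard deviation of 5 are assumed round numbers for illustration:

```julia
# SAT parameters from the text; ACT parameters are assumed for illustration
sat_score, sat_mean, sat_sd = 1800, 1500, 300
act_score, act_mean, act_sd = 24, 21, 5

# Standardize: how many standard deviations above the mean is each score?
zscore(x, mu, sigma) = (x - mu) / sigma

z_sat = zscore(sat_score, sat_mean, sat_sd)  # 1.0
z_act = zscore(act_score, act_mean, act_sd)  # 0.6

z_sat > z_act  # the SAT score is further above its mean, so it is stronger
```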

## Percentiles

A **percentile** is the percentage of observations that fall below a given datapoint.

Graphically, percentile is the area below the probability distribution curve to the left of that observation.

```
# A graph of the SAT distribution, shaded up to a score of 1800
function signifChop(num, digits)
    if num == 0.0
        return num
    else
        e = ceil(log10(abs(num)))
        scale = 10^(digits - e)
        return trunc(num * scale) / scale
    end
end
μ = 1500
σ = 300
using Rmath
percentile = signifChop(pnorm(1800, μ, σ) * 100, 3)
plot(Normal(μ, σ), title="Normal Distribution N($(μ),$(σ)) to $(percentile) percentile", lw=3, label="SAT Distribution", fillrange=0)
xlims!((0, 1800))
```

```
# We can calculate percentiles with R/Julia:
pnorm(1800, μ, σ)
```

```
0.8413447460685429
```

## Central Limit Theorem

In some situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution even if the original variables themselves are not normally distributed.

This means we can use statistical methods on normal distributions to analyze data that does not begin as a normal distribution.

### Conditions

**Independence**: Sampled observations must be independent.

- Random sample/assignment.
- If sampling without replacement, then n must be less than 10% of the population.

**Sample size and skew**: Either the population distribution is normal, or the sample size must be large (usually n > 30).
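A quick simulation illustrating the theorem: means of samples drawn from a strongly right-skewed exponential distribution still cluster in a nearly normal shape around the population mean. The exponential draws below use inverse-transform sampling so only the standard library is needed:

```julia
using Statistics, Random

Random.seed!(1)  # fixed seed so the simulation is reproducible

# Exponential(1) draws via inverse-transform sampling: right-skewed, mean 1
draw_exponential(n) = -log.(1 .- rand(n))

# Take the mean of many samples of size n = 50 (> 30)
sample_means = [mean(draw_exponential(50)) for _ in 1:10_000]

# The sample means are approximately N(1, 1/sqrt(50)), even though
# the underlying distribution is far from normal
mean(sample_means)  # ≈ 1.0
std(sample_means)   # ≈ 1/sqrt(50) ≈ 0.14
```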

## Margin of Error

The **margin of error** is the distance from the point estimate to each edge of a confidence interval:

$$ME = z^{\star} \times SE$$

## Confidence Interval for Difference Between Two Means

All confidence intervals keep the same form:

$$\text{point estimate} \pm z^{\star} \times SE$$

Our point estimate becomes the difference of the sample means:

$$\bar{x}_1 - \bar{x}_2$$

And our standard error changes slightly:

$$SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$
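A minimal Julia sketch of a confidence interval for the difference between two means. The two samples are invented, the standard error is taken as $\sqrt{s_1^2/n_1 + s_2^2/n_2}$, and $z^{\star} = 1.96$ is assumed for a 95% interval:

```julia
using Statistics

# Hypothetical samples from two groups (data invented for illustration)
g1 = [4.2, 5.1, 6.0, 5.5, 4.8, 5.9, 6.3, 5.0, 5.7, 4.9]
g2 = [3.9, 4.4, 5.1, 4.0, 4.7, 4.2, 5.0, 4.5, 4.1, 4.6]

# Point estimate: difference of the sample means
point_estimate = mean(g1) - mean(g2)

# Standard error for a difference of two means
se = sqrt(var(g1) / length(g1) + var(g2) / length(g2))

z_star = 1.96  # critical value for a 95% confidence interval
ci = (point_estimate - z_star * se, point_estimate + z_star * se)
```

If the interval does not contain zero, the data suggest a real difference between the two group means at that confidence level.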

## p-value

A **p-value** is the probability of observing data at least as favorable to $H_A$ as our current data set, if in fact $H_0$ were true.

If a **p-value** is *low* (usually lower than 5%) then we are able to *reject* $H_0$.

### Calculating a p-value

The **p-value** can be calculated as the percentile of the normal distribution given $\bar{x}$, $\sigma$, and $\mu$:

$$\text{p-value} = P(X > \bar{x}) = 1 - P(X < \bar{x})$$

We can also represent this in terms of Z:

$$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$

We can also implement this in Julia/R:

```
# Note the '1-' to account for the > rather than <.
1-pnorm(9.7,8,0.5)
```

```
0.0003369292656768552
```

### Simulating for a p-value

We can compare our results to a simulation to calculate a p-value:

```
# Randomly assign success_count 'A's and fail_count 'B's to two groups
function simulate(success_count, fail_count)
    g1 = []
    g2 = []
    for i in 1:success_count
        if rand(1:2) == 1
            append!(g1, 'A')
        else
            append!(g2, 'A')
        end
    end
    for i in 1:fail_count
        if rand(1:2) == 1
            append!(g1, 'B')
        else
            append!(g2, 'B')
        end
    end
    return (g1, g2)
end

differences = []
simulation_count = 10000
for i in 1:simulation_count
    g1, g2 = simulate(35, 13)
    append!(differences, (length(findall(g1 .== 'A')) / length(g1)) - (length(findall(g2 .== 'A')) / length(g2)))
end

gr()
histogram(differences, bins=:scott, labels=["difference"])
plot!(title = "Frequency of Difference over $(simulation_count) simulations")
```

```
# Calculate P value
length(findall(x -> (x >= 0.3) || (x <= -0.3),differences))/length(differences)
```

```
0.0229
```

## t-distribution

When the sample size is too small to use the Central Limit Theorem, we need to use t-distributions.

They are always centered at zero.

They have a single parameter, **degrees of freedom** (*df*).

```
plot(Normal(0,1), label="Normal(0,1)")
plot!(TDist(2), label="TDist(2)")
plot!(TDist(5), label="TDist(5)")
plot!(TDist(10), label="TDist(10)")
xlims!(-4,4)
```

The p value is still calculated as the area under the t-distribution.

```
using Rmath
2*pt(4.94,9, false)
```

```
0.0008022393577614288
```

## Sources

- Slides from Mine Çetinkaya-Rundel of OpenIntro adapted under a CC BY-SA license by Manjeet Rege.