statistics - CJKinni's Notes
https://cjkinni.com/statistics/index.xml
Generated with Hugo (gohugo.io), en-us. Sat, 10 Nov 2018 00:00:00 +0000

Associated vs Independent Variables
https://cjkinni.com/statistics/associated-vs-independent-variables/
When two variables show some connection with one another, they are called associated (or dependent) variables.
If two variables are not associated, then they are independent.

Central Limit Theorem
https://cjkinni.com/statistics/centeral-limit-theorem/
In some situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution even if the original variables themselves are not normally distributed.
This means we can use statistical methods on normal distributions to analyze data that does not begin as a normal distribution.
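A quick simulation illustrates the effect; this is a minimal sketch in Python (the notes elsewhere use Julia), averaging draws from a skewed exponential population:

```python
import random
import statistics

random.seed(42)

# The exponential population (rate 1) is strongly right-skewed,
# with mean 1 and standard deviation 1.
# Each sample mean averages 50 independent draws.
sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(50))
    for _ in range(2000)
]

# The distribution of the sample means is approximately
# N(1, 1/sqrt(50)), even though the population is not normal.
print(round(statistics.mean(sample_means), 2))   # close to 1.0
print(round(statistics.stdev(sample_means), 2))  # close to 1/sqrt(50), about 0.14
```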
Conditions:
- Independence: sampled observations must be independent.
- Random sample/assignment.
- If sampling without replacement, n must be less than 10% of the population.

Confidence Interval for Difference Between Means
https://cjkinni.com/statistics/confidence-interval-for-difference-between-means/
All confidence intervals have the same form:
$$\text{Confidence Interval} = \text{Point Estimate} \pm ME$$
$$ME = \text{Critical Value} \times \text{SE of Point Estimate}$$
Our point estimate becomes:
$$\text{Point Estimate} = \bar{x}_1 - \bar{x}_2$$
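Numerically, the pieces combine like this; a sketch in Python with hypothetical summary statistics, using the two-sample standard error given in this note and 1.96 as the 95% critical value:

```python
import math

# Hypothetical sample summaries (mean, standard deviation, size).
xbar1, s1, n1 = 52.1, 4.5, 100
xbar2, s2, n2 = 50.3, 5.0, 120

point_estimate = xbar1 - xbar2
se = math.sqrt(s1**2 / n1 + s2**2 / n2)
margin_of_error = 1.96 * se  # critical value for a 95% interval

lower = point_estimate - margin_of_error
upper = point_estimate + margin_of_error
print(round(lower, 2), round(upper, 2))  # 0.54 3.06
```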
And our standard error changes slightly:
$$SE_{\bar{x}_1-\bar{x}_2}=\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}$$

Explanatory and Response Variables
https://cjkinni.com/statistics/explanatory-and-response-variables/
An explanatory variable is one variable in a pair of variables which we suspect affects the other. Labeling a variable as explanatory does not guarantee that the relationship between the two is actually causal; it is merely a label so we can keep track of which variable we suspect affects the other.
In a pair of variables, where one is an explanatory variable, the other is the response variable.

Interquartile Range
https://cjkinni.com/statistics/interquartile-range/
The 25th percentile is also called the first quartile or Q1.
The 50th percentile is also called the median.
The 75th percentile is also called the third quartile or Q3.
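These can be computed directly from a sample; a minimal Python sketch using the standard library (`statistics.quantiles` with `n=4` returns Q1, the median, and Q3):

```python
import statistics

# 11 ordered observations, chosen so the quartile positions
# fall exactly on data points.
data = list(range(1, 12))  # 1, 2, ..., 11

q1, median, q3 = statistics.quantiles(data, n=4)
print(q1, median, q3)  # 3.0 6.0 9.0
```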
The distance between Q1 and Q3 is called the Interquartile Range or IQR: $IQR = Q_3 - Q_1$.

Margin of Error
https://cjkinni.com/statistics/margin-of-error/
$$ME = Z^* \frac{s}{\sqrt{n}}$$

Mean vs Median
https://cjkinni.com/statistics/mean-vs-median/
If the distribution is skewed, we can determine whether it is right-skewed or left-skewed based on the difference between the mean and median.
Right-skewed is when the mean > the median.
Left-skewed is when the mean < the median.

Normal Distribution
https://cjkinni.com/statistics/normal-distribution/
A normal distribution is a unimodal and symmetric, bell-shaped curve.
We write this as:
$$N(\mu, \sigma)$$
𝜇: mean
𝜎: standard deviation
```julia
# A graph of a normal distribution with mean 3 and standard deviation 5
using Plots
using StatPlots
using Distributions
gr()

plot(Normal(3, 5), title="Normal Distribution N(3,5)", lw=3)
```

Lots of things have nearly normal distributions. SAT scores are distributed nearly normally with a mean of 1500 and a standard deviation of 300.

Null vs Alternate Hypothesis
https://cjkinni.com/statistics/null-v-alternate-hypothesis/
To determine that there is something interesting in your data, you must first show that the data are inconsistent with nothing going on.
The hypothesis that nothing interesting is happening is the Null Hypothesis. This represents the status quo.
Once you have rejected the Null Hypothesis, you may examine the Alternate Hypothesis.
The Alternate Hypothesis represents the research question we're testing for.

Observational Studies and Experiments
https://cjkinni.com/statistics/observational-studies-and-experiments/
An observational study occurs when researchers observe subjects, as opposed to imposing a treatment on their subjects.
In an experiment, researchers randomly assign subjects to various treatments in order to establish causal connections between the explanatory and response variables.

p-value
https://cjkinni.com/statistics/p-value/
A p-value is the probability of observing data at least as favorable to $H_A$ as our current data set, if in fact $H_0$ were true.
If a p-value is low (usually lower than 5%) then we are able to reject $H_0$.
Calculating a p-value

The p-value can be calculated as the percentile of the normal distribution given $\bar{x}$, $\sigma$, and $\mu$:
$$P(\bar{x} > 9.7 \mid \mu = 8, \sigma = 0.5) = 0.0003$$
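This probability can be checked with the standard library's `NormalDist` (a Python sketch; the notes elsewhere use Julia's Rmath):

```python
from statistics import NormalDist

# Sampling distribution of the mean under H0: N(mu=8, sigma=0.5).
sampling_dist = NormalDist(mu=8, sigma=0.5)

# P(xbar > 9.7 | H0): the upper-tail area beyond 9.7.
p_value = 1 - sampling_dist.cdf(9.7)
print(round(p_value, 4))  # 0.0003
```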
We can also represent this in terms of Z:
$$Z = \frac{9.7 - 8}{0.5} = 3.4, \quad P(Z > 3.4) = 0.0003$$

Percentiles
https://cjkinni.com/statistics/percentiles/
A percentile is the percentage of observations that fall below a given data point.
Graphically, percentile is the area below the probability distribution curve to the left of that observation.
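For the SAT example used in these notes (scores approximately N(1500, 300)), the percentile of an 1800 score is this left-tail area; a Python sketch:

```python
from statistics import NormalDist

# SAT scores: approximately N(mean 1500, standard deviation 300).
sat = NormalDist(mu=1500, sigma=300)

# Area under the curve to the left of an observation of 1800.
percentile = sat.cdf(1800) * 100
print(round(percentile, 1))  # 84.1
```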
```julia
# A graph of the SAT distribution function
function signifChop(num, digits)
    if num == 0.0
        return num
    else
        e = ceil(log10(abs(num)))
        scale = 10^(digits - e)
        return trunc(num * scale) / scale
    end
end

𝜇 = 1500
𝜎 = 300

using Rmath
percentile = signifChop(pnorm(1800, 𝜇, 𝜎) * 100, 3)
plot(Normal(𝜇, 𝜎), title="Normal Distribution N($(𝜇),$(𝜎)) to $(percentile) percentile",
     lw=3, label=["SAT Distribution"], fillrange=0)
xlims!
```

Principles of Experimental Design
https://cjkinni.com/statistics/principles-of-experimental-design/
Researchers attempt to control for differences between treatment groups.
Researchers randomize patients into treatment groups to account for variables that cannot be controlled.
The more cases we observe, the better we can estimate the effect of the explanatory variable on the response. In a single study, we replicate by collecting a sufficiently large sample. We can also replicate an entire study to verify earlier findings.

Sampling and Sources of Bias
https://cjkinni.com/statistics/sampling-and-sources-of-bias/
A census is when we sample the entire population.
It is difficult to take a census.
Sampling Bias

There are multiple reasons for sampling bias:
Non-response occurs when only a small fraction of the sample responds to a survey; the sample may then no longer be representative of the population.
Voluntary response occurs when the sample consists of people who volunteer to respond because they have strong opinions, causing the responses to not be representative of the population.

Standard Deviation
https://cjkinni.com/statistics/standard-deviation/
Standard Deviation is the square root of the variance.
It has the same units as the data.
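A quick numeric check; Python's `statistics.stdev` uses the same n − 1 divisor as the formula below:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5

# Sum of squared deviations from the mean is 32; dividing by
# n - 1 = 7 and taking the square root gives the sample SD.
sd = statistics.stdev(data)
print(round(sd, 2))  # 2.14
```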
$$SD = \sqrt{\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}}$$

t-distribution
https://cjkinni.com/statistics/t-distribution/
When the sample size is too small to use the Central Limit Theorem, we need to use t-distributions.
They are always centered at zero.
They have a single parameter, degrees of freedom (df).
```julia
plot(Normal(0, 1), label="Normal(0,1)")
plot!(TDist(2), label="TDist(2)")
plot!(TDist(5), label="TDist(5)")
plot!(TDist(10), label="TDist(10)")
xlims!(-4, 4)
```

The p-value is still calculated as the area under the t-distribution.
```julia
using Rmath

# Two-sided p-value for t = 4.94 with df = 9 (upper tail, doubled)
2 * pt(4.94, 9, false)
# 0.0008022393577614288
```

Variance
https://cjkinni.com/statistics/variance/
Variance is roughly the average squared deviation from the mean.
$$s^2=\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}$$
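Evaluated on a small sample (Python's `statistics.variance` also divides by n − 1):

```python
import statistics

data = [1, 2, 3, 4, 5]  # mean is 3

# Squared deviations: 4 + 1 + 0 + 1 + 4 = 10; divided by n - 1 = 4.
s_squared = statistics.variance(data)
print(s_squared)  # 2.5
```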
We use the squared deviation to get rid of negatives, so that observations equally distant from the mean are weighted equally, and to weight larger deviations more heavily.

Z scores
https://cjkinni.com/statistics/z-scores/
Z scores are a way of standardizing scores across normal distributions.
$$Z = \frac{\text{observation} - \text{mean}}{SD}$$
$$Z = \frac{\text{observation} - \text{mean}}{\sqrt{\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}}}$$
We could use them, for example, to compare an ACT and SAT score, both of which have nearly normal distributions.
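A minimal sketch of such a comparison in Python; the SAT parameters come from these notes, while the ACT mean of 21 and standard deviation of 5 are illustrative assumptions:

```python
def z_score(observation, mean, sd):
    # Standardize: distance from the mean in standard deviations.
    return (observation - mean) / sd

# SAT ~ N(1500, 300) (from these notes); ACT ~ N(21, 5) (assumed).
sat_z = z_score(1800, 1500, 300)
act_z = z_score(24, 21, 5)

# The SAT score sits 1 SD above its mean, the ACT score 0.6 SD
# above its mean, so the SAT performance is relatively stronger.
print(sat_z, act_z)  # 1.0 0.6
```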