Looking at some univariate data: summary statistics and histograms
Univariate data can be plotted using histograms, e.g. with matplotlib.pyplot.hist. Histograms can also be calculated (without plotting) using numpy.histogram.
Pre-binned data can be plotted with matplotlib.pyplot.hist by using the bin centres as dummy values to be binned and weights matching the binned frequencies/densities.
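A minimal sketch of both approaches, using a made-up sample (the data, bin count and seed are arbitrary):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(42)
    data = rng.normal(loc=5.0, scale=2.0, size=1000)  # hypothetical sample

    # Plot a histogram directly from the data
    plt.hist(data, bins=20, density=True)
    plt.xlabel('value')
    plt.ylabel('density')
    plt.show()

    # Calculate the histogram without plotting
    densities, edges = np.histogram(data, bins=20, density=True)

    # Re-plot the pre-binned data: bin centres as dummy values, with weights
    centres = 0.5 * (edges[:-1] + edges[1:])
    plt.hist(centres, bins=edges, weights=densities)
    plt.show()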
Statistical errors are due to random measurement errors or randomness of the population being sampled, while systematic errors are non-random and linked to faults in the experiment or limitations/biases in sample collection.
Precise measurements have low relative statistical error, while accurate measurements have low relative systematic error.
Data distributions can be quantified using sample statistics such as the mean, median, variance and standard deviation (the latter two quantifying the width of the distribution), e.g. with the numpy functions mean, median, var and std. Remember to check the degrees of freedom assumed by the variance and standard deviation functions!
Quantities calculated from data, such as the mean, median and variance, are statistics. Hypotheses about the data can be tested by comparing a suitable test statistic with its expected distribution, given the hypothesis and appropriate assumptions.
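For example, with made-up measurements (note that numpy defaults to ddof=0, the population variance; ddof=1 gives the unbiased sample estimate):

    import numpy as np

    data = np.array([4.8, 5.1, 5.3, 4.9, 5.6, 5.0])  # hypothetical measurements

    print(np.mean(data), np.median(data))
    # ddof=1 uses the n-1 denominator for an unbiased sample variance/std. dev.
    print(np.var(data, ddof=1), np.std(data, ddof=1))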
Introducing probability distributions
Probability distributions show how random variables are distributed. Two common distributions are the uniform and normal distributions.
Uniform and normal distributions and many associated functions can be accessed using scipy.stats.uniform and scipy.stats.norm respectively.
The probability density function (pdf) shows the distribution of relative likelihood or frequency of different values of a random variable and can be accessed with the scipy statistical distribution’s pdf method.
The cumulative distribution function (cdf) is the integral of the pdf and shows the cumulative probability for a variable to be equal to or less than a given value. It can be accessed with the scipy statistical distribution’s cdf method.
Quantiles such as percentiles and quartiles give the values of the random variable which correspond to fixed probability intervals (e.g. of 1 per cent and 25 per cent respectively). They can be calculated for a scipy distribution using the ppf or interval methods.
The percent point function (ppf) (ppf method) is the inverse function of the cdf and shows the value of the random variable corresponding to a given quantile in its distribution.
Probability distributions are defined by common types of parameter such as the location and scale parameters. Some distributions also include shape parameters.
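A short sketch of these methods (the location and scale values are arbitrary):

    import scipy.stats as sps

    # Normal distribution with location (mean) 2 and scale (std. dev.) 0.5
    ndist = sps.norm(loc=2.0, scale=0.5)
    print(ndist.pdf(2.5))   # probability density at x = 2.5
    print(ndist.cdf(2.5))   # P(X <= 2.5)
    print(ndist.ppf(0.84))  # x-value at the 84th percentile (inverse of the cdf)

    # Uniform distribution over [loc, loc + scale]
    udist = sps.uniform(loc=0.0, scale=10.0)
    print(udist.cdf(2.5))   # = 0.25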
Random variables
Random variables are drawn from probability distributions. The expectation value (arithmetic mean for an infinite number of sampled variates) is equal to the mean of the distribution function (pdf).
The variance of a random variable is equal to the expectation of the squared variable minus the squared expectation of the variable.
Sums of scaled random variables have expectation values equal to the sum of scaled expectations of the individual variables, and variances equal to the sum of scaled individual variances.
The means and variances of summed random variables lead to the standard error on the mean (the standard deviation of the sample mean), equal to the population standard deviation divided by the square root of the sample size.
scipy.stats distributions have methods to calculate the mean (.mean), variance (.var) and other properties of the distribution.
scipy.stats distributions have a method (.rvs) to generate arrays of random variates drawn from that distribution.
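For example (the distribution parameters, sample size and seed are arbitrary):

    import numpy as np
    import scipy.stats as sps

    dist = sps.norm(loc=1.0, scale=2.0)
    print(dist.mean(), dist.var())  # distribution mean and variance: 1.0 and 4.0

    x = dist.rvs(size=10000, random_state=101)  # array of random variates

    # The sample mean approaches the distribution mean, with a scatter given
    # by the standard error on the mean
    print(x.mean(), dist.std() / np.sqrt(x.size))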
The Central Limit Theorem
Sums of samples of random variates from non-normal distributions with finite mean and variance become asymptotically normally distributed as the sample size increases.
The theorem holds for sums of differently distributed variates, but the speed at which a normal distribution is approached depends on the shape of the variate’s distribution, with symmetric distributions approaching the normal limit faster than asymmetric distributions.
Means of large numbers (e.g. 100 or more) of non-normally distributed measurements are distributed close to normal, with distribution mean equal to the population mean that the measurements are drawn from and standard deviation given by the standard error on the mean.
Distributions of means (or other types of sum) of non-normal random data are closer to normal in their centres than in the tails of the distribution, so the normal assumption is most reliable for smaller deviations of sample mean from the population mean.
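A simple simulation illustrating the theorem, using means of uniform variates (the sample sizes and seed are arbitrary):

    import numpy as np
    import scipy.stats as sps
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    nsamp, nmeas = 10000, 100  # 10000 samples, each of 100 uniform measurements

    # Means of (non-normal) uniform variates on [0, 1]
    means = rng.uniform(0, 1, size=(nsamp, nmeas)).mean(axis=1)

    # Normal limit: mean 0.5, standard deviation = standard error = sqrt(1/12)/sqrt(n)
    se = np.sqrt(1 / 12) / np.sqrt(nmeas)
    x = np.linspace(means.min(), means.max(), 200)
    plt.hist(means, bins=50, density=True)
    plt.plot(x, sps.norm.pdf(x, loc=0.5, scale=se))
    plt.show()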
Significance tests: the z-test - comparing with a population of known mean and variance
Significance testing is used to determine whether a given (null) hypothesis is rejected by the data, by calculating a test statistic and comparing it with the distribution expected for it, under the assumption that the null hypothesis is true.
A null hypothesis is formulated from a physical model (with parameters that are fixed and independent of the experiment) and a statistical model (which governs the probability distribution of the test statistic). Additional assumptions may be required to derive the distribution of the test statistic.
A null hypothesis is rejected if the measured p-value of the test statistic is equal to or less than a pre-defined significance level.
Rejection of the null hypothesis could indicate rejection of either the physical model or the statistical model (or both), with further experiments or tests required to determine which.
For comparing measurements with an expected (population mean) value, a z-statistic can be calculated to compare the sample mean with the expected value, normalising by the standard error on the sample mean, which requires knowledge of the variance of the population that the measurements are sampled from.
The z-statistic should be distributed as a standard normal provided that the sample mean is normally distributed, which may arise for large samples from the central limit theorem, or for any sample size if the measurements are drawn from normal distributions.
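A minimal sketch of the z-test (the data and the assumed population mean and standard deviation are made up):

    import numpy as np
    import scipy.stats as sps

    data = np.array([10.2, 9.8, 10.5, 10.1, 9.9, 10.4, 10.0, 10.3])  # hypothetical
    mu, sigma = 10.0, 0.2  # assumed known population mean and standard deviation

    # z = (sample mean - population mean) / standard error on the sample mean
    z = (data.mean() - mu) / (sigma / np.sqrt(data.size))

    # Two-tailed p-value from the standard normal survival function
    pval = 2 * sps.norm.sf(abs(z))
    print(z, pval)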
Significance tests: the t-test - comparing means when population variance is unknown
A t-statistic, defined from the sample mean and its estimated standard error, follows a t-distribution if the sample mean is normally distributed and the sample variance is distributed as a scaled chi-squared variable.
The one-sample t-test can be used to compare a sample mean with a population mean when the population variance is unknown, as is often the case with experimental statistical errors.
The two-sample t-test can be used to compare two sample means, to see if they could be drawn from distributions with the same population mean and either the same or different variances (e.g. to compare measurements of the same quantity obtained with different experiments).
Caution must be applied when interpreting t-test significances of more than 2 to 3 sigma unless the sample is large or the measurements themselves are known to be normally distributed.
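For example, using scipy's built-in tests (the two samples are made up):

    import numpy as np
    import scipy.stats as sps

    x = np.array([4.9, 5.2, 5.0, 5.4, 4.8, 5.1])  # hypothetical sample 1
    y = np.array([5.3, 5.5, 5.2, 5.6, 5.4, 5.1])  # hypothetical sample 2

    # One-sample t-test: compare the sample mean of x with a population mean of 5.0
    print(sps.ttest_1samp(x, popmean=5.0))

    # Two-sample t-test: equal_var=False gives Welch's test for unequal variances
    print(sps.ttest_ind(x, y, equal_var=False))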
Discrete random variables and their probability distributions
Discrete probability distributions map a sample space of discrete outcomes (categorical or numerical) on to their probabilities.
By mapping the outcomes on to an ordered sequence of integers corresponding to the discrete variates, functional forms for probability distributions (the pmf or probability mass function) can be defined.
Bernoulli trials correspond to a single binary outcome (success/fail) while the number of successes in repeated Bernoulli trials is given by the binomial distribution.
The Poisson distribution can be derived as a limiting case of the binomial distribution and corresponds to the probability of obtaining a certain number of counts in a fixed interval, from a random process with a constant rate.
Counts in fixed histogram bins follow Poisson statistics.
In the limit of large numbers of successes/counts, the binomial and Poisson distributions approach the normal distribution.
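For example (the parameter values are arbitrary):

    import scipy.stats as sps

    # Binomial: probability of 3 successes in n = 10 trials with success probability 0.3
    print(sps.binom.pmf(3, n=10, p=0.3))

    # Poisson: probability of 3 counts in an interval with mean rate mu = 4
    print(sps.poisson.pmf(3, mu=4))

    # Poisson limit of the binomial: large n, small p, with n*p = mu held fixed
    print(sps.binom.pmf(3, n=10000, p=4 / 10000))  # close to the Poisson value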
Probability calculus and conditional probability
A sample space contains all possible mutually exclusive outcomes of an experiment or trial.
Events consist of sets of outcomes which may overlap, leading to conditional dependence of the occurrence of one event on another. The conditional dependence of events can be described graphically using Venn diagrams.
Two events are independent if their probability does not depend on the occurrence (or not) of the other event. Events are mutually exclusive if the probability of one event is zero given that the other event occurs.
The probability of an event A occurring, given that B occurs, is in general not equal to the probability of B occurring, given that A occurs.
Calculations with conditional probabilities can be made using the probability calculus, including the addition rule, multiplication rule and extensions such as the law of total probability.
Multivariate probability distributions can be understood using the mathematics of conditional probability.
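A toy numerical example (the probabilities are made up) showing that P(A|B) and P(B|A) differ in general, and the law of total probability:

    # Made-up joint and marginal probabilities for two events A and B
    p_A_and_B, p_A, p_B = 0.1, 0.3, 0.4

    # Multiplication rule rearranged: P(A|B) = P(A and B) / P(B)
    p_A_given_B = p_A_and_B / p_B  # = 0.25
    p_B_given_A = p_A_and_B / p_A  # = 0.333..., not equal to P(A|B)
    print(p_A_given_B, p_B_given_A)

    # Law of total probability: P(A) = P(A|B)P(B) + P(A|not B)P(not B)
    p_A_given_notB = (p_A - p_A_and_B) / (1 - p_B)
    print(p_A_given_B * p_B + p_A_given_notB * (1 - p_B))  # recovers P(A) = 0.3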
Reading, working with and plotting multivariate data
The Pandas module is an efficient way to work with complex multivariate data, by reading the data into (and writing it out from) a dataframe, which is easier to work with than a numpy structured array.
Pandas functionality can be used to clean dataframes of bad or missing data, while scipy and numpy functions can be applied to columns of the dataframe, in order to modify or transform the data.
Scatter plot matrices and 3-D plots offer powerful ways to plot and explore multi-dimensional data.
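A short sketch of a typical workflow (the file name 'measurements.csv' and its 'flux' column are hypothetical):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv('measurements.csv')   # read the data into a dataframe
    df = df.dropna()                       # drop rows with missing values
    df['log_flux'] = np.log10(df['flux'])  # apply a numpy function to a column

    # Scatter plot matrix of all numeric columns
    pd.plotting.scatter_matrix(df, figsize=(8, 8))
    plt.show()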
Correlation tests and least-squares fitting
The sample covariance between two variables is an unbiased estimator for the population covariance and measures the part of the variance that is produced by linearly related variations in both variables.
Normalising the sample covariance by the sample standard deviations of both variables yields Pearson's correlation coefficient, r.
Spearman’s rho correlation coefficient is based on the correlation in the ranking of variables, not their absolute values, so is more robust to outliers than Pearson’s coefficient.
By assuming that the data are independent (and thus uncorrelated) and identically distributed, significance tests can be carried out on the hypothesis of no correlation, provided the sample is large (n > 500) and/or is normally distributed.
By minimising the squared differences between the data and a linear model, linear regression can be used to obtain the model parameters.
Bootstrapping uses resampling (with replacement) of the data to estimate the standard deviation of any model parameters or other quantities calculated from the data.
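A sketch putting these together on simulated linear data (the model, noise level and seed are arbitrary):

    import numpy as np
    import scipy.stats as sps

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 10, size=200)
    y = 2.0 * x + 1.0 + rng.normal(0, 2.0, size=200)  # hypothetical linear data

    # Correlation coefficients, with p-values for the no-correlation hypothesis
    print(sps.pearsonr(x, y))
    print(sps.spearmanr(x, y))

    # Least-squares linear regression
    result = sps.linregress(x, y)
    print(result.slope, result.intercept)

    # Bootstrap: resample with replacement to estimate the scatter in the slope
    slopes = []
    for _ in range(1000):
        idx = rng.integers(0, x.size, size=x.size)
        slopes.append(sps.linregress(x[idx], y[idx]).slope)
    print(np.std(slopes, ddof=1))  # standard deviation of the fitted slope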
Bayes' Theorem
For conditional probabilities, Bayes' theorem tells us how to invert the conditioning, i.e. how to obtain the probability of A given B from the probability of B given A.
In statistical terminology, the probability distribution of the hypothesis given the data is the posterior and is given by the likelihood multiplied by the prior probability, divided by the evidence.
The likelihood is the probability of obtaining the fixed, observed data as a function of the distribution parameters, in contrast to the pdf, which gives the distribution of the data for fixed parameters.
The prior probability represents our prior belief that the hypothesis is correct, before collecting the data.
The evidence is the total probability of obtaining the data, marginalising over viable hypotheses. It is usually the most difficult quantity to calculate unless simplifying assumptions are made.
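A minimal numerical sketch of Bayes' theorem for two discrete hypotheses (the priors and likelihoods are made up):

    p_H = [0.5, 0.5]          # prior probabilities of hypotheses H1, H2
    p_D_given_H = [0.8, 0.3]  # likelihood of the observed data under each

    # Evidence: total probability of the data, marginalised over the hypotheses
    p_D = sum(p * l for p, l in zip(p_H, p_D_given_H))

    # Posterior = likelihood x prior / evidence
    posterior = [l * p / p_D for p, l in zip(p_H, p_D_given_H)]
    print(posterior)  # [0.727..., 0.272...]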
Maximum likelihood estimation
Fitting models to data