Introducing probability calculus and conditional probability
|
Probability can be thought of in terms of hypothetical frequencies of outcomes over an infinite number of trials (Frequentist), or as a belief in the likelihood of a given outcome (Bayesian).
A sample space contains all possible mutually exclusive outcomes of an experiment or trial.
Events consist of sets of outcomes which may overlap, leading to conditional dependence of the occurrence of one event on another. The conditional dependence of events can be described graphically using Venn diagrams.
Two events are independent if the probability of one does not depend on whether or not the other occurs. Events are mutually exclusive if the probability of one event is zero given that the other event occurs.
The probability of an event A occurring, given that B occurs, is in general not equal to the probability of B occurring, given that A occurs.
Calculations with conditional probabilities can be made using the probability calculus, including the addition rule, multiplication rule and extensions such as the law of total probability.
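The main rules can be summarised as follows, for events \(A\) and \(B\) and a set of mutually exclusive, exhaustive events \(C_{i}\) (a standard statement of the rules, with \(P(A\vert B)\) denoting the probability of \(A\) given \(B\)):

\[P(A \cup B) = P(A) + P(B) - P(A \cap B) \quad \mathrm{(addition\ rule)}\]

\[P(A \cap B) = P(A\vert B)P(B) = P(B\vert A)P(A) \quad \mathrm{(multiplication\ rule)}\]

\[P(A) = \sum_{i} P(A\vert C_{i})P(C_{i}) \quad \mathrm{(law\ of\ total\ probability)}\]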
|
Discrete random variables and their probability distributions
|
Discrete probability distributions map a sample space of discrete outcomes (categorical or numerical) on to their probabilities.
By mapping the outcomes on to an ordered sequence of integers (the discrete variates), functional forms for probability distributions (the pmf or probability mass function) can be defined.
Random variables are drawn from probability distributions. The expectation value (arithmetic mean for an infinite number of sampled variates) is equal to the mean of the distribution function (pmf or pdf).
The variance of a random variable is equal to the expectation of the squared variable minus the squared expectation of the variable, i.e. \(V[X] = E[X^{2}] - E[X]^{2}\).
Sums of scaled random variables have expectation values equal to the sum of the scaled expectations of the individual variables and, for independent variables, variances equal to the sum of the individual variances scaled by the squares of the scaling factors.
Bernoulli trials correspond to a single binary outcome (success/fail) while the number of successes in repeated Bernoulli trials is given by the binomial distribution.
The Poisson distribution can be derived as a limiting case of the binomial distribution and corresponds to the probability of obtaining a certain number of counts in a fixed interval, from a random process with a constant rate.
Random variates can be sampled from Scipy probability distributions using the .rvs method.
The probability distribution of numbers of objects for a given bin/classification depends on whether the original sample size was fixed at a pre-determined value or not.
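A minimal sketch of sampling and summarising discrete distributions with scipy's rvs, pmf, mean and var methods (the distribution parameters are arbitrary examples):

```python
import numpy as np
from scipy.stats import binom, poisson

# Binomial: number of successes in n=10 Bernoulli trials with success probability p=0.3
nsucc = binom.rvs(n=10, p=0.3, size=10000)    # sample 10000 random variates
print(nsucc.mean(), binom.mean(n=10, p=0.3))  # sample mean approaches the distribution mean (np=3)

# Poisson: counts in a fixed interval for a constant mean rate mu=3
counts = poisson.rvs(mu=3, size=10000)
print(counts.var(), poisson.var(mu=3))        # sample variance approaches the distribution variance (=mu)

# The pmf gives the probability of each discrete outcome, e.g. P(X=2):
print(binom.pmf(2, n=10, p=0.3), poisson.pmf(2, mu=3))
```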
|
Continuous random variables and their probability distributions
|
Probability distributions show how random variables are distributed. Three common continuous distributions are the uniform, normal and lognormal distributions.
The probability density function (pdf) and cumulative distribution function (cdf) can be accessed for scipy statistical distributions via the pdf and cdf methods.
Probability distributions are defined by common types of parameter such as the location and scale parameters. Some distributions also include shape parameters.
The shape of a distribution can be empirically quantified using its statistical moments, such as the mean, variance, skewness (asymmetry) and kurtosis (strength of the tails).
Quantiles such as percentiles and quartiles give the values of the random variable which correspond to fixed probability intervals (e.g. of 1 per cent and 25 per cent respectively). They can be calculated for a scipy distribution using the ppf or interval methods (or with numpy's percentile function for sampled data).
The percent point function (ppf) (ppf method) is the inverse function of the cdf and shows the value of the random variable corresponding to a given quantile in its distribution.
The means and variances of summed random variables lead to the calculation of the standard error of the mean, i.e. the standard deviation of the sample mean.
Sums of samples of random variates drawn from non-normal distributions with finite mean and variance become asymptotically normally distributed as the sample size increases. The speed at which a normal distribution is approached depends on the shape/symmetry of the variate’s distribution.
Distributions of means (or other types of sum) of non-normal random data are closer to normal in their centres than in the tails of the distribution, so the normal assumption is most reliable for data that are closer to the mean.
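A minimal sketch of these methods for a scipy normal distribution (the location and scale values are arbitrary examples):

```python
from scipy.stats import norm

# Freeze a normal distribution with location (mean) 2 and scale (standard deviation) 0.5
nd = norm(loc=2.0, scale=0.5)

print(nd.pdf(2.0))               # probability density at x=2
print(nd.cdf(2.5))               # probability of obtaining a value <= 2.5
print(nd.ppf(0.25))              # 25 per cent quantile (lower quartile): the inverse of the cdf
print(nd.interval(0.6827))       # central interval enclosing ~68.3 per cent of the probability
print(nd.stats(moments='mvsk'))  # mean, variance, skewness and (excess) kurtosis
```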
|
Joint probability distributions
|
Joint probability distributions show the joint probability of two or more variables occurring with given values.
The univariate pdf of one of the variables can be obtained by marginalising (integrating) the joint pdf over the other variable(s).
If the probability of a variable taking on a value is conditional on the value of the other variable (i.e. the variables are not independent), the joint pdf will appear tilted.
The covariance describes the linear relationship between two variables; when normalised by the product of their standard deviations it gives the correlation coefficient between the variables.
Zero covariance (and correlation coefficient) arises when the two variables are independent, but may also occur when the variables are non-linearly related.
The covariance matrix gives the covariances between different variables as off-diagonal elements, with their variances given along the diagonal of the matrix.
The distributions of sums of multivariate random variates have vectors of means and covariance matrices equal to the sum of the vectors of means and covariance matrices of the individual distributions.
The sums of multivariate random variates also follow the (multivariate) central limit theorem, asymptotically following multivariate normal distributions for sums of large samples.
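A minimal sketch of sampling correlated variates and estimating their covariance and correlation matrices with scipy and numpy (the means and covariance matrix used here are arbitrary examples):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Bivariate normal with means [1, 2]; the off-diagonal covariance 0.8 correlates the variables
means = [1.0, 2.0]
cov = [[1.0, 0.8],
       [0.8, 2.0]]
x = multivariate_normal.rvs(mean=means, cov=cov, size=10000)  # array of shape (10000, 2)

print(np.cov(x, rowvar=False))       # sample covariance matrix: variances on the diagonal
print(np.corrcoef(x, rowvar=False))  # correlation coefficients (covariances normalised by std. devs)
```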
|
Bayes' Theorem
|
For conditional probabilities, Bayes’ theorem tells us how to swap the conditionals around.
In statistical terminology, the probability distribution of the hypothesis given the data is the posterior and is given by the likelihood multiplied by the prior probability, divided by the evidence.
The likelihood is the probability of obtaining the fixed (observed) data as a function of the distribution parameters, in contrast to the pdf, which gives the distribution of the data for fixed parameters.
The prior probability represents our prior belief that the hypothesis is correct, before collecting the data.
The evidence is the total probability of obtaining the data, marginalising over viable hypotheses. For complex data and/or models, it can be the most difficult quantity to calculate unless simplifying assumptions are made.
The choice of prior can determine how much data is required to converge on the value of a parameter (i.e. to produce a narrow posterior probability distribution for that parameter).
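In symbols, for a hypothesis \(H\) and data \(D\), Bayes’ theorem reads:

\[P(H\vert D) = \frac{P(D\vert H)P(H)}{P(D)}, \qquad P(D) = \sum_{i} P(D\vert H_{i})P(H_{i})\]

where \(P(H\vert D)\) is the posterior, \(P(D\vert H)\) the likelihood, \(P(H)\) the prior, and the evidence \(P(D)\) is obtained by marginalising over the viable hypotheses (an integral for continuous parameters).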
|
Working with and plotting large multivariate data sets
|
The Pandas module provides an efficient way to work with complex multivariate data, by reading the data into (and writing it out from) a dataframe, which is easier to work with than a numpy structured array.
Pandas functionality can be used to clean dataframes of bad or missing data, while scipy and numpy functions can be applied to columns of the dataframe, in order to modify or transform the data.
Scatter plot matrices and 3-D plots offer powerful ways to plot and explore multi-dimensional data.
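A minimal sketch of this workflow, assuming a hypothetical file data.csv with numerical columns x and flux:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')          # read the data into a dataframe
df = df.dropna()                      # clean rows with missing values
df['logflux'] = np.log10(df['flux'])  # apply a numpy function to transform a column

pd.plotting.scatter_matrix(df)        # scatter plot matrix to explore the multivariate data
plt.show()
```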
|
Introducing significance tests and comparing means
|
Significance testing is used to determine whether a given (null) hypothesis is rejected by the data, by calculating a test statistic and comparing it with the distribution expected for it, under the assumption that the null hypothesis is true.
A null hypothesis is rejected if the measured p-value of the test statistic is equal to or less than a pre-defined significance level.
For comparing measurements with what is expected from a given (population) mean value and variance, a Z-statistic can be calculated, which should be distributed as a standard normal provided that the sample mean is normally distributed.
When the population variance is not known, a t-statistic can be defined from the sample mean and its standard error, which is distributed following a t-distribution, if the sample mean is normally distributed and sample variance is distributed as a scaled chi-squared distribution.
The one-sample t-test can be used to compare a sample mean with a population mean when the population variance is unknown, as is often the case with experimental statistical errors.
The two-sample t-test can be used to compare two sample means, to see if they could be drawn from distributions with the same population mean.
Caution must be applied when interpreting t-test significances of more than several sigma unless the sample is large or the measurements themselves are known to be normally distributed.
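A minimal sketch of the one- and two-sample t-tests with scipy, using simulated data as an example:

```python
from scipy.stats import norm, ttest_1samp, ttest_ind

# Simulated measurements with population variance unknown to the test
sample1 = norm.rvs(loc=5.0, scale=2.0, size=30)
sample2 = norm.rvs(loc=6.0, scale=2.0, size=40)

# One-sample t-test: is sample1 consistent with a population mean of 5?
tstat, pval = ttest_1samp(sample1, popmean=5.0)
print(tstat, pval)

# Two-sample t-test: could the two samples be drawn from distributions with the same population mean?
tstat, pval = ttest_ind(sample1, sample2)
print(tstat, pval)
```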
|
Multivariate data - correlation tests and least-squares fitting
|
The sample covariance between two variables is an unbiased estimator for the population covariance and measures the part of the variance that is produced by linearly related variations in both variables.
Normalising the sample covariance by the product of the sample standard deviations of both variables yields Pearson’s correlation coefficient, r.
Spearman’s rho correlation coefficient is based on the correlation in the ranking of variables, not their absolute values, so is more robust to outliers than Pearson’s coefficient.
By assuming that the data are independent (and thus uncorrelated) and identically distributed, significance tests can be carried out on the hypothesis of no correlation, provided the sample is large (\(n>500\)) and/or is normally distributed.
By minimising the squared differences between the data and a linear model, linear regression can be used to obtain the model parameters.
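A minimal sketch of these correlation tests and a least-squares linear fit with scipy, again using simulated data as an example:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, linregress

rng = np.random.default_rng()
x = rng.normal(size=200)
y = 2.0*x + 1.0 + rng.normal(scale=0.5, size=200)  # linearly related data with scatter

r, p_r = pearsonr(x, y)       # Pearson's r and p-value for the hypothesis of no correlation
rho, p_rho = spearmanr(x, y)  # Spearman's rho (rank-based, more robust to outliers)
result = linregress(x, y)     # least-squares linear regression
print(r, rho, result.slope, result.intercept)
```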
|
Confidence intervals, errors and bootstrapping
|
Confidence intervals and upper limits on model parameters can be calculated by integrating the posterior probability distribution so that the probability within the interval or limit bounds matches the desired significance of the interval/limit.
While upper limit integrals are taken from the lowest value of the distribution upwards, confidence intervals on asymmetric distributions are usually centred on the median (\(P=0.5\)), so that the required total probability is enclosed with equal probabilities excluded on either side of the interval.
If confidence intervals (or equivalently, error bars) are required for some function of a random variable, they can be calculated using the transformation-of-variables method, based on the fact that a transformed range of the variable contains the same probability as the corresponding range of the original posterior pdf.
A less accurate approach for obtaining errors on functions of random variables is propagation of errors, which estimates transformed error bars. However, this method implicitly assumes zero covariance between the errors of the combined variables, and that second-order and higher derivatives of the new variable with respect to the original variables are negligible, i.e. the function cannot be highly non-linear.
Bootstrapping (resampling a data set with replacement, many times) offers a simple but effective way to calculate relatively low-significance confidence intervals (e.g. 1- to 2-sigma) for tens to hundreds of data values and complex transformations or calculations with the data. Higher significances require significantly larger data sets and numbers of bootstrap realisations to compute.
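A minimal sketch of a basic bootstrap of the confidence interval on a sample mean (the lognormal example data and the number of realisations are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng()
data = rng.lognormal(size=100)  # example data set (any 1-d data array could be used)

nboot = 10000
means = np.empty(nboot)
for i in range(nboot):
    resample = rng.choice(data, size=len(data), replace=True)  # resample with replacement
    means[i] = resample.mean()  # recalculate the statistic for each bootstrap realisation

# ~1-sigma (68.3 per cent) confidence interval from the percentiles of the bootstrap distribution
print(np.percentile(means, [15.85, 84.15]))
```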
|
Maximum likelihood estimation and weighted least-squares model fitting
|
Given a set of data and a model with free parameters, the best unbiased estimators of the model parameters correspond to the maximum likelihood and are called Maximum Likelihood Estimators (MLEs).
In the case of normally distributed data, maximising the log-likelihood is formally equivalent to minimising the weighted least squares statistic (also known as the chi-squared statistic).
MLEs can be obtained by maximising the (log-)likelihood or minimising the weighted least squares statistic (chi-squared minimisation).
The Python package lmfit can be used to fit data efficiently, and its leastsq minimisation method is optimised to carry out weighted least-squares fitting of models to data.
The errors on MLEs can be estimated from the diagonal elements of the covariance matrix obtained from the fit, if the fitting method returns it. These errors are reported directly by lmfit, e.g. in the output of its fit_report method.
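A minimal sketch of a weighted least-squares fit with lmfit's Model interface, using a simple linear model and simulated data as an example:

```python
import numpy as np
from lmfit import Model

def linear(x, m, c):
    return m*x + c

# Simulated data with known errors (weights = 1/error gives weighted least squares)
rng = np.random.default_rng()
x = np.linspace(0, 10, 20)
yerr = np.full(len(x), 0.5)
y = linear(x, 2.0, 1.0) + rng.normal(scale=yerr)

model = Model(linear)
params = model.make_params(m=1.0, c=0.0)              # starting values for the free parameters
result = model.fit(y, params, x=x, weights=1.0/yerr)  # leastsq is the default method

print(result.fit_report())  # includes the MLEs and their estimated standard errors
```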
|
Confidence intervals on MLEs and fitting binned Poisson event data
|
For normally distributed MLEs, confidence intervals and regions can be calculated by finding the parameter values on either side of the MLE where the weighted least squares statistic increases (or the log-likelihood decreases) by a fixed amount. That amount is determined by the required confidence level, using the chi-squared distribution (multiplied by 0.5 for the log-likelihood) with degrees of freedom equal to the dimensionality of the confidence region (usually 1 or 2).
Confidence regions may be found using brute force grid search, although this is not efficient for joint confidence regions with multiple dimensions, in which case Markov Chain Monte Carlo fitting should be considered.
Univariate data are typically binned into histograms (e.g. count distributions) and the models used to fit these data should be binned in the same way.
If count distributions are binned to at least 20 counts/bin the errors remain close to normally distributed, so that weighted least squares methods may be used to fit the data and a goodness of fit obtained in the usual way. Binned data with fewer counts/bin should be fitted using minimisation of negative log-likelihood. The same approach can be used for other types of data which are not normally distributed about the ‘true’ values.
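A minimal sketch of a brute-force grid search for a single-parameter confidence interval, using a constant model fitted to simulated normal data as an example:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng()
y = rng.normal(loc=5.0, scale=2.0, size=50)  # example data with error 2.0 on each point
yerr = 2.0

# Weighted least squares (chi-squared) for a constant model with single parameter mu
def chisq(mu):
    return np.sum(((y - mu)/yerr)**2)

# Brute-force grid search over the single parameter
mu_grid = np.linspace(3.0, 7.0, 2001)
chi_grid = np.array([chisq(mu) for mu in mu_grid])

# 68.3 per cent interval for 1 interesting parameter: delta-chi-squared = chi2.ppf(0.683, df=1) ~ 1
delta = chi2.ppf(0.683, df=1)
inside = mu_grid[chi_grid <= chi_grid.min() + delta]
print(inside.min(), inside.max())  # lower and upper bounds of the confidence interval
```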
|
Likelihood ratio: model comparison and confidence intervals
|
A choice between two hypotheses can be informed by the likelihood ratio, the ratio of posterior pdfs expressed as a function of the possible values of data, e.g. a test statistic.
Statistical significance is the chance that a given pre-specified value (or more extreme value) for a test statistic would be observed if the null hypothesis is true. Set as a significance level, it represents the chance of a false positive, where we would reject a true null hypothesis in favour of a false alternative.
When a significance level has been pre-specified, the chance of a false negative (\(\beta\)) is the chance that something less extreme than the pre-specified test-statistic value would be observed if the alternative hypothesis is true, i.e. the chance of rejecting a true alternative and accepting a false null hypothesis. The statistical power of the test is \(1-\beta\), the chance of correctly rejecting the null hypothesis when the alternative is true.
The Neyman-Pearson Lemma together with Wilks’ theorem show how the log-likelihood ratio between an alternative hypothesis and nested (i.e. with more parameter constraints) null hypothesis allows the statistical power of the comparison to be maximised for any given significance level.
Provided that the MLEs being considered in the alternative (fewer constraints) model are normally distributed, we can use the delta-log-likelihood or delta-chi-squared to compare the alternative with the more constrained null model.
The above approach can be used to calculate confidence intervals or upper/lower limits on parameters, determine whether additional model components are required and test which (if any) parameters are significantly different between multiple datasets.
For testing significance of narrow additive model components such as emission or absorption lines, only the line normalisation can be considered a nested parameter provided it is allowed to vary without constraints in the best fit. The significance should therefore be corrected using e.g. the Bonferroni correction, to account for the energy range searched over for the feature.
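A minimal sketch of applying Wilks’ theorem to compare an alternative model with a nested null model; the minimised negative log-likelihood values and the number of extra parameters here are placeholders for the results of real fits:

```python
from scipy.stats import chi2

# Placeholder values: minimised -log(likelihood) for the null (more constrained) and
# alternative models, and the number of additional free parameters in the alternative
minus_loglike_null = 105.2
minus_loglike_alt = 98.7
extra_pars = 1

# Wilks' theorem: twice the log-likelihood ratio is asymptotically chi-squared distributed,
# with degrees of freedom equal to the number of additional (nested) free parameters
lrt = 2.0*(minus_loglike_null - minus_loglike_alt)
pvalue = chi2.sf(lrt, df=extra_pars)
print(lrt, pvalue)  # a small p-value suggests the additional model component/parameter is required
```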
|