Introducing probability calculus and conditional probability
|
Probability can be thought of in terms of hypothetical frequencies of outcomes over an infinite number of trials (Frequentist), or as a belief in the likelihood of a given outcome (Bayesian).
A sample space contains all possible mutually exclusive outcomes of an experiment or trial.
Events consist of sets of outcomes which may overlap, leading to conditional dependence of the occurrence of one event on another. The conditional dependence of events can be described graphically using Venn diagrams.
Two events are independent if the probability of one does not depend on whether or not the other occurs. Events are mutually exclusive if the probability of one event is zero given that the other event occurs.
The probability of an event A occurring, given that B occurs, is in general not equal to the probability of B occurring, given that A occurs.
Calculations with conditional probabilities can be made using the probability calculus, including the addition rule, multiplication rule and extensions such as the law of total probability.
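The main rules can be summarised as follows, for events \(A\) and \(B\) and a set of mutually exclusive, exhaustive events \(C_{i}\) (a standard statement of the rules, with \(P(A\vert B)\) denoting the probability of \(A\) given \(B\)):

\[P(A \cup B) = P(A) + P(B) - P(A \cap B) \quad \mathrm{(addition\ rule)}\]

\[P(A \cap B) = P(A\vert B)P(B) = P(B\vert A)P(A) \quad \mathrm{(multiplication\ rule)}\]

\[P(A) = \sum_{i} P(A\vert C_{i})P(C_{i}) \quad \mathrm{(law\ of\ total\ probability)}\]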
|
Discrete random variables and their probability distributions
|
Discrete probability distributions map a sample space of discrete outcomes (categorical or numerical) on to their probabilities.
By mapping the outcomes on to an ordered sequence of integers (the discrete variates), functional forms for probability distributions (the pmf or probability mass function) can be defined.
Random variables are drawn from probability distributions. The expectation value (arithmetic mean for an infinite number of sampled variates) is equal to the mean of the distribution function (pmf or pdf).
The variance of a random variable is equal to the expectation of the squared variable minus the squared expectation of the variable, i.e. \(V[X] = E[X^{2}] - E[X]^{2}\).
Sums of scaled random variables have expectation values equal to the sum of the scaled expectations of the individual variables and, for independent variables, variances equal to the sum of the individual variances scaled by the squares of the scaling factors.
Bernoulli trials correspond to a single binary outcome (success/fail) while the number of successes in repeated Bernoulli trials is given by the binomial distribution.
The Poisson distribution can be derived as a limiting case of the binomial distribution and corresponds to the probability of obtaining a certain number of counts in a fixed interval, from a random process with a constant rate.
Random variates can be sampled from Scipy probability distributions using the .rvs method.
The probability distribution of numbers of objects for a given bin/classification depends on whether the original sample size was fixed at a pre-determined value or not.
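A minimal sketch of sampling and summarising discrete distributions with scipy's rvs, pmf, mean and var methods (the distribution parameters are arbitrary examples):

```python
import numpy as np
from scipy.stats import binom, poisson

# Binomial: number of successes in n=10 Bernoulli trials with success probability p=0.3
nsucc = binom.rvs(n=10, p=0.3, size=10000)    # sample 10000 random variates
print(nsucc.mean(), binom.mean(n=10, p=0.3))  # sample mean approaches the distribution mean (np=3)

# Poisson: counts in a fixed interval for a constant mean rate mu=3
counts = poisson.rvs(mu=3, size=10000)
print(counts.var(), poisson.var(mu=3))        # sample variance approaches the distribution variance (=mu)

# The pmf gives the probability of each discrete outcome, e.g. P(X=2):
print(binom.pmf(2, n=10, p=0.3), poisson.pmf(2, mu=3))
```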
|
Continuous random variables and their probability distributions
|
Probability distributions show how random variables are distributed. Three common continuous distributions are the uniform, normal and lognormal distributions.
The probability density function (pdf) and cumulative distribution function (cdf) can be accessed for scipy statistical distributions via the pdf and cdf methods.
Probability distributions are defined by common types of parameter such as the location and scale parameters. Some distributions also include shape parameters.
The shape of a distribution can be empirically quantified using its statistical moments, such as the mean, variance, skewness (asymmetry) and kurtosis (strength of the tails).
Quantiles such as percentiles and quartiles give the values of the random variable which correspond to fixed probability intervals (e.g. of 1 per cent and 25 per cent respectively). They can be calculated for a scipy distribution using the ppf or interval methods (or with numpy's percentile function for sampled data).
The percent point function (ppf) (ppf method) is the inverse function of the cdf and shows the value of the random variable corresponding to a given quantile in its distribution.
The means and variances of summed random variables lead to the calculation of the standard error of the mean, i.e. the standard deviation of the sample mean.
Sums of samples of random variates drawn from non-normal distributions with finite mean and variance become asymptotically normally distributed as the sample size increases. The speed at which a normal distribution is approached depends on the shape/symmetry of the variate’s distribution.
Distributions of means (or other types of sum) of non-normal random data are closer to normal in their centres than in the tails of the distribution, so the normal assumption is most reliable for data that are closer to the mean.
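A minimal sketch of these methods for a scipy normal distribution (the location and scale values are arbitrary examples):

```python
from scipy.stats import norm

# Freeze a normal distribution with location (mean) 2 and scale (standard deviation) 0.5
nd = norm(loc=2.0, scale=0.5)

print(nd.pdf(2.0))               # probability density at x=2
print(nd.cdf(2.5))               # probability of obtaining a value <= 2.5
print(nd.ppf(0.25))              # 25 per cent quantile (lower quartile): the inverse of the cdf
print(nd.interval(0.6827))       # central interval enclosing ~68.3 per cent of the probability
print(nd.stats(moments='mvsk'))  # mean, variance, skewness and (excess) kurtosis
```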
|
Joint probability distributions
|
Joint probability distributions show the joint probability of two or more variables occurring with given values.
The univariate pdf of one of the variables can be obtained by marginalising (integrating) the joint pdf over the other variable(s).
If the probability of a variable taking on a value is conditional on the value of the other variable (i.e. the variables are not independent), the joint pdf will appear tilted.
The covariance describes the linear relationship between two variables; when normalised by the product of their standard deviations it gives the correlation coefficient between the variables.
Zero covariance (and correlation coefficient) arises when the two variables are independent, but may also occur when the variables are non-linearly related.
The covariance matrix gives the covariances between different variables as off-diagonal elements, with their variances given along the diagonal of the matrix.
The distributions of sums of multivariate random variates have vectors of means and covariance matrices equal to the sum of the vectors of means and covariance matrices of the individual distributions.
The sums of multivariate random variates also follow the (multivariate) central limit theorem, asymptotically following multivariate normal distributions for sums of large samples.
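A minimal sketch of sampling correlated variates and estimating their covariance and correlation matrices with scipy and numpy (the means and covariance matrix used here are arbitrary examples):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Bivariate normal with means [1, 2]; the off-diagonal covariance 0.8 correlates the variables
means = [1.0, 2.0]
cov = [[1.0, 0.8],
       [0.8, 2.0]]
x = multivariate_normal.rvs(mean=means, cov=cov, size=10000)  # array of shape (10000, 2)

print(np.cov(x, rowvar=False))       # sample covariance matrix: variances on the diagonal
print(np.corrcoef(x, rowvar=False))  # correlation coefficients (covariances normalised by std. devs)
```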
|
Bayes' Theorem
|
For conditional probabilities, Bayes’ theorem tells us how to swap the conditionals around.
In statistical terminology, the probability distribution of the hypothesis given the data is the posterior and is given by the likelihood multiplied by the prior probability, divided by the evidence.
The likelihood is the probability of obtaining the fixed (observed) data as a function of the distribution parameters, in contrast to the pdf, which gives the distribution of the data for fixed parameters.
The prior probability represents our prior belief that the hypothesis is correct, before collecting the data.
The evidence is the total probability of obtaining the data, marginalising over viable hypotheses. For complex data and/or models, it can be the most difficult quantity to calculate unless simplifying assumptions are made.
The choice of prior can determine how much data is required to converge on the value of a parameter (i.e. to produce a narrow posterior probability distribution for that parameter).
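In symbols, for a hypothesis \(H\) and data \(D\), Bayes’ theorem reads:

\[P(H\vert D) = \frac{P(D\vert H)P(H)}{P(D)}, \qquad P(D) = \sum_{i} P(D\vert H_{i})P(H_{i})\]

where \(P(H\vert D)\) is the posterior, \(P(D\vert H)\) the likelihood, \(P(H)\) the prior, and the evidence \(P(D)\) is obtained by marginalising over the viable hypotheses (an integral for continuous parameters).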
|
Working with and plotting large multivariate data sets
|
The Pandas module provides an efficient way to work with complex multivariate data, by reading the data into (and writing it out from) a dataframe, which is easier to work with than a numpy structured array.
Pandas functionality can be used to clean dataframes of bad or missing data, while scipy and numpy functions can be applied to columns of the dataframe, in order to modify or transform the data.
Scatter plot matrices and 3-D plots offer powerful ways to plot and explore multi-dimensional data.
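A minimal sketch of this workflow, assuming a hypothetical file data.csv with numerical columns x and flux:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')          # read the data into a dataframe
df = df.dropna()                      # clean rows with missing values
df['logflux'] = np.log10(df['flux'])  # apply a numpy function to transform a column

pd.plotting.scatter_matrix(df)        # scatter plot matrix to explore the multivariate data
plt.show()
```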
|
Introducing significance tests and comparing means
|
Significance testing is used to determine whether a given (null) hypothesis is rejected by the data, by calculating a test statistic and comparing it with the distribution expected for it, under the assumption that the null hypothesis is true.
A null hypothesis is rejected if the measured p-value of the test statistic is equal to or less than a pre-defined significance level.
For comparing measurements with what is expected from a given (population) mean value and variance, a Z-statistic can be calculated, which should be distributed as a standard normal provided that the sample mean is normally distributed.
When the population variance is not known, a t-statistic can be defined from the sample mean and its standard error, which is distributed following a t-distribution, if the sample mean is normally distributed and sample variance is distributed as a scaled chi-squared distribution.
The one-sample t-test can be used to compare a sample mean with a population mean when the population variance is unknown, as is often the case with experimental statistical errors.
The two-sample t-test can be used to compare two sample means, to see if they could be drawn from distributions with the same population mean.
Caution must be applied when interpreting t-test significances of more than several sigma unless the sample is large or the measurements themselves are known to be normally distributed.
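A minimal sketch of the one- and two-sample t-tests with scipy, using simulated data as an example:

```python
from scipy.stats import norm, ttest_1samp, ttest_ind

# Simulated measurements with population variance unknown to the test
sample1 = norm.rvs(loc=5.0, scale=2.0, size=30)
sample2 = norm.rvs(loc=6.0, scale=2.0, size=40)

# One-sample t-test: is sample1 consistent with a population mean of 5?
tstat, pval = ttest_1samp(sample1, popmean=5.0)
print(tstat, pval)

# Two-sample t-test: could the two samples be drawn from distributions with the same population mean?
tstat, pval = ttest_ind(sample1, sample2)
print(tstat, pval)
```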
|
Multivariate data - correlation tests and least-squares fitting
|
The sample covariance between two variables is an unbiased estimator for the population covariance and measures the part of the variance that is produced by linearly related variations in both variables.
Normalising the sample covariance by the product of the sample standard deviations of both variables yields Pearson’s correlation coefficient, r.
Spearman’s rho correlation coefficient is based on the correlation in the ranking of variables, not their absolute values, so is more robust to outliers than Pearson’s coefficient.
By assuming that the data are independent (and thus uncorrelated) and identically distributed, significance tests can be carried out on the hypothesis of no correlation, provided the sample is large (\(n>500\)) and/or is normally distributed.
By minimising the squared differences between the data and a linear model, linear regression can be used to obtain the model parameters.
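A minimal sketch of these correlation tests and a least-squares linear fit with scipy, again using simulated data as an example:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, linregress

rng = np.random.default_rng()
x = rng.normal(size=200)
y = 2.0*x + 1.0 + rng.normal(scale=0.5, size=200)  # linearly related data with scatter

r, p_r = pearsonr(x, y)       # Pearson's r and p-value for the hypothesis of no correlation
rho, p_rho = spearmanr(x, y)  # Spearman's rho (rank-based, more robust to outliers)
result = linregress(x, y)     # least-squares linear regression
print(r, rho, result.slope, result.intercept)
```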
|
Confidence intervals, errors and bootstrapping
|
Confidence intervals and upper limits on model parameters can be calculated by integrating the posterior probability distribution so that the probability within the interval or limit bounds matches the desired significance of the interval/limit.
While upper limit integrals are taken from the lowest value of the distribution upwards, confidence intervals on asymmetric distributions are usually centred on the median (\(P=0.5\)), so that the required total probability is enclosed with equal probabilities excluded on either side of the interval.
If confidence intervals (or equivalently, error bars) are required for some function of a random variable, they can be calculated using the transformation-of-variables method, based on the fact that a transformed range of the variable contains the same probability as the corresponding range of the original posterior pdf.
A less accurate approach for obtaining errors on functions of random variables is propagation of errors, which estimates transformed error bars. However, this method implicitly assumes zero covariance between the errors of the combined variables, and that second-order and higher derivatives of the new variable with respect to the original variables are negligible, i.e. the function cannot be highly non-linear.
Bootstrapping (resampling a data set with replacement, many times) offers a simple but effective way to calculate relatively low-significance confidence intervals (e.g. 1- to 2-sigma) for tens to hundreds of data values and complex transformations or calculations with the data. Higher significances require significantly larger data sets and numbers of bootstrap realisations to compute.
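A minimal sketch of a basic bootstrap of the confidence interval on a sample mean (the lognormal example data and the number of realisations are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng()
data = rng.lognormal(size=100)  # example data set (any 1-d data array could be used)

nboot = 10000
means = np.empty(nboot)
for i in range(nboot):
    resample = rng.choice(data, size=len(data), replace=True)  # resample with replacement
    means[i] = resample.mean()  # recalculate the statistic for each bootstrap realisation

# ~1-sigma (68.3 per cent) confidence interval from the percentiles of the bootstrap distribution
print(np.percentile(means, [15.85, 84.15]))
```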
|
Maximum likelihood estimation and weighted least-squares model fitting
|
Given a set of data and a model with free parameters, the best unbiased estimators of the model parameters correspond to the maximum likelihood and are called Maximum Likelihood Estimators (MLEs).
In the case of normally distributed data, maximising the log-likelihood is formally equivalent to minimising the weighted least squares statistic (also known as the chi-squared statistic).
MLEs can be obtained by maximising the (log-)likelihood or minimising the weighted least squares statistic (chi-squared minimisation).
The Python package lmfit can be used to fit data efficiently, and its leastsq minimisation method is optimised to carry out weighted least-squares fitting of models to data.
The errors on MLEs can be estimated from the diagonal elements of the covariance matrix obtained from the fit, if the fitting method returns it. These errors are reported directly by lmfit, e.g. in the output of its fit_report method.
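A minimal sketch of a weighted least-squares fit with lmfit's Model interface, using a simple linear model and simulated data as an example:

```python
import numpy as np
from lmfit import Model

def linear(x, m, c):
    return m*x + c

# Simulated data with known errors (weights = 1/error gives weighted least squares)
rng = np.random.default_rng()
x = np.linspace(0, 10, 20)
yerr = np.full(len(x), 0.5)
y = linear(x, 2.0, 1.0) + rng.normal(scale=yerr)

model = Model(linear)
params = model.make_params(m=1.0, c=0.0)              # starting values for the free parameters
result = model.fit(y, params, x=x, weights=1.0/yerr)  # leastsq is the default method

print(result.fit_report())  # includes the MLEs and their estimated standard errors
```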
|
Confidence intervals on MLEs and fitting binned Poisson event data
|
For normally distributed MLEs, confidence intervals and regions can be calculated by finding the parameter values on either side of the MLE where the weighted least squares statistic increases (or the log-likelihood decreases) by a fixed amount. That amount is determined by the required confidence level, using the chi-squared distribution (multiplied by 0.5 for the log-likelihood) with degrees of freedom equal to the dimensionality of the confidence region (usually 1 or 2).
Confidence regions may be found using brute force grid search, although this is not efficient for joint confidence regions with multiple dimensions, in which case Markov Chain Monte Carlo fitting should be considered.
Univariate data are typically binned into histograms (e.g. count distributions) and the models used to fit these data should be binned in the same way.
If count distributions are binned to at least 20 counts/bin the errors remain close to normally distributed, so that weighted least squares methods may be used to fit the data and a goodness of fit obtained in the usual way. Binned data with fewer counts/bin should be fitted using minimisation of negative log-likelihood. The same approach can be used for other types of data which are not normally distributed about the ‘true’ values.
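A minimal sketch of a brute-force grid search for a single-parameter confidence interval, using a constant model fitted to simulated normal data as an example:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng()
y = rng.normal(loc=5.0, scale=2.0, size=50)  # example data with error 2.0 on each point
yerr = 2.0

# Weighted least squares (chi-squared) for a constant model with single parameter mu
def chisq(mu):
    return np.sum(((y - mu)/yerr)**2)

# Brute-force grid search over the single parameter
mu_grid = np.linspace(3.0, 7.0, 2001)
chi_grid = np.array([chisq(mu) for mu in mu_grid])

# 68.3 per cent interval for 1 interesting parameter: delta-chi-squared = chi2.ppf(0.683, df=1) ~ 1
delta = chi2.ppf(0.683, df=1)
inside = mu_grid[chi_grid <= chi_grid.min() + delta]
print(inside.min(), inside.max())  # lower and upper bounds of the confidence interval
```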
|
Likelihood ratio: model comparison and confidence intervals
|
A choice between two hypotheses can be informed by the likelihood ratio, the ratio of posterior pdfs expressed as a function of the possible values of data, e.g. a test statistic.
Statistical significance is the chance that a given pre-specified value (or more extreme value) for a test statistic would be observed if the null hypothesis is true. Set as a significance level, it represents the chance of a false positive, where we would reject a true null hypothesis in favour of a false alternative.
When a significance level has been pre-specified, the chance of a false negative (\(\beta\)) is the chance that something less extreme than the pre-specified test-statistic value would be observed if the alternative hypothesis is true, i.e. the chance of rejecting a true alternative and accepting a false null hypothesis. The statistical power of the test is \(1-\beta\), the chance of correctly rejecting the null hypothesis when the alternative is true.
The Neyman-Pearson Lemma together with Wilks’ theorem show how the log-likelihood ratio between an alternative hypothesis and nested (i.e. with more parameter constraints) null hypothesis allows the statistical power of the comparison to be maximised for any given significance level.
Provided that the MLEs being considered in the alternative (fewer constraints) model are normally distributed, we can use the delta-log-likelihood or delta-chi-squared to compare the alternative with the more constrained null model.
The above approach can be used to calculate confidence intervals or upper/lower limits on parameters, determine whether additional model components are required and test which (if any) parameters are significantly different between multiple datasets.
For testing significance of narrow additive model components such as emission or absorption lines, only the line normalisation can be considered a nested parameter provided it is allowed to vary without constraints in the best fit. The significance should therefore be corrected using e.g. the Bonferroni correction, to account for the energy range searched over for the feature.
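A minimal sketch of applying Wilks’ theorem to compare an alternative model with a nested null model; the minimised negative log-likelihood values and the number of extra parameters here are placeholders for the results of real fits:

```python
from scipy.stats import chi2

# Placeholder values: minimised -log(likelihood) for the null (more constrained) and
# alternative models, and the number of additional free parameters in the alternative
minus_loglike_null = 105.2
minus_loglike_alt = 98.7
extra_pars = 1

# Wilks' theorem: twice the log-likelihood ratio is asymptotically chi-squared distributed,
# with degrees of freedom equal to the number of additional (nested) free parameters
lrt = 2.0*(minus_loglike_null - minus_loglike_alt)
pvalue = chi2.sf(lrt, df=extra_pars)
print(lrt, pvalue)  # a small p-value suggests the additional model component/parameter is required
```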
|