# Goodness of Fit Tests

## Introduction

Goodness of fit (GOF) tests measure the compatibility of a random sample with a theoretical probability distribution function. In other words, these tests show how well the distribution you selected fits your data.

The general procedure consists of defining a test statistic, which is some function of the data measuring the distance between the hypothesis and the data, and then calculating the probability of obtaining data with a still larger value of this test statistic than the value observed, assuming the hypothesis is true. This probability is commonly called the P-value (some references call it the confidence level).

Small probabilities (say, less than one percent) indicate a poor fit. Especially high probabilities (close to one) correspond to a fit which is too good to happen very often, and may indicate a mistake in the way the test was applied.

## Kolmogorov-Smirnov Test

This test is used to decide if a sample comes from a hypothesized continuous distribution. It is based on the empirical cumulative distribution function (ECDF). Assume that we have a random sample $x_1, \ldots, x_n$ from some continuous distribution with CDF $F(x)$. The empirical CDF is denoted by

$$F_n(x) = \frac{1}{n} \cdot [\text{number of observations} \le x]$$

### Definition

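The Kolmogorov-Smirnov statistic (D) is based on the largest vertical difference between $F(x)$ and $F_n(x)$. It is defined as

$$D = \max_{1 \le i \le n} \left( F(x_i) - \frac{i-1}{n},\ \frac{i}{n} - F(x_i) \right)$$

where the observations $x_1 \le \ldots \le x_n$ are sorted in ascending order.

H0: The data follow the specified distribution.
HA: The data do not follow the specified distribution.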

The hypothesis regarding the distributional form is rejected at the chosen significance level (alpha) if the test statistic, D, is greater than the critical value obtained from a table.
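The statistic above can be computed directly from a sorted sample. The following is a minimal sketch, not EasyFit's implementation: the function names `ks_statistic` and `ks_critical_value` are hypothetical, the sample values are made up for illustration, and the critical value uses the well-known large-sample approximation $\sqrt{-\tfrac{1}{2}\ln(\alpha/2)}/\sqrt{n}$ in place of a table, which is an assumption that is only reasonable for moderately large n.

```python
import math
from statistics import NormalDist


def ks_statistic(sample, cdf):
    """Largest vertical distance between the ECDF of `sample` and `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The ECDF jumps from (i-1)/n to i/n at x_i, so check both sides.
        d = max(d, f - (i - 1) / n, i / n - f)
    return d


def ks_critical_value(n, alpha=0.05):
    """Asymptotic approximation (assumption: n is large enough, no table used)."""
    return math.sqrt(-0.5 * math.log(alpha / 2)) / math.sqrt(n)


# Illustrative data only: test a small sample against the standard normal.
sample = [-1.2, -0.7, -0.3, 0.0, 0.2, 0.5, 0.9, 1.4]
d = ks_statistic(sample, NormalDist(0.0, 1.0).cdf)
reject = d > ks_critical_value(len(sample), alpha=0.05)
print(f"D = {d:.4f}, reject H0: {reject}")
```

For small samples, the approximate critical value should be replaced by an exact tabulated one.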

## Anderson-Darling Test

The Anderson-Darling procedure is a general test to compare the fit of an observed cumulative distribution function to an expected cumulative distribution function. This test gives more weight to the tails than the Kolmogorov-Smirnov test.

### Definition

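The Anderson-Darling statistic ($A^2$) is defined as

$$A^2 = -n - \frac{1}{n} \sum_{i=1}^{n} (2i - 1) \left[ \ln F(x_i) + \ln\left(1 - F(x_{n-i+1})\right) \right]$$

where $F$ is the CDF of the hypothesized distribution and the observations $x_1 \le \ldots \le x_n$ are sorted in ascending order.

H0: The data follow the specified distribution.
HA: The data do not follow the specified distribution.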

The hypothesis regarding the distributional form is rejected at the chosen significance level (alpha) if the test statistic, $A^2$, is greater than the critical value obtained from a table.
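As a rough illustration, the Anderson-Darling statistic can be evaluated with the standard formula $A^2 = -n - \frac{1}{n}\sum_{i=1}^{n}(2i-1)\left[\ln F(x_i) + \ln(1 - F(x_{n-i+1}))\right]$ over the sorted sample. The sketch below assumes that formula; the function name `anderson_darling` is hypothetical and the sample is made up. Critical values are not computed here, since they come from a table.

```python
import math
from statistics import NormalDist


def anderson_darling(sample, cdf):
    """Anderson-Darling statistic A^2 of `sample` against the CDF `cdf`.

    Assumes 0 < cdf(x) < 1 for every observation; otherwise log() fails.
    """
    xs = sorted(sample)
    n = len(xs)
    total = 0.0
    for i in range(1, n + 1):
        low = cdf(xs[i - 1])   # F(x_i)
        high = cdf(xs[n - i])  # F(x_{n-i+1}), counted from the top of the sample
        total += (2 * i - 1) * (math.log(low) + math.log(1.0 - high))
    return -n - total / n


# Illustrative data only: test a small sample against the standard normal.
sample = [-1.2, -0.7, -0.3, 0.0, 0.2, 0.5, 0.9, 1.4]
a2 = anderson_darling(sample, NormalDist(0.0, 1.0).cdf)
print(f"A^2 = {a2:.4f}")
```

Because each term involves the logarithm of $F$ near 0 and of $1 - F$ near 1, observations in the tails contribute heavily, which is where the extra tail sensitivity comes from.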

## Chi-Squared Test

The Chi-Squared test is used to determine if a sample comes from a population with a specific distribution. This test is applied to binned data, so the value of the test statistic depends on how the data is binned.

Although there is no optimal choice for the number of bins (k), there are several formulas which can be used to calculate this number based on the sample size (N). For example, EasyFit employs the following empirical formula:

$$k = 1 + \log_2 N$$

The data can be grouped into intervals of equal probability or equal width. The first approach is generally preferable since it handles peaked data much better. Each bin should contain at least 5 data points, so certain adjacent bins sometimes need to be joined together for this condition to be satisfied.
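The binning step might be sketched as follows. This is an illustration under stated assumptions rather than EasyFit's code: it assumes the rule $k = 1 + \log_2 N$ with the result rounded down (the rounding is a guess), builds equal-probability bins from the inverse CDF of the tested distribution, and uses hypothetical helper names (`number_of_bins`, `equal_probability_edges`, `observed_counts`). Merging of sparse bins is left out for brevity.

```python
import bisect
import math
from statistics import NormalDist


def number_of_bins(n):
    """k = 1 + log2(N), rounded down (the rounding is an assumption)."""
    return int(1 + math.log2(n))


def equal_probability_edges(inv_cdf, k):
    """Interior bin edges; the outer bins extend to -inf and +inf."""
    return [inv_cdf(j / k) for j in range(1, k)]


def observed_counts(sample, edges):
    """Count observations per bin; a value on an edge goes to the bin above it."""
    counts = [0] * (len(edges) + 1)
    for x in sample:
        counts[bisect.bisect_right(edges, x)] += 1
    return counts


# Illustrative use: 4 equal-probability bins under the standard normal.
edges = equal_probability_edges(NormalDist(0.0, 1.0).inv_cdf, 4)
print(observed_counts([-1.0, -0.5, 0.1, 2.0], edges))
```

With equal-probability bins, every bin has the same expected frequency N/k, so the "at least 5 per bin" condition reduces to checking that N/k is at least 5.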

### Definition

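The Chi-Squared statistic is defined as

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i},$$

where $O_i$ is the observed frequency for bin i, and $E_i$ is the expected frequency for bin i calculated by

$$E_i = N \left( F(x_2) - F(x_1) \right),$$

where F is the CDF of the probability distribution being tested, $x_1$ and $x_2$ are the limits for bin i, and N is the sample size.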

H0: The data follow the specified distribution.
HA: The data do not follow the specified distribution.

The hypothesis regarding the distributional form is rejected at the chosen significance level ($\alpha$) if the test statistic is greater than the critical value defined as

$$\chi^2_{1-\alpha,\,k-1},$$

meaning the Chi-Squared inverse CDF with k-1 degrees of freedom and a significance level of $\alpha$.
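Putting the pieces together, the test could look roughly like the sketch below. The function names are hypothetical and the observed counts are made up. Expected frequencies are taken as the sample size times the CDF difference across each bin. Since the Python standard library has no Chi-Squared inverse CDF, the critical value here uses the Wilson-Hilferty approximation, which is a substitution for illustration only; an exact quantile from a table or a statistics library is preferable in practice.

```python
import math
from statistics import NormalDist


def chi_squared_statistic(observed, expected):
    """Sum over bins of (O_i - E_i)^2 / E_i."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def expected_counts(cdf, interior_edges, n):
    """E_i = n * (F(upper) - F(lower)); the outer bins are unbounded."""
    cum = [0.0] + [cdf(x) for x in interior_edges] + [1.0]
    return [n * (hi - lo) for lo, hi in zip(cum, cum[1:])]


def chi_squared_critical(alpha, dof):
    """Approximate (1 - alpha) quantile via Wilson-Hilferty (an approximation)."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    c = 2.0 / (9.0 * dof)
    return dof * (1.0 - c + z * math.sqrt(c)) ** 3


# Illustrative data only: counts in 4 equal-probability bins of a standard normal.
normal = NormalDist(0.0, 1.0)
observed = [18, 22, 27, 33]
edges = [normal.inv_cdf(j / 4) for j in (1, 2, 3)]
expected = expected_counts(normal.cdf, edges, sum(observed))
stat = chi_squared_statistic(observed, expected)
crit = chi_squared_critical(0.05, len(observed) - 1)  # k - 1 degrees of freedom
print(f"chi2 = {stat:.3f}, critical = {crit:.3f}, reject H0: {stat > crit}")
```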