This article covers the most commonly used graph types and explains how they can be used to fit probability distributions to your data.
Along with goodness-of-fit tests, distribution graphs can be very helpful in determining the best-fitting model. The fundamental difference is that the graphical approach is quite subjective: while goodness-of-fit tests are "exact" in the sense that the results do not depend on the researcher (provided that the tests are performed correctly), reading graphs is a more empirical way to analyze your data.
Why not just use the goodness-of-fit tests? There are several commonly used tests, each of which tells you whether a particular distribution is a good fit. However, these tests differ in how they are performed, and they sometimes disagree with one another. For example, you might find that the Weibull distribution is the best fit according to the Kolmogorov-Smirnov test, while the Anderson-Darling test suggests that it is not. That is when the graphs come in handy.
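There are at least five useful graph types: the probability density function graph (histogram), the cumulative distribution function graph, the P-P plot, the Q-Q plot, and the probability difference graph.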
These graphs are usually created based on your sample data and one or several fitted distributions. Displaying several distributions at the same time allows you to visually compare the models and see how they differ. EasyFit allows you to select multiple fitted models by holding down the Ctrl key and clicking the distribution names. To switch between the different graph types, you can use the View|Graph Type menu option, or just click the toolbar icons.
The graphs can be zoomed, panned, exported to a variety of formats, or printed. In addition, you can customize the graphs to fit the style of your presentations or documents. These options are available from the popup menu (right-click the graph).
The histogram graphically shows various properties of your data, including the location, scale, and shape, helping you visually identify an underlying probability distribution.
The histogram depends on how the data is sorted into bins (classes). Although there is no generally accepted way to select the number of bins, it can be chosen using empirical formulas based on the sample size (the total number of observations). EasyFit uses the popular Sturges' formula:
k = 1 + log2(N)
where N is the sample size, and k is the resulting number of bins. Alternatively, you can manually specify this number in the Graph Options window (right-click the graph and select Options).
The height of each bar is calculated as the number of data points falling into that class, divided by the total number of observations. This kind of histogram is also called the relative histogram, since the bar heights represent the proportion of the data in each class (see also How To Create And Use Histograms).
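The bin count and bar heights described above can be sketched in a few lines of Python. This is only an illustration, not EasyFit's implementation: it assumes Sturges' formula in its usual form, k = 1 + log2(N), rounded up to a whole number, and equal-width bins spanning the data range. The function names and the simulated sample are hypothetical.

```python
import math
import random

def sturges_bins(n):
    """Number of bins from Sturges' formula, k = 1 + log2(N), rounded up (assumed rounding)."""
    return math.ceil(1 + math.log2(n))

def relative_histogram(data, k):
    """Return (bin_edges, heights); each height is count / N for k equal-width bins."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / k
    counts = [0] * k
    for x in data:
        # The maximum value is placed in the last bin rather than in a (k+1)-th one.
        i = min(int((x - lo) / width), k - 1)
        counts[i] += 1
    edges = [lo + i * width for i in range(k + 1)]
    heights = [c / len(data) for c in counts]
    return edges, heights

random.seed(0)
sample = [random.gauss(10.0, 2.0) for _ in range(200)]  # simulated data, for illustration
k = sturges_bins(len(sample))
edges, heights = relative_histogram(sample, k)
for left, right, h in zip(edges, edges[1:], heights):
    print(f"[{left:6.2f}, {right:6.2f})  {h:.3f}")
```

Because every height is a proportion of the sample, the bar heights always sum to 1, which is what makes this a relative histogram.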
The Cumulative Distribution Function graph displays the theoretical CDF of the fitted distributions and the empirical CDF based on your sample data. While the PDF graph mainly shows the shape of your data, the CDF graph is more useful for judging how well the distributions fit the data.
The empirical CDF graph also depends on the number of bins chosen. As you increase this number, the ECDF curve gets smoother.
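To make the comparison concrete, here is a minimal sketch that evaluates an empirical CDF and a fitted CDF at the same points. It rests on several assumptions: the ECDF is the unbinned form that steps by 1/N at each observation (the limiting case of a very large number of bins), the fitted model is a normal distribution estimated from the sample, and the data are simulated.

```python
import random
from statistics import NormalDist

random.seed(1)
sample = sorted(random.gauss(50.0, 5.0) for _ in range(100))  # simulated data
n = len(sample)

# Fit a normal model by estimating its mean and standard deviation from the data.
fitted = NormalDist.from_samples(sample)

# Unbinned ECDF: the i-th smallest value (1-based) gets cumulative probability i / n.
ecdf = [(i + 1) / n for i in range(n)]
theoretical = [fitted.cdf(x) for x in sample]

# Print every 20th point to compare the two curves side by side.
for x, e, t in list(zip(sample, ecdf, theoretical))[::20]:
    print(f"x={x:7.2f}  ECDF={e:.3f}  fitted CDF={t:.3f}")
```

The closer the two columns of probabilities stay across the whole range of x, the better the fitted curve tracks the data.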
The probability-probability (P-P) plot is a graph of the empirical CDF values plotted against the theoretical (fitted) CDF values. It is used to determine how well a specific distribution fits the observed data. The P-P plot will be approximately linear if the specified theoretical distribution is the correct model. EasyFit displays the diagonal line along which the graph points should fall.
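The points of a P-P plot can be computed as in the following sketch. It is an illustration under stated assumptions rather than EasyFit's method: the empirical probabilities use the plotting position (i - 0.5)/N for the i-th smallest value, which is one common convention among several, and the fitted model is a normal distribution estimated from simulated data. The helper name pp_points is hypothetical.

```python
import random
from statistics import NormalDist

def pp_points(data, dist):
    """Return (theoretical CDF, empirical CDF) pairs for a P-P plot."""
    xs = sorted(data)
    n = len(xs)
    # With 0-based rank i, (i + 0.5) / n equals the (rank - 0.5) / n plotting
    # position for 1-based ranks (an assumed convention).
    return [(dist.cdf(x), (i + 0.5) / n) for i, x in enumerate(xs)]

random.seed(2)
sample = [random.gauss(0.0, 1.0) for _ in range(300)]  # simulated data
points = pp_points(sample, NormalDist.from_samples(sample))

# Points on the diagonal have equal coordinates, so the vertical distance
# from the diagonal is just the difference between the two probabilities.
max_dev = max(abs(e - t) for t, e in points)
print(f"largest vertical distance from the diagonal: {max_dev:.3f}")
```

Both coordinates are probabilities, so the whole plot lives in the unit square regardless of the units of the data.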
The quantile-quantile (Q-Q) plot is a graph of the input data values plotted against the quantiles (inverse CDF values) of the fitted distribution. Both axes of this graph are in units of the input data set.
The interpretation of the Q-Q plot is similar to the P-P plot: if the distribution you are testing is the correct model, the graph points will lie on an approximately straight line.
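A Q-Q plot can be sketched the same way, replacing the CDF with the inverse CDF. As before, the plotting position (i - 0.5)/N, the normal model, the simulated sample, and the helper name qq_points are assumptions made for illustration only.

```python
import random
from statistics import NormalDist

def qq_points(data, dist):
    """Return (fitted quantile, observed value) pairs for a Q-Q plot."""
    xs = sorted(data)
    n = len(xs)
    # Probability (i + 0.5) / n for the 0-based rank i is an assumed plotting
    # position; it stays strictly inside (0, 1), as inv_cdf requires.
    return [(dist.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(xs)]

random.seed(3)
sample = [random.gauss(20.0, 4.0) for _ in range(250)]  # simulated data
points = qq_points(sample, NormalDist.from_samples(sample))

# Both coordinates are in the units of the input data.
for q, x in points[::50]:
    print(f"fitted quantile={q:7.2f}  observed={x:7.2f}")
```

Unlike the P-P plot, the Q-Q plot stretches out the tails, so departures from the model at extreme values tend to be easier to see.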
The probability difference graph is a plot of the difference between the empirical cumulative distribution function and the fitted CDF.
While the P-P and Q-Q plots show you whether the fit is good, they don't provide any quantitative information on how good it is. The probability difference graph is much closer to the classical goodness-of-fit tests in this respect: in fact, the Kolmogorov-Smirnov test is based on measuring the difference of probabilities. The smaller the absolute value of this difference, the better the fit.
If the maximum absolute difference is less than 0.05 (or 5%), the fit can be considered good. For very good fits, this value will be less than 1%.
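The probability differences and their maximum absolute value can be computed as in this sketch. It is a simplified illustration under assumptions: the ECDF is evaluated only at the observed values using i/N (the full Kolmogorov-Smirnov statistic also checks the value just before each step), the fitted model is a normal distribution, and the sample is simulated. The helper name probability_differences is hypothetical.

```python
import random
from statistics import NormalDist

def probability_differences(data, dist):
    """Return ECDF(x) - fitted CDF(x) at each sorted observation x."""
    xs = sorted(data)
    n = len(xs)
    # The i-th smallest value (1-based) has empirical cumulative probability i / n.
    return [(i + 1) / n - dist.cdf(x) for i, x in enumerate(xs)]

random.seed(4)
sample = [random.gauss(100.0, 15.0) for _ in range(500)]  # simulated data
diffs = probability_differences(sample, NormalDist.from_samples(sample))

max_abs = max(abs(d) for d in diffs)
print(f"max |ECDF - CDF| = {max_abs:.4f}")
# Compare against the 0.05 guideline quoted in the text.
print("below 0.05" if max_abs < 0.05 else "0.05 or above")
```

Plotting diffs against the sorted data values gives the probability difference graph itself; max_abs summarizes it in a single number.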
A variety of graphs can be helpful for visualizing your data, comparing multiple fitted distributions, and selecting the best model. These graphs should be viewed as an addition to, rather than a replacement for, the goodness-of-fit tests. EasyFit makes it easy to combine the two approaches: it orders the fitted distributions according to the GOF statistics, and displays the graphs for one or several models you select.
See also: EasyFit Help on distribution graphs