This article highlights important aspects of fitting probability distributions to data and applying the analysis results to make informed decisions.
Before you proceed to analyzing your data, make sure you have taken the steps covered in the article Distribution Fitting - Preliminary Steps.
Once you have selected the candidate distributions that can reasonably be expected to provide a good fit (see the article above), you are ready to actually fit these distributions to your data.
The process of fitting distributions involves statistical methods that estimate the distribution parameters from the sample data. This is where distribution fitting software can be very useful: it implements the parameter estimation methods for the most commonly used distributions, saving you time and letting you focus on the data analysis itself. If you are fitting several different distributions, which is usually the case, you need to estimate the parameters of each distribution separately.
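As a minimal sketch of what such software does under the hood, the parameters of each candidate can be estimated with SciPy (used here as a stand-in for dedicated distribution fitting software; the sample values are hypothetical service times in minutes):

```python
# Sketch: estimating distribution parameters from sample data with SciPy.
# The sample values are hypothetical, for illustration only.
from scipy import stats

sample = [1.2, 0.4, 2.8, 1.1, 0.9, 3.5, 0.7, 1.6, 2.2, 0.5]

# Each candidate distribution is fitted separately; fit() returns the
# maximum-likelihood estimates of that distribution's parameters.
exp_loc, exp_scale = stats.expon.fit(sample)
gamma_a, gamma_loc, gamma_scale = stats.gamma.fit(sample)

print(f"Exponential: loc={exp_loc:.3f}, scale={exp_scale:.3f}")
print(f"Gamma: shape={gamma_a:.3f}, loc={gamma_loc:.3f}, scale={gamma_scale:.3f}")
```

Note that each candidate is fitted in its own call; fitting several distributions means repeating the estimation once per model.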
The input of distribution fitting software usually includes:
The distribution fitting results include the following elements:
The process of fitting probability distributions to data is usually computationally intensive, and it is not feasible to perform this task using manual methods. However, sometimes you might already know the underlying distribution.
For example, if you are analyzing the distribution of customer service times, you might want to narrow your choice to the Exponential distribution, which is quite frequently used for this kind of analysis. You would only need to estimate the distribution parameters from the sample data, which can be easily done using distribution fitting software.
In addition, you might already know not just the distribution model, but also the approximate values of the parameters of this model, based on the nature of your data. In this case, the goal of your analysis might be to verify whether your assumption regarding the probability distribution is correct.
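One way to verify such an assumption is a one-sample goodness-of-fit test against the fully specified distribution. A sketch using SciPy's Kolmogorov-Smirnov test, where both the data and the assumed Exponential mean of 3 minutes are illustrative:

```python
# Sketch: checking an assumed model with a one-sample Kolmogorov-Smirnov test.
# We hypothesize that service times are Exponential with a mean of 3 minutes
# (loc=0, scale=3); the data and the parameter are illustrative assumptions.
from scipy import stats

service_times = [2.1, 0.8, 4.5, 1.3, 3.2, 0.6, 5.9, 2.7, 1.9, 3.8]

result = stats.kstest(service_times, "expon", args=(0, 3))

# A large p-value means the data give no reason to reject the assumed model.
if result.pvalue > 0.05:
    print("No evidence against the Exponential(mean=3) assumption")
else:
    print("The assumed distribution is a poor fit")
```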
One of the benefits of using distribution fitting software for probability data analysis is the ability to automatically fit a large number of distributions to your data in a batch. This is the preferred mode of operation if you have little or no information about the underlying probability distribution you are trying to determine.
After the distributions are fitted, you can compare them and select the best-fitting model. A number of statistical methods and tools are available to help you perform this task; distribution fitting software usually implements them in the form of various graphs and tables displayed along with the estimated distribution parameters.
The distribution graphs enable you to:
Some of the graphs display both your input data (e.g. the histogram) and fitted distributions at the same time:
The following graphs display the fitted distributions only:
Each graph has its own meaning and interpretation. Typically, distribution fitting software will display these graphs for one or several fitted distributions, depending on your choice. In manual fitting mode, the graphs update automatically while you modify the distribution parameters, making the process of fitting more interactive.
As the name suggests, goodness-of-fit tests can be used to determine whether a particular distribution is a good fit. Calculating the goodness-of-fit statistics also enables you to rank the fitted distributions according to how well they fit your data, which is very helpful for comparing the fitted models.
The most commonly used goodness-of-fit tests are Kolmogorov-Smirnov, Anderson-Darling, and Chi-Squared. From the user's viewpoint, the logic of applying these tests is the same; they differ in how they are implemented. The Kolmogorov-Smirnov test can be considered the most widely used goodness-of-fit test.
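As a sketch of this ranking step, using SciPy with illustrative data and an assumed candidate list (a smaller Kolmogorov-Smirnov statistic indicates a better fit):

```python
# Sketch: ranking candidate distributions by the Kolmogorov-Smirnov statistic.
# The data and the candidate list are illustrative assumptions. Note that
# p-values computed from parameters estimated on the same data would be
# optimistic, so only the test statistic is used here for ranking.
from scipy import stats

data = [1.4, 2.2, 0.9, 3.1, 1.8, 2.6, 0.7, 4.0, 1.2, 2.9]

ranking = []
for name in ["expon", "gamma", "lognorm"]:
    dist = getattr(stats, name)
    params = dist.fit(data, floc=0)          # fit each candidate separately
    ks = stats.kstest(data, name, args=params)
    ranking.append((name, ks.statistic))

ranking.sort(key=lambda item: item[1])       # best fit first
for name, statistic in ranking:
    print(f"{name}: KS statistic = {statistic:.4f}")
```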
The ultimate goal of your analysis is to obtain the information which will help you make informed decisions under uncertainty. The information you need can be derived using the best fitting distribution, which is the model of the real-world random process you are dealing with.
Some of the typical applications of probability distributions include:
The calculations can be done using the corresponding functions of the distribution you have selected, including the Cumulative Distribution Function (CDF), the Inverse CDF, the Hazard Function, etc.
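As a sketch of how these functions are typically evaluated, here they are computed in SciPy for an Exponential distribution with an illustrative mean of 3 minutes (the choice of model and parameter is an assumption for the example):

```python
# Sketch: the main distribution functions, evaluated for an illustrative
# Exponential distribution with mean 3 (scale=3).
from scipy import stats

dist = stats.expon(scale=3)

p = dist.cdf(5)                 # CDF: probability a value is <= 5
q = dist.ppf(0.95)              # Inverse CDF: value not exceeded with probability 0.95
h = dist.pdf(2) / dist.sf(2)    # Hazard function: pdf(x) / (1 - CDF(x))

print(f"CDF(5)     = {p:.4f}")
print(f"ICDF(0.95) = {q:.4f}")
print(f"hazard(2)  = {h:.4f}")
```

For the Exponential distribution the hazard function is constant (1/scale), which is one reason it is a common model for service and failure times.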
Calculating probabilities is one of the most popular applications: in a typical data analysis, you would define a good (desired) outcome, and calculate the probability of that outcome. If the probability is high enough, then the decision that will result in the desired outcome is worth making. On the other hand, if the probability is too low, then you should make the opposite decision.
For example, if you are analyzing the distribution of the customer service time, the outcomes might look like:
The corresponding decisions are:
The probabilities can be easily calculated using the Cumulative Distribution Function (CDF) of the selected distribution, for instance, CDF(5) represents the probability of the good outcome. If this value is less than a certain fixed level (e.g. < 95%), you might consider hiring additional staff to reduce the service time and improve the customer experience. The probability of 90% would mean that 10% of your customers have to wait longer and might be unhappy with the customer service your company offers.
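A minimal sketch of this calculation, assuming an Exponential service-time model with its scale chosen so that CDF(5) comes out at the 90% figure quoted above:

```python
# Sketch: the CDF-based decision rule from the text. The Exponential model
# is an assumption; its scale is chosen so that CDF(5) equals 0.90.
import math
from scipy import stats

dist = stats.expon(scale=5 / math.log(10))   # makes CDF(5) = 0.90

p_good = dist.cdf(5)    # probability a customer is served within 5 minutes
threshold = 0.95        # required service level

if p_good < threshold:
    print(f"Only {p_good:.0%} served within 5 minutes - consider hiring staff")
```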
Sometimes you might want to define more than two outcomes:
The probabilities can be calculated in a similar way, and might look like:
In this case, you might decide not to hire additional staff, because only 3% of your customers are served in more than 6 minutes.
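The multi-outcome case can be sketched as differences of CDF values. The Exponential parameter below is illustrative, so the printed percentages need not match the figures quoted in the text:

```python
# Sketch: probabilities for more than two outcomes, computed as differences
# of CDF values. The Exponential scale is an illustrative assumption.
from scipy import stats

dist = stats.expon(scale=2.0)

p_fast = dist.cdf(5)                  # served in 5 minutes or less
p_medium = dist.cdf(6) - dist.cdf(5)  # served in 5 to 6 minutes
p_slow = dist.sf(6)                   # served in more than 6 minutes

print(f"<= 5 min: {p_fast:.1%}, 5-6 min: {p_medium:.1%}, > 6 min: {p_slow:.1%}")
```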
Making estimates is an inverse problem requiring you to specify a fixed probability value. For example, you would like to estimate how long it takes to serve 95% of the customers. To make the estimate, you can use the Inverse Cumulative Distribution Function (ICDF) of the distribution you have selected: ICDF(0.95)=5.5 minutes. The interpretation is that even though only 90% of the customers are served in under 5 minutes (see the example above), another 5% are served within just 0.5 minutes (30 seconds) more, which is quite acceptable.
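A sketch of the inverse calculation, again assuming an Exponential model, with its scale chosen so that ICDF(0.95) reproduces the 5.5-minute figure above:

```python
# Sketch: estimating a quantile with the Inverse CDF (ppf in SciPy). The
# Exponential model is an assumption; its scale is chosen so that
# ICDF(0.95) equals 5.5 minutes, matching the example in the text.
import math
from scipy import stats

dist = stats.expon(scale=5.5 / math.log(20))

t95 = dist.ppf(0.95)    # time within which 95% of customers are served
print(f"95% of customers are served within {t95:.1f} minutes")
```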
Calculating statistics can be useful for taking a quick look at your data (note that it is not correct to base your decisions on statistics alone). The most useful statistics include:
For example, you might find out that a customer is most likely to be served in 2 minutes, but there are many customers who require more time, so the average service time is 3 minutes.
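A right-skewed model behaves exactly this way: its mode lies below its mean. A sketch using an illustrative Lognormal distribution, with parameters that are assumptions chosen only to give numbers close to the example:

```python
# Sketch: mean vs. mode for a right-skewed model. The Lognormal parameters
# are illustrative assumptions, not values fitted to real data.
import math
from scipy import stats

sigma, scale = 0.5, 2.5
dist = stats.lognorm(sigma, scale=scale)

mean = dist.mean()                     # average service time
mode = scale * math.exp(-sigma ** 2)   # mode of a Lognormal: scale * exp(-sigma^2)

print(f"mode = {mode:.2f} min, mean = {mean:.2f} min")
```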
Even though probability distributions can be applied in any industry dealing with random data, there are additional applications arising in specific industries (actuarial science, finance, reliability engineering, hydrology etc.), enabling business analysts, engineers and scientists to make informed decisions under uncertainty.