# Distribution Fitting - Analysis & Applications

## The Distribution Fitting Process

### Manual Distribution Fitting

### Automated Distribution Fitting

## Selecting The Best Fitting Distribution

### Distribution Graphs

### Goodness of Fit Tests

## Applying The Selected Distribution

### Typical Applications

### Specific Applications

This article highlights important aspects of fitting probability distributions to data and applying the analysis results to make informed decisions.

Before you proceed to analyzing your data, make sure you have taken the steps covered in the article Distribution Fitting - Preliminary Steps.

Once you have selected the candidate distributions that are likely to provide a good fit (see the article above), you are ready to fit these distributions to your data.

Fitting a distribution means using statistical methods to estimate the distribution parameters from the sample data. This is where distribution fitting software can be very useful: it implements the parameter estimation methods for the most commonly used distributions, so you can save time and focus on the data analysis itself. If you are fitting several different distributions, which is usually the case, you need to estimate the parameters of each distribution separately.

The input of distribution fitting software usually includes:

- Your data in one of the accepted formats
- Distributions you want to fit
- Distribution fitting options

The distribution fitting results include the following elements:

- Graphs of your input data
- Parameters of the fitted distributions
- Graphs of the fitted distributions
- Additional graphs and tables helping you select the best fitting distribution

The process of fitting probability distributions to data is usually computationally intensive, and it is not feasible to perform this task using manual methods. However, sometimes you might already know the underlying distribution.

For example, if you are analyzing the distribution of customer service times, you might want to narrow your choice to the Exponential distribution, which is frequently used for this kind of analysis. You would only need to estimate the distribution parameters from the sample data, which can be easily done using distribution fitting software.
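As a minimal sketch of what "estimating the parameters" means in the Exponential case: the maximum likelihood estimate of the rate is simply one divided by the sample mean. The function name `fit_exponential` and the simulated service times are assumptions made for illustration; real software would read your own data.

```python
import random
import statistics


def fit_exponential(sample):
    """Return the maximum likelihood estimate of the Exponential rate."""
    if not sample or any(x < 0 for x in sample):
        raise ValueError("sample must be non-empty and non-negative")
    # For the Exponential distribution the MLE of the rate is 1 / sample mean.
    return 1.0 / statistics.fmean(sample)


# Hypothetical service times in minutes, simulated for illustration only.
random.seed(0)
service_times = [random.expovariate(0.5) for _ in range(1000)]

rate = fit_exponential(service_times)
print(f"estimated rate: {rate:.3f} per minute")
print(f"estimated mean service time: {1 / rate:.2f} minutes")
```

Other distributions generally need more involved estimation methods, which is one reason dedicated software is convenient.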

In addition, you might already know not just the distribution model, but also the approximate values of the parameters of this model, based on the nature of your data. In this case, the goal of your analysis might be to verify whether your assumption regarding the probability distribution is correct.

One of the benefits of using distribution fitting software for probability data analysis is the ability to automatically fit a large number of distributions to your data in a batch. This is the preferred mode of operation if you have little or no information about the underlying probability distribution you are trying to determine.
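The batch idea can be sketched in a few lines: each candidate distribution gets its own parameter estimates from the same sample. The three candidates and the helper name `fit_candidates` are assumptions chosen because their maximum likelihood estimates have simple closed forms; actual software typically covers many more distributions.

```python
import random
import statistics


def fit_candidates(sample):
    """Estimate the parameters of each candidate distribution separately."""
    mean = statistics.fmean(sample)
    return {
        "Exponential": {"rate": 1.0 / mean},
        "Normal": {"mu": mean, "sigma": statistics.pstdev(sample)},
        "Uniform": {"low": min(sample), "high": max(sample)},
    }


# Simulated data, for illustration only.
random.seed(1)
data = [random.expovariate(0.5) for _ in range(500)]

for name, params in fit_candidates(data).items():
    print(name, {key: round(value, 3) for key, value in params.items()})
```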

After the distributions are fitted, you can compare them and select the best fitting model. There are a number of statistical methods and tools available which can help you perform this task. These tools are usually implemented in distribution fitting software in the form of various graphs and tables displayed along with the estimated distribution parameters.

The distribution graphs enable you to:

- *Visually* assess the goodness of fit of a certain distribution
- Compare several fitted models

Some of the graphs display both your input data (e.g. the histogram) and fitted distributions at the same time:

- Probability Density Function Graph
- Cumulative Distribution Function Graph

The following graphs display the fitted distributions only:

- P-P Plot
- Q-Q Plot
- Probability Difference Graph

Each graph has its own meaning and interpretation. Typically, distribution fitting software will display these graphs for one or several fitted distributions, depending on your choice. In manual fitting mode, the graphs update automatically while you modify the distribution parameters, making the process of fitting more interactive.
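To make the P-P and Q-Q plots more concrete, here is a rough sketch of how their point coordinates could be computed for one fitted distribution. The helper `pp_qq_points`, the plotting positions (i - 0.5) / n, and the fitted Exponential rate are all assumptions; conventions vary between tools. Points lying close to the diagonal suggest a good fit.

```python
import math


def pp_qq_points(sample, cdf, inverse_cdf):
    """Return (P-P points, Q-Q points) for a sample and a fitted distribution."""
    ordered = sorted(sample)
    n = len(ordered)
    # Plotting positions (i - 0.5) / n are one common convention, assumed here.
    positions = [(i - 0.5) / n for i in range(1, n + 1)]
    # P-P: empirical probability against fitted probability.
    pp = [(p, cdf(x)) for p, x in zip(positions, ordered)]
    # Q-Q: fitted quantile against observed value.
    qq = [(inverse_cdf(p), x) for p, x in zip(positions, ordered)]
    return pp, qq


rate = 0.5  # hypothetical fitted Exponential rate


def fitted_cdf(x):
    return 1.0 - math.exp(-rate * x)


def fitted_icdf(p):
    return -math.log(1.0 - p) / rate


pp, qq = pp_qq_points([0.3, 1.1, 1.9, 2.4, 5.0], fitted_cdf, fitted_icdf)
for point in pp:
    print(f"P-P point: ({point[0]:.2f}, {point[1]:.2f})")
```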

As the name suggests, the goodness of fit tests can be used to determine whether
a certain distribution is a good fit. Calculating the goodness of fit
statistics also enables you to *order the fitted distributions* according to
how well they fit your data. This feature is very helpful for comparing
the fitted models.

The most commonly used goodness of fit tests are Kolmogorov-Smirnov, Anderson-Darling, and Chi-Squared. From the user's point of view, the logic of applying these tests is the same; they differ in how they are performed (implemented). The Kolmogorov-Smirnov test can be considered the most widely used goodness of fit test.
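The ordering of fitted distributions can be illustrated with the Kolmogorov-Smirnov statistic, which is the largest vertical distance between the empirical CDF of the data and the fitted CDF. The sketch below computes only the statistic, not a p-value, and the two candidates, the helper name `ks_statistic`, and the simulated data are assumptions for illustration.

```python
import math
import random
import statistics


def ks_statistic(sample, cdf):
    """Largest distance between the empirical CDF and a fitted CDF."""
    ordered = sorted(sample)
    n = len(ordered)
    distance = 0.0
    for i, x in enumerate(ordered, start=1):
        fitted = cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x.
        distance = max(distance, i / n - fitted, fitted - (i - 1) / n)
    return distance


# Simulated data, for illustration only.
random.seed(2)
data = [random.expovariate(0.5) for _ in range(500)]

mean = statistics.fmean(data)
normal = statistics.NormalDist(mean, statistics.pstdev(data))
candidates = {
    "Exponential": lambda x: 1.0 - math.exp(-x / mean),
    "Normal": normal.cdf,
}

# A smaller statistic means a closer fit, so sort in ascending order.
ranking = sorted(candidates, key=lambda name: ks_statistic(data, candidates[name]))
for name in ranking:
    print(name, round(ks_statistic(data, candidates[name]), 4))
```

Keep in mind that when the parameters are estimated from the same data being tested, the standard Kolmogorov-Smirnov critical values no longer apply exactly, so the statistic is best read here as a ranking aid.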

The ultimate goal of your analysis is to obtain the information which will help you make
informed decisions under uncertainty. The information you need can be derived using the
best fitting distribution, which is the *model* of the real-world random process you
are dealing with.

Some of the typical applications of probability distributions include:

- Calculating probabilities
- Making estimates
- Calculating statistics

The calculations can be done using the corresponding functions of the distribution you have selected, including the Cumulative Distribution Function (CDF), Inverse CDF, Hazard Function etc.

**Calculating probabilities** is one of the most popular applications: in a typical data
analysis, you would define a good (desired) outcome, and calculate the probability of that
outcome. If the probability is high enough, then the decision that will result in
the desired outcome is worth making. On the other hand, if the probability is too low,
then you should make the opposite decision.

For example, if you are analyzing the distribution of the customer service time, the outcomes might look like:

- Good outcome: A customer can be served in 5 minutes or less
- Bad outcome: It takes more than 5 minutes to serve a customer

The corresponding decisions are:

- Decision A: Do not hire extra staff
- Decision B: Hire additional staff

The probabilities can be easily calculated using the Cumulative Distribution Function (CDF) of the selected distribution. For instance, CDF(5) is the probability of the good outcome. If this value falls below a fixed target level (e.g. 95%), you might consider hiring additional staff to reduce the service time and improve the customer experience. A probability of 90% would mean that 10% of your customers have to wait longer and might be unhappy with the customer service your company offers.
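As a sketch of this decision rule, assume the selected model is an Exponential distribution with a fitted mean service time of 2 minutes. Both the model and the mean are hypothetical values chosen for illustration, so the resulting probability differs from the 90% mentioned above.

```python
import math


def exponential_cdf(x, mean):
    """P(X <= x) for an Exponential distribution with the given mean."""
    return 1.0 - math.exp(-x / mean) if x > 0 else 0.0


mean_service_time = 2.0  # minutes; hypothetical fitted value
target = 0.95            # required share of customers served within 5 minutes

p_good = exponential_cdf(5.0, mean_service_time)
if p_good >= target:
    decision = "Do not hire extra staff"
else:
    decision = "Hire additional staff"

print(f"P(service time <= 5 min) = {p_good:.3f}: {decision}")
```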

Sometimes you might want to define more than two outcomes:

- Outcome A: A customer can be served in 5 minutes or less
- Outcome B: A customer can be served in 5 to 6 minutes
- Outcome C: It takes more than 6 minutes to serve a customer

The probabilities can be calculated in a similar way, and might look like:

- Probability(Outcome A) = 90%
- Probability(Outcome B) = 7%
- Probability(Outcome C) = 3%

In this case, your decision might be not to hire additional staff, because only 3% of your customers are served in more than 6 minutes.
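With more than two outcomes, each probability is the difference between the CDF values at the neighboring cut points. The sketch below assumes a hypothetical Exponential model with a mean of 2 minutes, so its numbers will not match the illustrative percentages above; `outcome_probabilities` is a made-up helper name.

```python
import math


def outcome_probabilities(cdf, cut_points):
    """Split the range at the cut points and return each interval's probability."""
    levels = [0.0] + [cdf(c) for c in sorted(cut_points)] + [1.0]
    return [upper - lower for lower, upper in zip(levels, levels[1:])]


mean_service_time = 2.0  # minutes; hypothetical fitted Exponential mean


def fitted_cdf(x):
    return 1.0 - math.exp(-x / mean_service_time)


# Outcomes: up to 5 minutes, 5 to 6 minutes, more than 6 minutes.
p_a, p_b, p_c = outcome_probabilities(fitted_cdf, [5.0, 6.0])
print(f"A: {p_a:.1%}  B: {p_b:.1%}  C: {p_c:.1%}")
```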

**Making estimates** is the inverse problem: you specify a fixed probability value
and find the corresponding outcome. For example, you would like to estimate the time within
which 95% of the customers are served. To make the estimate, you can use the Inverse Cumulative
Distribution Function (ICDF) of the distribution you have selected: ICDF(0.95) = 5.5 minutes.
The interpretation is that even though only 90% of the customers are served in 5 minutes or less
(see the example above), another 5% are served within an extra 0.5 minutes (30 seconds), which is quite acceptable.
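For an Exponential model the inverse CDF has a closed form, so the estimate can be sketched directly. The mean of 2 minutes is again a hypothetical fitted value, which is why the result is not the 5.5 minutes used in the example above.

```python
import math


def exponential_icdf(p, mean):
    """Return x such that P(X <= x) = p for an Exponential distribution."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    return -mean * math.log(1.0 - p)


mean_service_time = 2.0  # minutes; hypothetical fitted value

t95 = exponential_icdf(0.95, mean_service_time)
print(f"95% of customers are served within {t95:.2f} minutes")
```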

**Calculating statistics** can be useful to take a quick look at your data (note that it
is not correct to base your decisions on the statistics alone). The most useful statistics include:

- Mean (the average value)
- Mode (the most likely value)

For example, you might find out that a customer is most likely to be served in 2 minutes, but there are many customers who require more time, so the average service time is 3 minutes.
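A right-skewed distribution produces exactly this gap between the mode and the mean. As an illustration only, assume the fitted model is a Gamma distribution with shape 3 and scale 1 minute (hypothetical parameters): its mean is shape × scale = 3 minutes and its mode is (shape - 1) × scale = 2 minutes.

```python
def gamma_mean(shape, scale):
    """Mean of a Gamma distribution."""
    return shape * scale


def gamma_mode(shape, scale):
    """Mode of a Gamma distribution; it sits at 0 when shape <= 1."""
    return (shape - 1.0) * scale if shape > 1.0 else 0.0


shape, scale = 3.0, 1.0  # hypothetical fitted parameters, scale in minutes

print(f"mean service time: {gamma_mean(shape, scale)} minutes")
print(f"most likely service time: {gamma_mode(shape, scale)} minutes")
```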

Even though probability distributions can be applied in any industry dealing with random data, there are additional applications arising in specific industries (actuarial science, finance, reliability engineering, hydrology etc.), enabling business analysts, engineers and scientists to make informed decisions under uncertainty.