Distribution Fitting - Preliminary Steps

This article covers several steps you should consider taking before you analyze your probability data and apply the analysis results.

Step 1 - Define The Goals Of Your Analysis

The very first step is to define what you are trying to achieve by analyzing your data. You should have a clear understanding of your goals as this will help you throughout the entire data analysis process. Try answering the following questions:

What kind of information would you like to obtain?
How will you obtain the information you need?
How will you apply that information?

Example:

Robert is the head of the Customer Support Department at a large company. In order to reduce the customer service times and improve the customer experience, he would like to do the following:

- Determine the probability that a customer can be served in 5 minutes or less. To solve this problem, Robert needs to:
  - Perform distribution fitting to sample data (customer service times) for a selected period of time (e.g. last week)
  - Select the best fitting distribution
  - Calculate the probability using the cumulative distribution function of the selected distribution
- If the probability is less than 95%, consider hiring additional customer support staff
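
Robert's workflow can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the service times are simulated rather than real, only two candidate distributions (Exponential and Lognormal) are fitted by maximum likelihood, and the "best" fit is chosen by the Akaike information criterion (AIC), which is one common selection rule and not necessarily the one your distribution fitting software uses. All function and variable names are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean, pstdev


def fit_exponential(times):
    """Maximum-likelihood fit of an Exponential distribution (rate = 1 / mean)."""
    rate = 1.0 / fmean(times)
    loglik = sum(math.log(rate) - rate * t for t in times)
    return {"name": "exponential", "params": 1, "loglik": loglik,
            "cdf": lambda t: 1.0 - math.exp(-rate * t)}


def fit_lognormal(times):
    """Maximum-likelihood fit of a Lognormal distribution (Normal fit to the logs)."""
    logs = [math.log(t) for t in times]
    normal = NormalDist(fmean(logs), pstdev(logs))
    # Lognormal density: f(t) = normal.pdf(ln t) / t, so log f = log pdf - ln t.
    loglik = sum(math.log(normal.pdf(x)) - x for x in logs)
    return {"name": "lognormal", "params": 2, "loglik": loglik,
            "cdf": lambda t: normal.cdf(math.log(t))}


def aic(fit):
    """Akaike information criterion: lower is better."""
    return 2 * fit["params"] - 2 * fit["loglik"]


# Simulated service times in minutes (an assumption; use last week's real data instead).
rng = random.Random(0)
service_times = [rng.lognormvariate(1.0, 0.5) for _ in range(200)]

# Fit the candidates, select the best one, and evaluate its CDF at 5 minutes.
fits = [fit_exponential(service_times), fit_lognormal(service_times)]
best = min(fits, key=aic)
p_within_5 = best["cdf"](5.0)
needs_more_staff = p_within_5 < 0.95

print(f"Best fit: {best['name']}, P(service time <= 5 min) = {p_within_5:.3f}")
print("Consider hiring additional staff" if needs_more_staff else "Target met")
```

The decision rule in the last lines mirrors the 95% threshold from the example; the threshold itself is a business choice, not a statistical one.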

Step 2 - Prepare Data For Distribution Fitting

Preparing your data for distribution fitting is one of the most important steps you should take, since the analysis results (and thus the decisions you make) depend on whether you correctly collect and specify the input data.

Data Format

Your data might come in one of the generally accepted formats, depending on the source of data and how it was collected. You need to make sure the distribution fitting software you are using supports the data format you need, and if it doesn't, you might need to convert your data to one of the supported formats.

The most commonly used format in probability data analysis is an unordered set of values obtained by observing some random process. The order of values in a data set is not important and does not affect the distribution fitting results. This is one of the fundamental differences between distribution fitting (and probability data analysis in general) and time series analysis, where each data value is tied to the point in time at which it was observed.
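
The claim that order does not matter is easy to check. The minimal sketch below, which assumes simulated data and a simple Normal fit by maximum likelihood (sample mean and population standard deviation), fits the same values before and after shuffling and compares the estimated parameters. The names are hypothetical.

```python
import math
import random
from statistics import fmean, pstdev


def fit_normal(data):
    """Maximum-likelihood estimates of a Normal distribution: (mean, std dev)."""
    return fmean(data), pstdev(data)


# Simulated observations (an assumption made for illustration only).
rng = random.Random(42)
observations = [rng.gauss(10.0, 2.0) for _ in range(100)]

# Same values, different order.
shuffled = observations[:]
rng.shuffle(shuffled)

original_fit = fit_normal(observations)
shuffled_fit = fit_normal(shuffled)

# The estimates agree (up to floating-point rounding) regardless of order.
same_fit = all(math.isclose(a, b, rel_tol=1e-12)
               for a, b in zip(original_fit, shuffled_fit))
print("Order affects the fit:", not same_fit)
```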

Sample Size

The rule of thumb is the more data you have, the better. In most cases, to get reliable distribution fitting results, you should have at least 75-100 data points available. Note that very large samples (tens of thousands of data points) might cause some computational problems when fitting distributions to data, and you might need to reduce the sample size by selecting a subset of your data.
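
A simple way to apply both guidelines is sketched below. The lower limit of 75 points comes from the rule of thumb above; the upper limit of 10,000 points is an assumption chosen for illustration, since the size at which computational problems appear depends on your software. The function names are hypothetical.

```python
import random

MIN_POINTS = 75        # lower end of the 75-100 rule of thumb
MAX_POINTS = 10_000    # assumed cap; tune it for your software


def is_small_sample(data, min_points=MIN_POINTS):
    """Return True if the sample is likely too small for a reliable fit."""
    return len(data) < min_points


def reduce_sample(data, max_points=MAX_POINTS, seed=0):
    """Return the data unchanged, or a random subset if it is very large."""
    if len(data) <= max_points:
        return list(data)
    # Sampling without replacement keeps the subset representative of the full
    # data set; a fixed seed makes the selection reproducible.
    return random.Random(seed).sample(list(data), max_points)


large_data_set = list(range(50_000))
subset = reduce_sample(large_data_set)
print(len(subset), is_small_sample(subset))
```

Selecting the subset at random, rather than taking the first values, avoids bias if the data happens to be stored in some meaningful order.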

Step 3 - Decide Which Distributions To Fit

Before fitting distributions to your data, you should decide which distributions are appropriate based on any additional information you have about the data. This helps you narrow your choice to a limited number of distributions before you actually perform distribution fitting.

Data Domain - Continuous or Discrete?

The easiest part is to determine whether your data is continuous or discrete. If your data can take on any real value (for example, 1.5 or -2.33), then you should consider continuous distributions only. On the other hand, if your data can take on only integer values (1, 2, -5 etc.), then you might want to fit both continuous and discrete distributions.

The reason to use continuous distributions to analyze discrete data is that there are many continuous distributions, and they frequently provide a much better fit than discrete distributions. However, if you are confident that your random data follows a certain discrete distribution, you might want to use that specific distribution rather than continuous models.
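
The rule above can be sketched as a small helper. Note the assumptions: the check looks only at the observed values, so a sample that happens to contain whole numbers is treated as discrete even if the underlying quantity is continuous, and the two lists of distribution families are illustrative examples, not a complete catalogue. The names are hypothetical.

```python
CONTINUOUS_FAMILIES = ["Normal", "Lognormal", "Gamma", "Weibull"]
DISCRETE_FAMILIES = ["Poisson", "Binomial", "Geometric"]


def is_integer_valued(data):
    """Return True if every observed value is a whole number."""
    return all(float(x).is_integer() for x in data)


def candidate_families(data):
    """Continuous families only for real-valued data; both kinds for integer data."""
    if is_integer_valued(data):
        return CONTINUOUS_FAMILIES + DISCRETE_FAMILIES
    return list(CONTINUOUS_FAMILIES)


print(candidate_families([1.5, -2.33, 0.7]))
print(candidate_families([1, 2, -5]))
```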

The Nature of Your Data

In most cases, you have more than just raw data: you also know something about the data and its properties, how it was collected, and so on. This information can be very useful for narrowing your choice to several probability distributions.

For example, if you are analyzing the sales data of a company, it should be clear that this kind of data cannot contain negative values (a company can sell at a loss, but the amount sold is never negative), and thus it wouldn't make much sense to fit distributions which can take on negative values (such as the Normal distribution) to your data.
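
This kind of reasoning can be expressed as a filter on the support of each candidate distribution. The sketch below assumes the standard forms of a handful of distributions (no location shift) and a lower bound that you know from the nature of the data, not one inferred from the sample. The table and function names are hypothetical.

```python
import math

# Lower bound of the support for the standard form of each distribution.
SUPPORT_LOWER_BOUND = {
    "Normal": -math.inf,
    "Logistic": -math.inf,
    "Exponential": 0.0,
    "Lognormal": 0.0,
    "Gamma": 0.0,
    "Weibull": 0.0,
}


def shortlist(known_lower_bound):
    """Keep only distributions that cannot produce values below the known bound."""
    return [name for name, lower in SUPPORT_LOWER_BOUND.items()
            if lower >= known_lower_bound]


# Sales figures cannot be negative, so the known lower bound is 0.
print(shortlist(0.0))
```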

In addition, some particular distributions are recommended for use in several specific industries. An obvious example of such an industry is reliability engineering which makes great use of the Weibull distribution and several additional models (Exponential, Lognormal, Gamma) to perform the analysis of failure data. These distributions are widely used in many other industries, but in reliability engineering they are considered "standard".
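
As an illustration of the reliability engineering case, the sketch below fits a two-parameter Weibull distribution to failure times by maximum likelihood, solving the likelihood equation for the shape parameter by bisection. The failure times are simulated, the search interval for the shape (0.01 to 100) is an assumption, and the function names are hypothetical; dedicated software will typically offer more robust fitting methods.

```python
import math
import random


def fit_weibull(times, tol=1e-9):
    """Maximum-likelihood fit of a two-parameter Weibull; returns (shape, scale).

    Requires positive failure times that are not all identical.
    """
    top = max(times)
    scaled = [t / top for t in times]       # rescale to (0, 1] to avoid overflow
    logs = [math.log(s) for s in scaled]
    mean_log = sum(logs) / len(logs)

    def score(k):
        # Likelihood equation for the shape k; it is increasing in k
        # and unaffected by the rescaling above.
        powers = [s ** k for s in scaled]
        weighted = sum(p * x for p, x in zip(powers, logs))
        return weighted / sum(powers) - 1.0 / k - mean_log

    lo, hi = 0.01, 100.0                    # assumed search interval for the shape
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if score(mid) > 0:
            hi = mid
        else:
            lo = mid
    shape = (lo + hi) / 2.0
    scale = top * (sum(s ** shape for s in scaled) / len(scaled)) ** (1.0 / shape)
    return shape, scale


def reliability(t, shape, scale):
    """Probability that a unit survives beyond time t under the fitted Weibull."""
    return math.exp(-((t / scale) ** shape))


# Simulated failure times in hours (true scale 1000, true shape 1.5).
rng = random.Random(1)
failure_times = [rng.weibullvariate(1000.0, 1.5) for _ in range(500)]

shape, scale = fit_weibull(failure_times)
print(f"shape = {shape:.3f}, scale = {scale:.1f}")
print(f"P(survive 500 hours) = {reliability(500.0, shape, scale):.3f}")
```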

Read Part II: Distribution Fitting - Analysis & Applications
