This article covers several steps you should consider taking before you analyze your probability data and apply the analysis results.
The very first step is to define what you are trying to achieve by analyzing your data. You should have a clear understanding of your goals as this will help you throughout the entire data analysis process. Try answering the following questions:
Example:
Robert is the head of the Customer Support Department at a large company. In order to reduce the customer service times and improve the customer experience, he would like to do the following:
Preparing your data for distribution fitting is one of the most important steps you should take, since the analysis results (and thus the decisions you make) depend on whether you correctly collect and specify the input data.
Your data might come in one of the generally accepted formats, depending on the source of data and how it was collected. You need to make sure the distribution fitting software you are using supports the data format you need, and if it doesn't, you might need to convert your data to one of the supported formats.
The most commonly used format in probability data analysis is an unordered set of values obtained by observing some random process. The order of values in a data set is not important and does not affect the distribution fitting results. This is one of the fundamental differences between distribution fitting (and probability data analysis in general) and time series analysis, where each data value is tied to the time point at which it was observed.
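As a quick illustration of why order is irrelevant, the sketch below fits a Normal distribution to a small sample and to a shuffled copy of it, using `statistics.NormalDist.from_samples` from the Python standard library. The observation values are made up for the example; the only point is that both fits produce the same parameters.

```python
import random
from statistics import NormalDist

# Hypothetical observations of some random process (illustrative values only).
observations = [4.2, 5.1, 3.8, 6.0, 4.9, 5.5, 4.4, 5.2, 4.7, 5.8]

# Make a reordered copy of the same data set.
shuffled = observations[:]
random.Random(0).shuffle(shuffled)

# Fit a Normal distribution to each ordering (sample mean and standard deviation).
fit_original = NormalDist.from_samples(observations)
fit_shuffled = NormalDist.from_samples(shuffled)

# Both lines print the same parameters (up to floating-point rounding).
print(fit_original.mean, fit_original.stdev)
print(fit_shuffled.mean, fit_shuffled.stdev)
```

A time series model, by contrast, would generally give different results for the two orderings, because it uses the position of each value.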
The rule of thumb is: the more data you have, the better. In most cases, you should have at least 75-100 data points to get reliable distribution fitting results. Note that very large samples (tens of thousands of data points) might cause computational problems when fitting distributions, in which case you might need to reduce the sample size by selecting a subset of your data.
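One simple way to reduce a very large sample is to draw a random subset without replacement. The sketch below assumes the data is already loaded into a Python list; the function name `reduce_sample`, the 5,000-point limit, and the simulated service times are all hypothetical choices made for the example.

```python
import random

def reduce_sample(data, max_points, seed=None):
    """Return the data unchanged if it is small enough, else a random subset."""
    if len(data) <= max_points:
        return list(data)
    rng = random.Random(seed)
    # Sampling without replacement uses each original observation at most once.
    return rng.sample(list(data), max_points)

# Hypothetical large data set: 50,000 simulated service times (in minutes).
rng = random.Random(1)
large_sample = [rng.expovariate(1 / 12.0) for _ in range(50_000)]

subset = reduce_sample(large_sample, max_points=5_000, seed=2)
print(len(large_sample), "->", len(subset))  # 50000 -> 5000
```

Because order does not matter in distribution fitting, a uniformly random subset is a reasonable choice here; taking only the first few thousand rows could introduce bias if the data was recorded in some systematic order.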
Before fitting distributions to your data, you should decide which distributions are appropriate based on any additional information you have about the data. This helps you narrow your choice to a limited number of distributions before you actually perform distribution fitting.
The easiest part is determining whether your data is continuous or discrete. If your data can take on any real value (for example, 1.5 or -2.33), you should consider continuous distributions only. On the other hand, if your data can take on only integer values (1, 2, -5, etc.), you might want to fit both continuous and discrete distributions.
The reason to use continuous distributions to analyze discrete data is that there are many continuous distributions, and they frequently provide a much better fit than discrete distributions. However, if you are confident that your random data follows a certain discrete distribution, you might want to use that specific distribution rather than continuous models.
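The continuous-versus-discrete check described above can be sketched in a few lines. The function name `candidate_families` is hypothetical, and the check is deliberately naive: it only tests whether every value is a whole number.

```python
def candidate_families(data):
    """Suggest which distribution families to consider for a data set."""
    all_integers = all(float(x).is_integer() for x in data)
    if all_integers:
        # Integer-only data: discrete models apply, and continuous ones may still fit well.
        return ["continuous", "discrete"]
    # Any non-integer value rules out integer-valued (discrete) models.
    return ["continuous"]

print(candidate_families([1.5, -2.33, 0.7]))  # ['continuous']
print(candidate_families([1, 2, -5, 3]))      # ['continuous', 'discrete']
```

Note that continuous measurements rounded to whole units would also pass the integer test, so the result should be read together with what you know about how the data was collected.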
In most cases, you have more than just raw data: you also know something about the data's properties, how it was collected, and so on. This information can be very useful for narrowing your choice to several probability distributions.
For example, if you are analyzing the sales data of a company, it should be clear that this kind of data cannot contain negative values (unless the company sells at a loss), and thus it wouldn't make much sense to fit distributions which can take on negative values (such as the Normal distribution) to your data.
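This kind of domain-based filtering can be expressed as a simple lookup. The sketch below is only an illustration: the table of distributions and their lower bounds is a short, hand-picked list (standard unshifted forms are assumed), and the names `SUPPORT_LOWER_BOUND` and `candidates_for_lower_bound` are hypothetical.

```python
import math

# Illustrative lower bounds of support for a few common distributions.
# This short list is an assumption made for the example, not a complete catalog.
SUPPORT_LOWER_BOUND = {
    "Normal": -math.inf,
    "Logistic": -math.inf,
    "Exponential": 0.0,
    "Lognormal": 0.0,
    "Gamma": 0.0,
    "Weibull": 0.0,
}

def candidates_for_lower_bound(domain_min):
    """Keep only distributions that cannot produce values below domain_min."""
    return [name for name, lower in SUPPORT_LOWER_BOUND.items() if lower >= domain_min]

# Sales figures are known to be non-negative, so the domain minimum is 0.
print(candidates_for_lower_bound(0.0))
# Data with no known lower limit keeps every candidate.
print(candidates_for_lower_bound(-math.inf))
```

With a domain minimum of 0, the Normal and Logistic distributions drop out and only the non-negative models remain.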
In addition, particular distributions are recommended for use in specific industries. An obvious example is reliability engineering, which makes great use of the Weibull distribution and several additional models (Exponential, Lognormal, Gamma) to analyze failure data. These distributions are widely used in many other industries, but in reliability engineering they are considered "standard".
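To make the reliability example concrete, here is a minimal sketch of fitting a two-parameter Weibull distribution to failure times by median rank regression, one common approach in reliability work (not necessarily the method your fitting software uses). It relies on the linearized Weibull CDF, ln(-ln(1 - F(t))) = k ln(t) - k ln(λ), and on Bernard's approximation for median ranks. The function name, the simulated failure times, and the "true" parameters (shape 2, scale 100) are assumptions made for the example.

```python
import math
import random

def fit_weibull_median_rank(failure_times):
    """Estimate Weibull (shape, scale) by least squares on the linearized CDF."""
    times = sorted(failure_times)
    if times[0] <= 0:
        raise ValueError("failure times must be positive")
    n = len(times)
    xs, ys = [], []
    for i, t in enumerate(times, start=1):
        f = (i - 0.3) / (n + 0.4)  # Bernard's approximation to the median rank
        xs.append(math.log(t))
        ys.append(math.log(-math.log(1.0 - f)))
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    shape = sxy / sxx                     # slope of the fitted line
    intercept = y_mean - shape * x_mean   # equals -shape * ln(scale)
    scale = math.exp(-intercept / shape)
    return shape, scale

# Simulated failure times (hours); weibullvariate takes (scale, shape).
rng = random.Random(42)
failures = [rng.weibullvariate(100.0, 2.0) for _ in range(2000)]

shape, scale = fit_weibull_median_rank(failures)
print(f"shape = {shape:.2f}, scale = {scale:.1f}")
```

With a sample of this size the estimates should land close to the parameters used in the simulation; with small samples, maximum likelihood estimation is often preferred over regression-based fits.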
Read Part II: Distribution Fitting - Analysis & Applications