How To Create And Use Histograms


Histograms are perhaps the most popular and intuitive way to display random data. They help you visualize the shape of your data before any distributions are fitted, and afterwards show how well a particular distribution fits your data.

Contents
What Is a Histogram?
Creating a Histogram
Displaying Fitted Distributions On Top Of a Histogram

What Is a Histogram?

A histogram is a graph that consists of a number of "bins", or vertical bars, into which the sample values are sorted.

The height of each histogram bar indicates how many of your data points fall into that bin, relative to the total number of data values, so this kind of chart is also called a relative frequency histogram.

Creating a Histogram

Even though EasyFit automatically creates histograms from your sample data, it is useful to understand how they are constructed.

The first step is to choose the number of bins, or classes, into which your data will be sorted. There are several ways to do this, and one of the most commonly used methods is to define the number of bins based on the total number of observations:

k = 1 + log₂N,

where N is the total number of data values, and k is the resulting number of bins. If you get a non-integer k using this formula, you should round it to the nearest integer. When using EasyFit, you can either have the number of bins calculated automatically, or manually specify it through the Options|Graph menu.
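As a quick illustration of the formula above (commonly known as Sturges' rule), the following Python sketch computes k for a few sample sizes. The function name sturges_bins is made up for this example and is not part of EasyFit.

```python
import math


def sturges_bins(n):
    """Return k = 1 + log2(N), rounded to the nearest integer.

    Hypothetical helper for illustration only.
    """
    if n < 1:
        raise ValueError("need at least one observation")
    return round(1 + math.log2(n))


for n in (10, 100, 1000):
    print(f"N = {n:4d}  ->  k = {sturges_bins(n)}")
```

Note how slowly k grows: going from 100 to 1000 observations adds only about three bins.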

The next step is to divide the entire range of your data from x_min to x_max into k intervals of equal width, and calculate how many values fall into each interval. And finally, the height of each bar is calculated as the number of data points falling into that interval, divided by the total number of observations.
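The binning step described above can be sketched in a few lines of Python. This is a minimal illustration, not EasyFit's actual implementation: the function name and the sample values are invented, and it assumes that a value equal to x_max is counted in the last bin.

```python
def relative_frequency_histogram(data, k):
    """Split [min, max] into k equal-width bins.

    Returns (edges, heights), where each height is the fraction of
    observations that fall into the corresponding bin.
    """
    x_min, x_max = min(data), max(data)
    if x_max == x_min:
        raise ValueError("all values are identical; bin width would be zero")
    width = (x_max - x_min) / k
    counts = [0] * k
    for x in data:
        # Assumption: the maximum value goes into the last bin
        # rather than opening a (k+1)-th one.
        i = min(int((x - x_min) / width), k - 1)
        counts[i] += 1
    n = len(data)
    edges = [x_min + i * width for i in range(k + 1)]
    heights = [c / n for c in counts]
    return edges, heights


# Made-up sample, for illustration only
sample = [2.1, 2.9, 3.4, 3.8, 4.0, 4.4, 4.7, 5.3, 6.2, 7.5]
edges, heights = relative_frequency_histogram(sample, 4)
print(heights)  # relative frequencies; they always sum to 1
```

Values lying exactly on an interior bin edge may land in either neighboring bin because of floating-point rounding; for a visual check this rarely matters.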

Note that the resulting bars must be displayed adjacent to each other, with no space between neighboring bars. Histograms are frequently confused with "bar charts", which are used to display categorical data. A bar chart can have non-numerical values on the x-axis, so neither the distance between the bars nor their order really matters. In a histogram, both do.

Displaying Fitted Distributions On Top Of a Histogram

Aside from using histograms for initial visual analysis, you can use them to compare the fitted distributions to your sample data, and possibly select the best fitting model, or at least reject the distributions that don't fit your data very well. To do this, plot the fitted Probability Density Functions, or PDFs, on top of your histogram.

For instance, the graph on the right indicates that the Gumbel distribution fits the data set much better than the Normal distribution.

How To Properly Display The Probability Density Function

One source of confusion is the fact that the PDF, like any regular function with fixed parameters, has a constant shape, while the appearance of a histogram changes with the number of bins. Using a larger number of bins makes your histogram more detailed, but it also lowers the bars, because each narrower bin captures a smaller share of the data (note the y-axis values):
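The effect is easy to reproduce numerically. The sketch below bins the same simulated sample with 5, 10, and 20 bins and reports the tallest bar each time. The data, the seed, and the helper name bar_heights are arbitrary choices made for this illustration.

```python
import random


def bar_heights(data, k):
    """Relative frequencies for k equal-width bins over [min, max]."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / k
    counts = [0] * k
    for x in data:
        # Clamp so that the maximum value falls into the last bin
        counts[min(int((x - lo) / width), k - 1)] += 1
    return [c / len(data) for c in counts]


random.seed(0)  # arbitrary seed, used only to make the run repeatable
data = [random.gauss(0.0, 1.0) for _ in range(1000)]

for k in (5, 10, 20):
    h = bar_heights(data, k)
    print(f"k = {k:2d}  tallest bar = {max(h):.3f}  sum of heights = {sum(h):.3f}")
```

The sum of the heights stays at 1 regardless of k, while the tallest bar shrinks as the bins get narrower.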

To correctly display the PDF on top of a histogram, it must be scaled depending on the number of bins. Assuming that W is the bin width, and n_i (i=1...k) is the number of data values falling into each bin, we can calculate the total area of the histogram:

Area = Σ (W * n_i / N) = (W / N) * Σ n_i = (W / N) * N = W

By definition, however, the area under a Probability Density Function graph equals 1, whereas the area of the histogram equals W. The theoretical PDF(x) values therefore have to be multiplied by the bin width W to match the histogram. Of course, this new W*PDF(x) function is no longer the "real" PDF, but it keeps the same shape, which is what matters when comparing it against the histogram.
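To make the scaling concrete, here is a minimal Python sketch. It assumes a Normal model fitted by the sample mean and standard deviation, which is a simplification chosen for this example and not necessarily how EasyFit estimates parameters. The simulated data, the seed, and the names used are likewise illustrative.

```python
import math
import random
import statistics


def normal_pdf(x, mu, sigma):
    """Density of the Normal distribution with mean mu and std. dev. sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


random.seed(1)  # arbitrary seed for repeatability
data = [random.gauss(10.0, 2.0) for _ in range(2000)]
N = len(data)

# Relative frequency histogram with k = 1 + log2(N) bins
k = round(1 + math.log2(N))
lo, hi = min(data), max(data)
W = (hi - lo) / k  # bin width
counts = [0] * k
for x in data:
    counts[min(int((x - lo) / W), k - 1)] += 1
heights = [c / N for c in counts]

# Total histogram area: sum of W * n_i / N, which should equal W
area = sum(W * h for h in heights)

# Assumed fitting method: simple moment estimates of the Normal parameters
mu = statistics.mean(data)
sigma = statistics.stdev(data)

# Scaled density W * PDF(x), evaluated at each bin midpoint
scaled = [W * normal_pdf(lo + (i + 0.5) * W, mu, sigma) for i in range(k)]

print(f"bin width W = {W:.4f}, histogram area = {area:.4f}")
for h, s in zip(heights, scaled):
    print(f"bar height = {h:.3f}   W*PDF(midpoint) = {s:.3f}")
```

The unscaled PDF values would be smaller or larger than the bars by a factor of W, so plotting them directly over the histogram would make even a well-fitting distribution look wrong.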

EasyFit automatically scales the density curve based on the number of histogram bins, allowing you to visually identify the distributions that fit your data well. If you need to see the original unscaled PDF graph of a fitted distribution, you can use StatAssist, the built-in distribution viewer tool.

Conclusions

Histograms are widely used for visualizing and analyzing random data. However, extra care must be taken not to confuse histograms with bar charts, and to properly scale the Probability Density Function graphs when displaying them on top of histograms. With EasyFit, you can easily create histograms with a variable number of bins, and overlay one or several fitted distributions to compare them against your sample data.
