A Simple Intuition Behind The Normal Distribution Equation
Ever since I learned about the normal distribution equation in school, it has looked menacing to me. I could not memorize it and I could not make any sense of it. I can kind of understand why the equation includes sigma, the mean, and the x’s. But where did e come from? And pi? And why are there strange fractions everywhere?
In this article we are going to derive the equation for the normal distribution (pdf) from scratch, step by step, using simple examples. I guarantee that by the end of this article you will be able to do it on your own, with only minimal algebra and probability knowledge required.
Table of Contents:
- From Darts to Exponents
- A “Normal Distribution” with Two Parameters
- From Two to One
- From One to Spread
- The Final Form: Spread and Mean
The full code is available on my GitHub.
1. From Darts to Exponents
Let’s imagine we are throwing darts at a dartboard. As we do this, we record each dart’s position on the board. Below is a scatter plot of the final result, where the bullseye is centered on the origin (0, 0).
As we might expect, most of the darts cluster in close proximity to the target. As we move further away from the bullseye, fewer and fewer darts land there. How can we come up with a function that calculates the probability of a dart landing at a particular location on the board? In other words, how can we define a probability density function (pdf) that takes a location on the board and tells us how likely a dart is to land there?
Let’s define a pdf called phi that takes a point on the dartboard and gives the probability density of darts at that point. A density only becomes a probability once it is multiplied by an area, so we also define a small box dA around the point to describe how near to that point we want to find the darts.
Then, the probability of finding a dart inside that box is the density multiplied by the area of the box, that is, phi(r) dA, where r is the distance of the box from the origin.
Please note that boxes that lie closer to the origin will produce a higher probability, which follows from our observation that there are more darts closer to the bullseye. Also, if two boxes are the same distance away from the origin, then the larger box will have a higher probability, since it has a larger area.
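The two observations about boxes can be checked with a small simulation. This is only a sketch: the helper names `simulate_throws` and `fraction_in_box` are made up for illustration, and the throws are drawn from Python's `random.gauss` purely as a convenient stand-in for a real player's scatter.

```python
import random

def simulate_throws(n, spread=1.0, seed=0):
    """Return n simulated (x, y) dart positions scattered around the bullseye.

    Assumption for illustration only: each coordinate is drawn independently
    with random.gauss, which clusters points around (0, 0).
    """
    rng = random.Random(seed)
    return [(rng.gauss(0.0, spread), rng.gauss(0.0, spread)) for _ in range(n)]

def fraction_in_box(throws, cx, cy, side):
    """Fraction of throws landing in a square box of the given side centered at (cx, cy)."""
    half = side / 2.0
    hits = sum(1 for x, y in throws if abs(x - cx) <= half and abs(y - cy) <= half)
    return hits / len(throws)

throws = simulate_throws(100_000)

near_small = fraction_in_box(throws, 0.0, 0.0, 0.5)  # small box on the bullseye
far_small = fraction_in_box(throws, 2.0, 0.0, 0.5)   # same size, further away
far_large = fraction_in_box(throws, 2.0, 0.0, 1.0)   # same place, larger area

print(f"small box at the origin:  {near_small:.4f}")
print(f"small box at distance 2:  {far_small:.4f}")
print(f"large box at distance 2:  {far_large:.4f}")
```

The box on the bullseye should collect the largest share of darts, and at the same distance the larger box should collect more than the smaller one.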
You might have noticed that I use the word distance as opposed to the specific coordinates x and y. This is because we assume the board has no preferred direction: the density depends only on how far a point is from the bullseye, not on the angle. It follows that if you rotate a box dA about the origin, preserving its distance r and its area, you will get the same value from phi. We also make a second assumption: x and y are independent of each other, so knowing one does not tell you anything about the other. If your dart lands too high, say at y=2.5, that tells you nothing about which side of the board (i.e., which x) it is going to hit.
Let’s define a new pdf f(x) that takes a coordinate x and gives the probability density of a dart at that coordinate. Rewriting the previous formula, we get
where f can be thought of as a pdf operating on the x and y coordinates separately. Since x and y are independent, we can safely multiply their densities like this, following the product rule for independent events. Cancelling dA on both sides, we get the following:
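In symbols, the two steps described here can be sketched as follows. The notation is an assumption on my part: r is the distance from the origin, phi is written as \varphi, and the product of the two coordinate densities is placed on the left.

```latex
\[
  f(x)\,f(y)\,dA = \varphi(r)\,dA
\]
\[
  f(x)\,f(y) = \varphi(r)
\]
```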
Recall that, by the Pythagorean theorem, the hypotenuse of a right triangle equals the square root of the sum of the squares of its two legs. Graphically, this can be depicted like so
Using this information, we can rewrite the previous equation like this
Let’s set y=0 and plug it into our equation
Since f(0) is some unknown constant, let’s name it lambda
The resultant equation tells us that the function phi operating on some input x is a scalar multiple of f operating on the same input. Using this property, let’s rewrite the previous equation like so
In other words,
Let’s divide both sides by lambda squared and simplify a bit
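The chain of steps from the Pythagorean substitution to this point can be written out as below. This is a reconstruction under assumed notation: \varphi for phi, \lambda = f(0), and g(x) as shorthand for f(x) divided by \lambda.

```latex
\[
  f(x)\,f(y) = \varphi\!\left(\sqrt{x^2 + y^2}\right)
\]
\[
  y = 0:\qquad f(x)\,f(0) = \varphi(x)
  \quad\Longrightarrow\quad \varphi(x) = \lambda\, f(x)
\]
\[
  f(x)\,f(y) = \lambda\, f\!\left(\sqrt{x^2 + y^2}\right)
\]
\[
  \frac{f(x)}{\lambda}\cdot\frac{f(y)}{\lambda}
  = \frac{f\!\left(\sqrt{x^2 + y^2}\right)}{\lambda}
  \quad\Longrightarrow\quad
  g(x)\,g(y) = g\!\left(\sqrt{x^2 + y^2}\right),
  \qquad g(x) = \frac{f(x)}{\lambda}
\]
```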
Writing g(x) for f(x) divided by lambda, our goal now is to find a function g that satisfies the property above. That is, if we multiply g applied to x by g applied to y, we should get g applied to the square root of the sum of the squares of x and y. To find such a function, let’s consider a simpler case: a function h for which h(x) times h(y) equals h(x+y).
What kind of functions have this property? Recalling high school algebra, we can notice that exponential functions satisfy it. For example,
Although the function h looks similar to the function g, they are not exactly the same. The function g has a square root on the right-hand side of the equation. Therefore, to get rid of it, we can square the inputs of the function g before exponentiating. Squaring x gives us x², squaring y gives us y², and squaring sqrt(x²+y²) gives us x²+y², which is exactly the sum of the other two. We are ready to substitute these values back
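A quick numeric check makes the squaring trick concrete. This is a sketch that assumes h(t) = 5^t as the example exponential and g(x) = h(x²); the helper name `satisfies_product_rule` is made up for illustration.

```python
import math

def h(t):
    """Example exponential with base 5: h(a + b) equals h(a) * h(b)."""
    return 5.0 ** t

def g(x):
    """Square the input first, then exponentiate: g(x) = 5 ** (x ** 2)."""
    return h(x * x)

def satisfies_product_rule(x, y):
    """Check that g(x) * g(y) matches g(sqrt(x^2 + y^2)) up to rounding."""
    lhs = g(x) * g(y)
    rhs = g(math.hypot(x, y))  # math.hypot(x, y) is sqrt(x*x + y*y)
    return math.isclose(lhs, rhs, rel_tol=1e-9)

for x, y in [(0.3, 1.2), (-2.0, 0.5), (1.5, 1.5)]:
    print(x, y, satisfies_product_rule(x, y))
```

The sign of x or y does not matter, because g only ever sees the squared input.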
You might notice that we chose the base 5 for our formula. In fact, any positive base, such as 2, 5 or e, works. To cover all of them at once, let’s write the base as e^A, where A is some unknown constant.
Substituting back into g we finally get the following
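Written out, the result of this substitution would look as follows. This is a reconstruction that assumes g(x) stands for f(x) divided by \lambda, with A the unknown constant in the exponent.

```latex
\[
  g(x) = e^{A x^2}
  \qquad\Longrightarrow\qquad
  f(x) = \lambda\, g(x) = \lambda\, e^{A x^2}
\]
```

As a check, g(x) g(y) = e^{A x^2} e^{A y^2} = e^{A (x^2 + y^2)}, which is g applied to \sqrt{x^2 + y^2}.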
2. A “Normal Distribution” with Two Parameters
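Let’s plot the resultant distribution using various values for lambda and A
Oh no, our distribution is upside down: with a positive A the curve grows as we move away from the origin instead of dying out. The constant A has to be negative, so let’s fix that by substituting the following for A, where h is a new unknown constant: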
We get the following equation
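In symbols, the substitution and the resulting equation can be sketched like this (a reconstruction; squaring h guarantees that the exponent is never positive):

```latex
\[
  A = -h^2
  \qquad\Longrightarrow\qquad
  f(x) = \lambda\, e^{-h^2 x^2}
\]
```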
Now, by plotting our distributions with various values of lambda and h, we can observe how the shape of the distribution changes depending on those two parameters.
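Without a plot, the roles of the two parameters can still be seen from a few numbers. The sketch below assumes the two-parameter form f(x) = lambda * exp(-h² x²); the function names are made up for illustration. It suggests that lambda sets the height of the peak, while h alone sets how quickly the curve falls to half of that height.

```python
import math

def two_parameter_curve(x, lam, h):
    """Assumed two-parameter form: lam * exp(-(h * x) ** 2)."""
    return lam * math.exp(-(h * x) ** 2)

def half_width_at_half_height(h):
    """Distance from 0 at which the curve drops to half its peak: sqrt(ln 2) / h."""
    return math.sqrt(math.log(2.0)) / h

for lam, h in [(1.0, 1.0), (2.0, 1.0), (1.0, 3.0)]:
    peak = two_parameter_curve(0.0, lam, h)
    width = half_width_at_half_height(h)
    print(f"lambda={lam}, h={h}: peak={peak:.3f}, half-width={width:.3f}")
```

Doubling lambda doubles the peak and leaves the width alone, whereas a larger h narrows the curve without changing the peak.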
3. From Two to One
So far so good, the curve looks like a normal distribution. However, one problem still persists: for this curve to be considered a true pdf, the area under our curve should be equal to 1. Mathematically,
Let’s solve this integral. Substituting our equation we get
After the change of variable u = hx, the remaining integral is called the Gaussian integral and is equal to the square root of pi. I don’t know why that is, but the complete derivation can be looked up on Wikipedia. So we get
Let’s substitute this expression for h² into the equation we derived earlier. We finally get
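The normalization steps can be sketched in symbols as follows. This is a reconstruction that assumes the two-parameter form f(x) = \lambda e^{-h^2 x^2} and the change of variable u = hx.

```latex
\[
  \int_{-\infty}^{\infty} \lambda\, e^{-h^2 x^2}\, dx
  = \frac{\lambda}{h} \int_{-\infty}^{\infty} e^{-u^2}\, du
  = \frac{\lambda \sqrt{\pi}}{h} = 1
\]
\[
  h = \lambda \sqrt{\pi}
  \quad\Longrightarrow\quad
  h^2 = \pi \lambda^2
  \quad\Longrightarrow\quad
  f(x) = \lambda\, e^{-\pi \lambda^2 x^2}
\]
```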
We did two important things. First, we got rid of the unknown h, leaving only a single unknown, lambda. Second, since we required the integral to equal 1, the area under the curve is now guaranteed to be 1. With only the single parameter lambda left, we can see how, unlike in our previous examples, the width of the distribution now depends on the value of lambda.
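The claim about the area can be checked numerically. The sketch below assumes the one-parameter form f(x) = lambda * exp(-pi * lambda² * x²) and uses a plain trapezoidal rule; the function names and the integration range are illustrative choices, not part of the derivation.

```python
import math

def one_parameter_pdf(x, lam):
    """Assumed one-parameter form: lam * exp(-pi * lam^2 * x^2)."""
    return lam * math.exp(-math.pi * lam ** 2 * x ** 2)

def area_under_curve(lam, steps=20_000):
    """Trapezoidal estimate of the integral over [-10/lam, 10/lam].

    The curve is vanishingly small outside this range, so the finite
    interval stands in for the whole real line.
    """
    limit = 10.0 / lam
    dx = 2.0 * limit / steps
    total = 0.5 * (one_parameter_pdf(-limit, lam) + one_parameter_pdf(limit, lam))
    for i in range(1, steps):
        total += one_parameter_pdf(-limit + i * dx, lam)
    return total * dx

for lam in (0.2, 1.0, 3.0):
    print(f"lambda={lam}: area is approximately {area_under_curve(lam):.6f}")
```

Whatever positive lambda is chosen, the estimate should come out very close to 1: a taller peak is compensated by a narrower curve.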
4. From One to Spread
What lambda essentially controls is the spread of the values (the variance, sigma²). The higher the value of lambda, the lower the spread and the narrower the distribution. Let’s define that relationship between lambda and sigma mathematically.
We know that the variance of a pdf is calculated by the following formula:
Substituting for the f(x) gives us
Now we can clearly see that lambda and sigma are inversely proportional: the higher the lambda, the lower the sigma, and vice versa.
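The inverse relationship can also be checked numerically. This sketch assumes the one-parameter form f(x) = lambda * exp(-pi * lambda² * x²), centered on 0, and compares a trapezoidal estimate of the variance with the closed form 1 / (2 * pi * lambda²); the function names are made up for illustration.

```python
import math

def one_parameter_pdf(x, lam):
    """Assumed one-parameter form: lam * exp(-pi * lam^2 * x^2)."""
    return lam * math.exp(-math.pi * lam ** 2 * x ** 2)

def numeric_variance(lam, steps=20_000):
    """Trapezoidal estimate of the integral of x^2 * f(x) over [-10/lam, 10/lam].

    The mean is 0 here, so this integral is the variance.
    """
    limit = 10.0 / lam
    dx = 2.0 * limit / steps
    edge = limit ** 2 * one_parameter_pdf(limit, lam)
    total = 0.5 * (edge + edge)  # the integrand is symmetric about 0
    for i in range(1, steps):
        x = -limit + i * dx
        total += x * x * one_parameter_pdf(x, lam)
    return total * dx

def closed_form_variance(lam):
    """Variance implied by the derivation: 1 / (2 * pi * lam^2)."""
    return 1.0 / (2.0 * math.pi * lam ** 2)

for lam in (0.2, 1.0, 3.0):
    print(lam, numeric_variance(lam), closed_form_variance(lam))
```

Solving the closed form for lambda gives lambda = 1 / (sigma * sqrt(2 * pi)), which is the inverse proportionality described above.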
Performing the final substitution for lambda we get
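The variance calculation and the final substitution can be sketched in symbols as follows. This is a reconstruction that assumes a mean of 0 and the one-parameter form f(x) = \lambda e^{-\pi \lambda^2 x^2}.

```latex
\[
  \sigma^2 = \int_{-\infty}^{\infty} x^2 f(x)\, dx
  = \int_{-\infty}^{\infty} x^2\, \lambda\, e^{-\pi \lambda^2 x^2}\, dx
  = \frac{1}{2 \pi \lambda^2}
\]
\[
  \lambda = \frac{1}{\sigma \sqrt{2\pi}}
  \qquad\Longrightarrow\qquad
  f(x) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{x^2}{2\sigma^2}}
\]
```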
5. The Final Form: Spread and Mean
The formula above describes a normal distribution that is centered on 0. That is, the mean is 0. Suppose we want to be able to center it on different values. From the perspective of the distribution, it means shifting it along the x-axis to the right (positive mean) or to the left (negative mean).
In other words, we need to subtract a constant value mu from each of the x’s. As the previous equation shows, there is only one x in that equation, so we simply modify it like so
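With x replaced by x minus \mu, the equation takes the familiar form below (written in standard notation, with \mu as the mean and \sigma as the standard deviation):

```latex
\[
  f(x) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}
\]
```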
Recall that to move a curve along the x-axis to the right we need to subtract a constant from x, and to move it to the left we need to add one. Since we want positive means to be on the right, we subtract mu from the x’s.
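As a final sanity check, the derived formula can be compared against Python's standard library. The function `normal_pdf` below is a hand-written sketch of the final equation, and `statistics.NormalDist` (available since Python 3.8) serves as the reference.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal density: 1 / (sigma * sqrt(2*pi)) * exp(-(x - mu)^2 / (2 * sigma^2))."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)

for mu, sigma in [(0.0, 1.0), (2.0, 0.5), (-1.0, 3.0)]:
    reference = NormalDist(mu, sigma)
    for x in (-2.0, 0.0, 1.5):
        ours = normal_pdf(x, mu, sigma)
        print(f"mu={mu}, sigma={sigma}, x={x}: {ours:.6f} vs {reference.pdf(x):.6f}")
```

The two columns should agree, and for any choice of mu the peak of the curve sits at x = mu.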
And this is it, we have finally arrived at the final equation for the normal distribution. I hope this article helped you to gain a bit of intuition behind the normal distribution formula!