PDF (Probability Density Function) and PMF (Probability Mass Function)

These functions help describe the distribution of random variables and form the backbone of statistical modeling, machine learning, and data-driven decision-making.

  • Probability Mass Function (PMF) – Used for discrete random variables, e.g., dice rolls.
  • Probability Density Function (PDF) – Used for continuous random variables, e.g., height measurements.

Probability Mass Function (PMF)

A Probability Mass Function (PMF) is a function that gives the probability of a discrete random variable being exactly equal to some value. It is used for discrete random variables, where the set of possible outcomes is countable.

Key Properties of PMF

  • The PMF assigns probabilities to each possible outcome.
  • The sum of all the probabilities for all possible outcomes must equal 1.
  • It is defined for discrete random variables, such as counts or categories.

Mathematical Definition

Let X be a discrete random variable with possible values x_1, x_2, \dots, x_n. The PMF of X, denoted P(X = x), satisfies the following conditions:

  1. 0 \le P(X = x) \le 1
  2. \sum_{x} P(X = x) = 1

Example of PMF

Let’s consider a fair six-sided die. The outcome of rolling the die is a discrete random variable that can take values 1, 2, 3, 4, 5, or 6, each with an equal probability of occurring.

For a fair die, the PMF is:

    \[ P(X = x) = \frac{1}{6}, \quad x \in \{1, 2, 3, 4, 5, 6\} \]

import numpy as np
import matplotlib.pyplot as plt

# Define the possible outcomes of a fair 6-sided die
outcomes = np.array([1, 2, 3, 4, 5, 6])

# Probability for each outcome 
# (fair die, so each has equal probability)
probabilities = np.array([1/6] * 6)

# Plot the PMF
plt.bar(outcomes, probabilities, width=0.5, color='skyblue')
plt.title("Probability Mass Function of a Fair 6-Sided Die")
plt.xlabel("Outcome")
plt.ylabel("Probability")
plt.show()
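
As a quick sanity check (this simulation is a sketch, not part of the original example), we can roll the die many times and confirm that the empirical frequencies converge to the theoretical PMF value of 1/6:

import numpy as np

# Simulate 10,000 rolls and compare empirical frequencies to the PMF
rng = np.random.default_rng(0)  # seed chosen only for reproducibility
rolls = rng.integers(1, 7, size=10_000)  # integers(1, 7) draws from {1, ..., 6}
values, counts = np.unique(rolls, return_counts=True)
print(dict(zip(values, counts / len(rolls))))  # each frequency is close to 1/6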

Probability Density Function (PDF)

A Probability Density Function (PDF) is used for continuous random variables and describes the probability of the variable falling within a particular range, rather than taking a single value.

Mathematical Definition

For a continuous random variable X with PDF f(x), the probability that X falls within an interval [a, b] is given by:

    \[ P(a \leq X \leq b) = \int_a^b f(x) \, dx \]
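
As a concrete illustration (a sketch assuming a standard normal distribution, using scipy.stats, which the examples above do not import), this integral can be evaluated via the cumulative distribution function rather than by integrating manually:

from scipy.stats import norm

# P(a <= X <= b) for a standard normal variable (mu = 0, sigma = 1)
a, b = -1, 1
prob = norm.cdf(b) - norm.cdf(a)
print(f"P({a} <= X <= {b}) = {prob:.4f}")  # approximately 0.6827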

Key Properties of PDF

  1. f(x) \geq 0 \quad \text{for all } x

  2. The total area under the curve is 1:

    \[ \int_{-\infty}^{\infty} f(x) \, dx = 1 \]

Unlike a PMF, the value of f(x) is not itself a probability but a density: it can exceed 1, and probabilities are obtained only by integrating f(x) over an interval.
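
To make the distinction concrete, here is a small sketch (sigma = 0.1 is an illustrative choice) showing that a density can exceed 1 while probabilities never do:

from scipy.stats import norm

# For a narrow normal distribution (sigma = 0.1), the peak density exceeds 1,
# which would be impossible for a probability
print(norm.pdf(0, loc=0, scale=0.1))  # approximately 3.99

# Probabilities come from integrating the density over an interval
print(norm.cdf(0.1, loc=0, scale=0.1) - norm.cdf(-0.1, loc=0, scale=0.1))  # approximately 0.68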

Example: Normal Distribution

A commonly used PDF in statistics is the Normal Distribution:

    \[ f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}} \]

where:

  • \mu is the mean,
  • \sigma is the standard deviation.
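
Mirroring the PMF plot above, the following sketch plots this density for the standard normal case (mu = 0 and sigma = 1 are illustrative parameter choices), using scipy.stats.norm:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

mu, sigma = 0, 1  # standard normal parameters

# Evaluate the density on a grid covering +/- 4 standard deviations
x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 200)
plt.plot(x, norm.pdf(x, loc=mu, scale=sigma), color='skyblue')
plt.title("Probability Density Function of a Normal Distribution")
plt.xlabel("x")
plt.ylabel("Density")
plt.show()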

Relevance to Data Science

Understanding PDFs and PMFs is essential for:

  1. Statistical Modeling:

    • PMFs are used in discrete models such as Poisson regression and models built on the binomial distribution.
    • PDFs are used in continuous models like linear regression (assuming normal errors).
  2. Machine Learning:

    • The Naïve Bayes classifier assumes a Gaussian (normal) distribution for continuous features, making the PDF crucial.
    • Generative models (e.g., Variational Autoencoders) rely on PDFs to model data distributions.
  3. Anomaly Detection:

    • Many anomaly detection techniques assume normality in the data distribution and use PDFs to flag low-density points as outliers, as the sketch after this list shows.
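
The sketch below illustrates the idea behind point 3 on toy data (the injected outliers, the fitted-normal assumption, and the 1e-4 density threshold are all illustrative choices, not a production detector):

import numpy as np
from scipy.stats import norm

# Toy Gaussian anomaly detector: fit mu and sigma to the data,
# then flag points whose density falls below a threshold
rng = np.random.default_rng(42)
data = np.append(rng.normal(0, 1, 500), [6.0, -5.5])  # two injected outliers
mu, sigma = data.mean(), data.std()

densities = norm.pdf(data, loc=mu, scale=sigma)
outliers = data[densities < 1e-4]  # threshold is an arbitrary illustrative choice
print("Flagged outliers:", outliers)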

Example in Data Science: Naïve Bayes Classifier

A Gaussian Naïve Bayes classifier assumes that, within each class y, each continuous feature x_i follows a normal distribution with class-specific parameters \mu_y and \sigma_y:

    \[ P(x_i \mid y) = \frac{1}{\sigma_y \sqrt{2\pi}} \, e^{-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}} \]

from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Generate synthetic data
X, y = make_classification(n_samples=1000, n_features=2, n_classes=2, random_state=42)

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Naïve Bayes Classifier
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Predictions
y_pred = gnb.predict(X_test)

# Accuracy
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
