These functions help describe the distribution of random variables and form the backbone of statistical modeling, machine learning, and data-driven decision-making.
- Probability Mass Function (PMF) – Used for discrete random variables, e.g., dice rolls.
- Probability Density Function (PDF) – Used for continuous random variables, e.g., height measurements.
Probability Mass Function (PMF)
A Probability Mass Function (PMF) is a function that gives the probability of a discrete random variable being exactly equal to some value. It is used for discrete random variables, where the set of possible outcomes is countable.
Key Properties of PMF
- The PMF assigns probabilities to each possible outcome.
- The sum of all the probabilities for all possible outcomes must equal 1.
- It is defined for discrete random variables, such as counting numbers or categories.
Mathematical Definition
Let X be a discrete random variable with possible values $x_1, x_2, \dots, x_n$. The PMF of X, denoted $P(X = x)$, satisfies the following conditions:
$$P(X = x_i) \ge 0 \quad \text{for all } i$$
$$\sum_{i=1}^{n} P(X = x_i) = 1$$
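The requirements on a PMF (no negative probabilities, and all probabilities summing to 1) can be checked mechanically. The sketch below is a minimal illustration; the helper name `is_valid_pmf`, the dictionary representation, and the tolerance are assumptions made for this example, not part of any library.

```python
import math

def is_valid_pmf(pmf, tol=1e-9):
    """Return True if every probability is non-negative and they sum to 1.

    `pmf` is a dict mapping each outcome to its probability
    (a hypothetical representation chosen for this sketch).
    """
    probs = list(pmf.values())
    non_negative = all(p >= 0 for p in probs)
    sums_to_one = math.isclose(sum(probs), 1.0, abs_tol=tol)
    return non_negative and sums_to_one

fair_coin = {"H": 0.5, "T": 0.5}
broken = {"H": 0.7, "T": 0.5}   # sums to 1.2, so not a PMF

print(is_valid_pmf(fair_coin))
print(is_valid_pmf(broken))
```

A tolerance is used instead of an exact comparison because floating-point sums such as six copies of 1/6 may not equal 1 exactly.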
Example of PMF
Let’s consider a fair six-sided die. The outcome of rolling the die is a discrete random variable that can take values 1, 2, 3, 4, 5, or 6, each with an equal probability of occurring.
For a fair die, the PMF is:
$$P(X = x) = \frac{1}{6}, \quad x \in \{1, 2, 3, 4, 5, 6\}$$
import numpy as np
import matplotlib.pyplot as plt
# Define the possible outcomes of a fair 6-sided die
outcomes = np.array([1, 2, 3, 4, 5, 6])
# Probability for each outcome
# (fair die, so each has equal probability)
probabilities = np.array([1/6] * 6)
# Plot the PMF
plt.bar(outcomes, probabilities, width=0.5, color='skyblue')
plt.title("Probability Mass Function of a Fair 6-Sided Die")
plt.xlabel("Outcome")
plt.ylabel("Probability")
plt.show()
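One way to connect the theoretical PMF to data is to simulate many rolls and compare the observed frequencies with 1/6. The following is a rough sketch using only the standard library; the seed and the number of rolls are arbitrary choices for illustration.

```python
import random
from collections import Counter

random.seed(0)        # fixed seed so the run is repeatable
n_rolls = 60_000      # arbitrary sample size for this sketch

# Simulate rolls of a fair six-sided die
rolls = [random.randint(1, 6) for _ in range(n_rolls)]
counts = Counter(rolls)

# Empirical PMF: relative frequency of each face
empirical_pmf = {face: counts[face] / n_rolls for face in range(1, 7)}

for face, p in empirical_pmf.items():
    print(f"P(X={face}) ~ {p:.3f} (theoretical {1/6:.3f})")
```

With a large number of rolls the relative frequencies should settle close to the theoretical value, although they will rarely match it exactly.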
Probability Density Function (PDF)
A Probability Density Function (PDF) is used for continuous random variables and describes the probability of the variable falling within a particular range, rather than taking a single value.
Mathematical Definition
For a continuous random variable X with PDF f(x), the probability that X falls within an interval [a, b] is given by:
$$P(a \le X \le b) = \int_{a}^{b} f(x)\,dx$$
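Because an interval probability is the area under the density between a and b, it can be approximated numerically. The sketch below uses the trapezoidal rule on an exponential density, chosen here only because its exact answer is easy to write down; the function names and the interval [0.5, 2.0] are assumptions for this example.

```python
import math

def exp_pdf(x, rate=1.0):
    """Density of an exponential distribution (used only as an illustration)."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0

def interval_probability(pdf, a, b, n=10_000):
    """Approximate the integral of `pdf` over [a, b] with the trapezoidal rule."""
    h = (b - a) / n
    total = 0.5 * (pdf(a) + pdf(b))
    for i in range(1, n):
        total += pdf(a + i * h)
    return total * h

approx = interval_probability(exp_pdf, 0.5, 2.0)
exact = math.exp(-0.5) - math.exp(-2.0)   # closed form for rate = 1

print(f"approx = {approx:.6f}, exact = {exact:.6f}")
```

In practice a library routine for numerical integration or a distribution's cumulative distribution function would usually be preferred; the loop is spelled out here to show what is being computed.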
Key Properties of PDF
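- The density is non-negative: $f(x) \ge 0$ for all $x$.
- The total area under the curve is 1: $\int_{-\infty}^{\infty} f(x)\,dx = 1$
Unlike a PMF, the value of f(x) is a density rather than a probability. It can exceed 1, and the probability that X equals any single exact value is 0.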
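A small example makes the difference between density and probability concrete. For a uniform distribution on [0, 0.5] (an illustrative choice, not taken from the text above), the density is 2 everywhere on the interval, yet the total area is still 1.

```python
def uniform_pdf(x, low=0.0, high=0.5):
    """Density of a uniform distribution on [low, high]."""
    return 1.0 / (high - low) if low <= x <= high else 0.0

# The density value itself is larger than 1 ...
density_at_point = uniform_pdf(0.25)

# ... but the area (height * width of the rectangle) is still 1
total_area = uniform_pdf(0.25) * (0.5 - 0.0)

# Probability of landing in [0.1, 0.2] = density * interval width
prob_interval = uniform_pdf(0.15) * (0.2 - 0.1)

print(density_at_point, total_area, prob_interval)
```

Probabilities only appear once the density is multiplied by (or integrated over) an interval width.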
Example: Normal Distribution
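A commonly used PDF in statistics is that of the normal distribution:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
where:
$\mu$ is the mean,
$\sigma$ is the standard deviation.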
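The normal density can be evaluated directly from its mean and standard deviation. The sketch below writes it out by hand and compares the result with `statistics.NormalDist` from the Python standard library; the function name `normal_pdf` is an assumption made for this example.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Evaluate the normal density with mean `mu` and standard deviation `sigma`."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Compare the hand-written formula with the standard library
x, mu, sigma = 1.2, 1.0, 0.5
by_hand = normal_pdf(x, mu, sigma)
by_stdlib = NormalDist(mu, sigma).pdf(x)

print(f"by hand: {by_hand:.6f}, stdlib: {by_stdlib:.6f}")
```

The density peaks at the mean, and a smaller standard deviation gives a taller, narrower curve.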
Relevance to Data Science
Understanding PDFs and PMFs is essential for:
- Statistical Modeling:
  - PMFs are used in discrete models such as Poisson regression and binomial models.
  - PDFs are used in continuous models such as linear regression (assuming normal errors).
- Machine Learning:
  - The Gaussian Naïve Bayes classifier assumes a Gaussian (normal) distribution for continuous features, so it evaluates a PDF for each feature.
  - Generative models (e.g., Variational Autoencoders) rely on PDFs to model data distributions.
- Anomaly Detection:
  - Many anomaly detection techniques assume normality in the data distribution and use PDFs to detect outliers.
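As a rough illustration of the anomaly detection use case, the sketch below fits a normal distribution to one-dimensional data and flags points whose density falls below the density at three standard deviations from the mean. The synthetic data, the two planted outliers, and the three-sigma cutoff are all assumptions chosen for this example, not a recommended procedure.

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(1)  # fixed seed so the run is repeatable

# Synthetic measurements around 50, plus two planted outliers
data = [random.gauss(50.0, 5.0) for _ in range(500)] + [95.0, 5.0]

# Fit a normal distribution using the sample mean and standard deviation
fitted = NormalDist(mean(data), stdev(data))

# Density at 3 standard deviations from the mean serves as the cutoff
threshold = fitted.pdf(fitted.mean + 3 * fitted.stdev)

# Points with density below the cutoff are flagged as potential outliers
outliers = [x for x in data if fitted.pdf(x) < threshold]

print(f"flagged {len(outliers)} of {len(data)} points")
```

For a normal distribution this density threshold is equivalent to flagging points more than three standard deviations from the mean. The outliers themselves inflate the fitted standard deviation, which is one reason robust estimates are often preferred in practice.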
Example in Data Science: Naïve Bayes Classifier
A Gaussian Naïve Bayes classifier assumes that each feature follows a normal distribution:
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Generate synthetic data
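# n_informative and n_redundant are set explicitly: their defaults (2 and 2)
# exceed n_features=2 and make make_classification raise a ValueError
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=2, random_state=42)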
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train Naïve Bayes Classifier
gnb = GaussianNB()
gnb.fit(X_train, y_train)
# Predictions
y_pred = gnb.predict(X_test)
# Accuracy
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
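To show where the PDF enters the classifier, the sketch below scores a point by hand: for each class it adds the log prior to the log of the normal density of every feature, then picks the class with the highest total. The class labels, priors, means, and standard deviations are hypothetical values made up for this example, and scikit-learn's implementation includes additional details (such as variance smoothing) that are not reproduced here.

```python
import math

def log_normal_pdf(x, mu, sigma):
    """Log of the normal density, computed directly to avoid underflow."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2 * math.pi))

# Hypothetical fitted parameters: a prior and one (mean, std) pair per feature
classes = {
    "A": {"prior": 0.5, "params": [(0.0, 1.0), (0.0, 1.0)]},
    "B": {"prior": 0.5, "params": [(3.0, 1.0), (3.0, 1.0)]},
}

def predict(x):
    """Return the class with the highest log prior plus summed log densities."""
    scores = {}
    for label, c in classes.items():
        score = math.log(c["prior"])
        for value, (mu, sigma) in zip(x, c["params"]):
            score += log_normal_pdf(value, mu, sigma)
        scores[label] = score
    return max(scores, key=scores.get)

print(predict([0.2, -0.1]))
print(predict([2.8, 3.3]))
```

Summing log densities rather than multiplying densities keeps the computation numerically stable when there are many features.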