<aside> 📌 By Dr. Nir Regev

</aside>

<aside> 📌 For more tutorials like this, visit Circuit of Knowledge

</aside>

1. Introduction

Information theory, pioneered by Claude Shannon in 1948, provides a mathematical framework for quantifying, storing, and communicating information. This tutorial covers key concepts including Shannon entropy, mutual information, and information gain, which form the basis for understanding more advanced concepts such as the Kullback-Leibler divergence (KLD) and cross-entropy.

2. Shannon Entropy

Shannon entropy quantifies the average amount of information contained in a message. For a discrete random variable X with possible values {x₁, x₂, ..., xₙ} and probability mass function P(X), the Shannon entropy H(X) is defined as:

$$ H(X) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i) $$

Where:

- P(xᵢ) is the probability that X takes the value xᵢ,
- the sum runs over all n possible values of X,
- the base-2 logarithm means the entropy is measured in bits.

2.1. Python implementation:

import numpy as np

def shannon_entropy(p):
    """Compute Shannon entropy of a discrete probability distribution."""
    # Remove zero probabilities
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Example
p = np.array([0.5, 0.25, 0.25])
print(f"Shannon entropy: {shannon_entropy(p):.4f} bits")

Shannon entropy: 1.5000 bits
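
As a quick sanity check, the example distribution can be evaluated by hand, and it matches the printed output:

$$ H(X) = -\left(0.5 \log_2 0.5 + 0.25 \log_2 0.25 + 0.25 \log_2 0.25\right) = 0.5 + 0.5 + 0.5 = 1.5 \text{ bits} $$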

3. Joint Entropy

For two discrete random variables X and Y, the joint entropy H(X,Y) is defined as:

$$ H(X,Y) = -\sum_{x}\sum_{y} P(x,y) \log_2 P(x,y) $$

Where P(x,y) is the joint probability mass function of X and Y.

3.1. Python implementation:

import numpy as np

def joint_entropy(p_xy):
    """Compute joint entropy from a 2D joint probability mass function."""
    # Boolean indexing flattens the 2D array and removes zero probabilities
    p_xy = p_xy[p_xy > 0]
    return -np.sum(p_xy * np.log2(p_xy))

# Example
p_xy = np.array([[0.2, 0.1], [0.3, 0.4]])
print(f"Joint entropy: {joint_entropy(p_xy):.4f} bits")

Joint entropy: 1.8464 bits
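
To see how joint entropy relates to the individual entropies, the minimal sketch below reuses the shannon_entropy and joint_entropy functions defined above (assumed to be available in the same script) to compute the marginal entropies of the example joint distribution and check the subadditivity property H(X,Y) ≤ H(X) + H(Y), which holds with equality only when X and Y are independent.

import numpy as np

# Marginals of the example joint distribution P(x, y)
# (rows index X, columns index Y)
p_xy = np.array([[0.2, 0.1], [0.3, 0.4]])
p_x = p_xy.sum(axis=1)  # P(X): sum over y (columns)
p_y = p_xy.sum(axis=0)  # P(Y): sum over x (rows)

h_x = shannon_entropy(p_x)
h_y = shannon_entropy(p_y)
h_xy = joint_entropy(p_xy)

print(f"H(X) = {h_x:.4f} bits, H(Y) = {h_y:.4f} bits")
print(f"H(X) + H(Y) = {h_x + h_y:.4f} bits >= H(X,Y) = {h_xy:.4f} bits")

For this example, H(X) + H(Y) ≈ 0.8813 + 1.0000 = 1.8813 bits, which exceeds H(X,Y) ≈ 1.8464 bits, reflecting the dependence between X and Y.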