Introduction:

In the introduction, Fisher highlights the importance of using multiple measurements to discriminate between populations or species. He mentions previous applications of this idea in craniometry and in the analysis of secular trends. The main goal of the paper is to illustrate how linear functions of multiple measurements can be used to maximize the distinction between groups, focusing on a taxonomic problem involving iris flower measurements.

To follow along with the examples in Python, we'll first import the necessary libraries:

import numpy as np
import pandas as pd
from scipy import linalg
from scipy.stats import f, norm
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Construct the variables for setosa and versicolor
setosa = X[y == 0]  # Data for Iris setosa
versicolor = X[y == 1]  # Data for Iris versicolor

# Optionally, print the shapes of the arrays to verify
print("Setosa data shape:", setosa.shape)
print("Versicolor data shape:", versicolor.shape)

Output:

Setosa data shape: (50, 4)
Versicolor data shape: (50, 4)

Arithmetic Procedure:

Fisher presents three tables of data:

Table I: Measurements of 4 flower attributes for 50 plants each of 2 iris species (setosa, versicolor)
Table II: Observed means and differences between species for each attribute
Table III: Sums of squares and cross-products of deviations from species means (with 98 degrees of freedom). In modern terms this is the pooled within-class scatter matrix; dividing it by its 98 degrees of freedom gives the pooled covariance matrix.

Table I is already loaded from the dataset; let's create the other two tables in Python:

# Table II
setosa_means = np.mean(setosa, axis=0)
versicolor_means = np.mean(versicolor, axis=0)
d = versicolor_means - setosa_means

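# Table III (up to a constant factor)
# np.cov divides each species' sums of squares and cross-products by n - 1 = 49,
# so S equals Fisher's Table III divided by 49. The constant factor rescales the
# coefficients solved for below but does not change their ratios.
S_setosa = np.cov(setosa, rowvar=False)
S_versicolor = np.cov(versicolor, rowvar=False)
S = S_setosa + S_versicolor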
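As a cross-check on Table III, the sums of squares and cross-products can also be computed directly from the centered data. The sketch below is one way to do this; the helper `sscp` and the names `table_iii`, `cov_sum`, and `dof` are illustrative choices, not taken from Fisher's paper. It assumes that scikit-learn's copy of the iris data matches Fisher's Table I.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
setosa = iris.data[iris.target == 0]
versicolor = iris.data[iris.target == 1]


def sscp(group):
    """Sums of squares and cross-products of deviations from the group mean."""
    centered = group - group.mean(axis=0)
    return centered.T @ centered


# Pooled within-species matrix, on the scale of Fisher's Table III
table_iii = sscp(setosa) + sscp(versicolor)

# Degrees of freedom: (50 - 1) + (50 - 1)
dof = (len(setosa) - 1) + (len(versicolor) - 1)

# np.cov divides each species' matrix by n - 1 = 49, so the two differ by that factor
cov_sum = np.cov(setosa, rowvar=False) + np.cov(versicolor, rowvar=False)

print("Degrees of freedom:", dof)
print("Same matrix up to a factor of 49:", np.allclose(table_iii, 49 * cov_sum))
```

Because the two matrices differ only by a constant, either one can be used to find the direction of the discriminant; only the overall scale of the coefficients changes.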

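The main objective is to find a linear function X = λ1x1 + λ2x2 + λ3x3 + λ4x4 (Fisher's compound measurement, not the data array X loaded earlier) that maximizes the ratio of the squared difference between the species means of X to its within-species variance. Maximizing this ratio leads to a system of linear equations, S λ = d, for the λ coefficients, which are determined only up to a common scale factor: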

lam = linalg.solve(S, d)
print("Discriminant function coefficients:")
print(lam)

Output:

Discriminant function coefficients:
[-1.52638511 -9.01147968 10.88309735 15.42208247]
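One way to sanity-check these coefficients is to evaluate the criterion directly and compare it against other directions. The sketch below is illustrative only: the function `separation`, the unit-length rescaling, and the 1000 random trial directions are assumptions made for the demonstration, not part of Fisher's procedure.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
setosa = iris.data[iris.target == 0]
versicolor = iris.data[iris.target == 1]

d = versicolor.mean(axis=0) - setosa.mean(axis=0)
S = np.cov(setosa, rowvar=False) + np.cov(versicolor, rowvar=False)


def separation(coeffs):
    """Squared mean difference of the compound score divided by a quantity
    proportional to its within-species variance."""
    return (coeffs @ d) ** 2 / (coeffs @ S @ coeffs)


lam = np.linalg.solve(S, d)

# Rescaling the coefficients (here to unit length, an arbitrary choice)
# leaves the criterion unchanged
unit_lam = lam / np.linalg.norm(lam)

# Compare against random directions; none should exceed the solved direction
rng = np.random.default_rng(0)
random_dirs = rng.normal(size=(1000, 4))
best_random = max(separation(v) for v in random_dirs)

print("Criterion at lam:        ", separation(lam))
print("Criterion at rescaled lam:", separation(unit_lam))
print("Best of random directions:", best_random)
```

The first two values agree, which is why the coefficients can be normalized freely, and the random directions never do better than the solved one.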

So, the discriminating linear function is (after normalization):