Which topics does this article cover?

It highlights machine learning, supervised-learning, unsupervised-learning, reinforcement-learning, Python.

Types of Machine Learning Explained: Supervised vs. Unsupervised vs. Reinforcement Learning

Q: What is "Types of Machine Learning Explained: Supervised vs. Unsupervised vs. Reinforcement Learning" about?

Supervised, unsupervised, and reinforcement learning solve entirely different problems. Learn to tell them apart fast, with three real, tested code demos. Same dataset, three different lenses: here's how to tell which kind of machine learning problem you're actually solving, before you write any code.

Day 1 established the core idea: machine learning means learning a function from data instead of writing rules by hand. That raises an obvious next question — learning what, exactly, and from what kind of data? That question splits into four distinct answers, and today is about telling them apart cleanly enough to never confuse them again.

Learning Objectives

By the end of this article, you'll be able to:

Tell apart supervised, unsupervised, reinforcement, and self-supervised learning by what's given to the algorithm — not by which library function you happen to call.
Train a real classifier and a real clustering model on the same dataset, and see exactly what each one does and doesn't "know."
Build a working reinforcement learning agent from scratch, with no library, and watch it learn through trial and reward.
Recognize, in an interview or at work, which type of ML problem you're actually facing before you reach for an algorithm.

Why This Matters

"Which type of machine learning would you use for this?" shows up in almost every ML interview, and most candidates who stumble don't fail on the algorithm — they fail on misreading the problem. Pick the wrong family and the best implementation in the world won't save you.

It matters at work, too. Teams regularly burn weeks trying to gather "correct answers" for a problem that was never about correct answers in the first place — customer segmentation, anomaly detection, content ranking. These are jobs for unsupervised learning or feedback-driven learning, not for hand-labeled examples.

It also sets up everything that comes later in this series. Most of the algorithms introduced in Phases 2 and 3 belong to the supervised learning family — though several Phase 3 topics, like cross-validation and feature engineering, are workflow skills that apply across every type, not supervised-only techniques. Phase 6 — LLMs, fine-tuning, RLHF — leans directly on the self-supervised and reinforcement learning ideas you'll meet for the first time today. Get the shape of these four problem types right now, and the rest of the series slots into a structure instead of arriving as forty unrelated topics.

Mental Model

Picture four different ways of learning your way around a new city:

Supervised learning — someone hands you a fully labeled map, with every street name written on it, and you study it until you can navigate confidently.
Unsupervised learning — you wander the city with no map and no labels at all, but you start noticing patterns anyway: "these blocks feel like downtown, these feel residential." You're grouping things by similarity, with nobody telling you the group names.
Reinforcement learning — you have no map and no labels, but every turn you take either gets you closer to your destination or further away, and a "closer/further" signal is all you have to adjust your route over many trips.
Self-supervised learning — you don't have a map, but you do have city photos. You cover up random street signs in your own photos and quiz yourself on guessing them from the surrounding buildings and landmarks — turning the city itself into both the question and the answer key.

Hold onto the first three especially. Every hands-on example in this article maps directly onto one of them — the fourth doesn't have a clean, runnable training example for Day 2, so we'll treat it conceptually here and come back to it with real code once we reach Phase 6.

History in 60 Seconds

1957 — Frank Rosenblatt's Perceptron: the first trainable model that learns a decision boundary from labeled examples — the ancestor of supervised learning as we know it.
1959 — Arthur Samuel, whose checkers program improves itself through self-play, coins the term "machine learning" and previews reinforcement learning decades early.
1982 — Teuvo Kohonen's self-organizing maps give unsupervised learning one of its first serious neural approaches to finding structure in unlabeled data.
1989 — Chris Watkins formalizes Q-learning, an algorithm that extends the incremental update rule you'll implement by hand later in this article from single pulls to sequences of actions.
1992 — Gerald Tesauro's TD-Gammon learns backgammon at a strong human level using reinforcement learning, decades before AlphaGo.
2012 — AlexNet wins ImageNet, cementing supervised deep learning as the dominant approach to perception tasks for the next several years.
2016 — AlphaGo, combining deep learning with reinforcement learning, beats Lee Sedol at Go — a result many researchers expected to take another decade.
2018 onward — BERT and GPT popularize self-supervised pretraining: the model invents its own labels from raw text, removing the human-labeling bottleneck almost entirely.

Key Terminology Table

Term	Meaning
Supervised learning	Learning a mapping from inputs to known, human-provided correct outputs
Unsupervised learning	Finding structure or groupings in data with no provided labels
Reinforcement learning	Learning a strategy by acting in an environment and receiving rewards or penalties
Self-supervised learning	Generating training labels automatically from the data itself, with no human labeling
Agent	The decision-maker in a reinforcement learning problem
Environment	Everything the agent interacts with and receives feedback from
Reward signal	The numeric feedback an RL agent uses to judge whether an action was good
Policy	The agent's strategy for choosing actions, given what it currently knows
Cluster	A group of data points an unsupervised algorithm judges to be similar to each other
Exploration vs. exploitation	The tension between trying new actions to learn more, and using the best action found so far

Core Concepts

Supervised Learning: Learning From Answered Examples

In supervised learning, every training example comes with the correct answer attached. The algorithm's job is to find a function f that maps inputs x to outputs y, by minimizing how wrong its predictions are across all the labeled examples it's shown.

That "how wrong" has a name: a loss function. For classification, a common one is cross-entropy loss; conceptually, it's just a number that gets bigger the further a prediction is from the true label, and the entire training process is nothing more than a search for the parameters that make that number as small as possible.
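To make "a number that gets bigger the further a prediction is from the true label" concrete, here is a minimal sketch of cross-entropy for a single example. The function name `cross_entropy` and the three probability vectors are made up for illustration; real libraries average this over the whole dataset and add numerical safeguards.

```python
import math

def cross_entropy(predicted_probs, true_class):
    """Cross-entropy for one example: minus the log of the probability
    the model assigned to the correct class."""
    return -math.log(predicted_probs[true_class])

# Hypothetical predictions for a flower whose true class is index 0.
confident_right = [0.90, 0.05, 0.05]
unsure          = [0.40, 0.30, 0.30]
confident_wrong = [0.05, 0.90, 0.05]

for name, probs in [("confident and right", confident_right),
                    ("unsure", unsure),
                    ("confident and wrong", confident_wrong)]:
    print(f"{name:20s} loss = {cross_entropy(probs, true_class=0):.3f}")
```

The loss climbs from about 0.105 to 0.916 to 2.996 as the probability given to the correct class drops from 0.90 to 0.40 to 0.05, so being confidently wrong is punished far more than being unsure.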

Two flavors exist depending on what kind of answer you're predicting:

Classification — the answer is a category (spam/not spam, which species of flower).
Regression — the answer is a number on a continuous scale (a price, a temperature).
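A quick sketch of the two flavors side by side, using the Iris measurements. The choice of petal width as the regression target is arbitrary and only for illustration; it is not part of this article's main script.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression

iris = load_iris()
X, species = iris.data, iris.target

# Classification: the target is a category (species index 0, 1, or 2).
clf = LogisticRegression(max_iter=200).fit(X, species)
predicted_category = clf.predict(X[:1])
print("Predicted category:", predicted_category)

# Regression: the target is a continuous number. Here we (arbitrarily)
# predict petal width in cm (column 3) from the other three measurements.
reg = LinearRegression().fit(X[:, :3], X[:, 3])
predicted_number = reg.predict(X[:1, :3])
print("Predicted number:  ", predicted_number)
```

Both models are supervised, because both are trained on known answers. The only difference is whether that answer is a category or a number.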

Unsupervised Learning: Learning From Unlabeled Examples

Here, there's no correct answer provided at all — just raw data, and a goal of finding structure inside it. The most common version is clustering: grouping data points so that points inside a group are more similar to each other than to points in other groups.

The classic algorithm, k-means, defines "similar" mathematically: it tries to choose cluster centers so that the total squared distance between each point and its assigned center is as small as possible. There's no label-checking anywhere in that objective — it's purely about geometric closeness in your feature space.
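If you want to see that objective as an actual number, the sketch below recomputes it by hand and compares it with what scikit-learn reports in `inertia_`. It assumes the same `KMeans` settings used later in this article; the variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data  # measurements only; the species labels in .target are never used

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X)

# Recompute the objective by hand: squared distance from each point
# to the center of the cluster it was assigned to, summed over all points.
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
manual_objective = np.sum((X - assigned_centers) ** 2)

print(f"Computed by hand: {manual_objective:.3f}")
print(f"kmeans.inertia_:  {kmeans.inertia_:.3f}")
```

The two numbers should agree, and nothing in the calculation ever touches a label.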

This is also why you can't grade unsupervised learning the way you grade supervised learning. There's no "correct" clustering to compare against, only "useful" or "not useful" ones — a distinction you'll feel directly in this article's hands-on example.

Reinforcement Learning: Learning From Consequences

Reinforcement learning throws out the idea of a fixed dataset entirely. Instead, you have an agent that takes actions inside an environment, and after each action, the environment returns a reward (or penalty) and a new situation. The agent's goal is to find a policy — a strategy for picking actions — that maximizes its total reward over time, not just the reward from the next single action.

The simplest possible version of this is a multi-armed bandit: imagine several slot machines, each with an unknown, fixed probability of paying out. You don't know which machine is best. Every pull teaches you a little more, but every pull spent on a bad machine is also a pull you can't spend on testing a better one. That's the exploration-exploitation tradeoff, and it's the central tension in all of reinforcement learning, not just bandits.

The learning rule we'll implement is almost embarrassingly simple. After every pull of arm a, update your estimate of that arm's value:

new_estimate = old_estimate + (reward - old_estimate) / times_pulled

That's it — no neural network, no gradient descent. You nudge your estimate toward whatever reward you just observed, by an amount that shrinks as you gather more evidence. Q-learning and most of modern reinforcement learning are elaborations on this exact idea, applied to sequences of actions instead of single pulls.
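One way to convince yourself the rule is doing something sensible: it is just a running average computed one observation at a time. The sketch below feeds it a made-up stream of 0/1 rewards (the 0.5 payout rate and the seed are arbitrary) and checks the result against a plain average.

```python
import random

random.seed(0)
# Hypothetical reward stream from a single arm that pays out about half the time.
rewards = [1 if random.random() < 0.5 else 0 for _ in range(1000)]

estimate = 0.0
for times_pulled, reward in enumerate(rewards, start=1):
    # The update rule from the text, applied one pull at a time.
    estimate += (reward - estimate) / times_pulled

plain_average = sum(rewards) / len(rewards)
print(f"Incremental estimate: {estimate:.6f}")
print(f"Plain average:        {plain_average:.6f}")
```

The advantage of the incremental form is that the agent never has to store the full reward history, only the current estimate and a pull count.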

Self-Supervised Learning: Manufacturing Your Own Labels

This one trips people up because it sounds like unsupervised learning, but it isn't. Self-supervised learning still trains on input-output pairs, exactly like supervised learning — the difference is that the labels are generated automatically from the raw data, with no human involved.

The classic example: take a sentence, hide one word, and train a model to predict the hidden word from the words around it. The "label" is just the word you hid — manufactured for free from text that was lying around unlabeled. Do this across billions of sentences and you get a model that has learned a great deal about language structure, without anyone hand-labeling a single example.
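Here is a toy sketch of that label-manufacturing step. The helper `make_masked_pairs` and the `[MASK]` placeholder are illustrative names, and hiding every word in turn is a simplification: real pretraining pipelines typically mask a random subset of tokens instead.

```python
def make_masked_pairs(sentence, mask_token="[MASK]"):
    """Turn one unlabeled sentence into (input, label) training pairs
    by hiding each word in turn."""
    words = sentence.split()
    pairs = []
    for i, hidden_word in enumerate(words):
        masked = words[:i] + [mask_token] + words[i + 1:]
        pairs.append((" ".join(masked), hidden_word))
    return pairs

pairs = make_masked_pairs("the cat sat on the mat")
for model_input, label in pairs:
    print(f"input: {model_input!r:32} label: {label!r}")
```

One six-word sentence yields six labeled examples, and no human wrote any of the labels.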

This is exactly why self-supervised learning matters as much as it does right now: supervised learning's biggest constraint is always the supply of human-labeled data, and the internet has an effectively unlimited supply of raw, unlabeled text but a comparatively tiny supply of text labeled the way supervised learning needs. Self-supervised pretraining sidesteps that constraint entirely — it's the foundational training method behind most modern language models, and it's the reason those models could be trained at a scale human labeling could never have supported. We'll build on this directly in Phase 6.

Visual Explanations

The fastest way to scan all four types at once:

Type	Labels?	Goal	Example
Supervised	Yes, human-provided	Predict a known answer	Spam detection
Unsupervised	No	Discover structure	Customer segmentation
Reinforcement	No — a reward signal instead	Learn a strategy	Game-playing agents
Self-supervised	Auto-generated from the data	Learn a representation	LLM pretraining

Here's the shape of what's "given" versus what's "learned" for each type:

flowchart TD
    Sup[Supervised Learning] --> SupQ["Given: inputs + correct answers
Learn: a mapping from input to answer"]
    Uns[Unsupervised Learning] --> UnsQ["Given: inputs only
Learn: hidden structure or grouping"]
    Rein[Reinforcement Learning] --> ReiQ["Given: an environment + reward signal
Learn: a strategy that maximizes reward"]
    Self[Self-Supervised Learning] --> SelQ["Given: raw unlabeled data
Learn: by predicting parts of the data from other parts"]

If you're facing a new problem and aren't sure which type it is, walk through this:

flowchart TD
    Q1{Do you have human-provided<br/>correct answers for examples?} -->|Yes| Sup3[Supervised Learning]
    Q1 -->|No| Q2{Can you generate labels<br/>automatically from the raw data?}
    Q2 -->|Yes| Self3[Self-Supervised Learning]
    Q2 -->|No| Q3{Do you have an environment<br/>that returns rewards over time?}
    Q3 -->|Yes| Rein3[Reinforcement Learning]
    Q3 -->|No, just raw data,<br/>looking for structure| Uns3[Unsupervised Learning]
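The same walk-through can be written as a few lines of code. This is only a sketch of the flowchart's questions in order; the function name and its three boolean parameters are hypothetical, and real problems often need more nuance than three yes/no answers.

```python
def diagnose_problem_type(has_human_labels, can_auto_generate_labels,
                          has_reward_environment):
    """Ask the flowchart's questions in order and return the first match."""
    if has_human_labels:
        return "supervised"
    if can_auto_generate_labels:
        return "self-supervised"
    if has_reward_environment:
        return "reinforcement"
    return "unsupervised"  # just raw data, looking for structure

# Spam detection: emails already marked spam / not spam by people.
print(diagnose_problem_type(True, False, False))
# Customer segmentation: purchase histories, no predefined groups.
print(diagnose_problem_type(False, False, False))
```

The order of the checks matters: it mirrors the flowchart, so a problem with human labels is classified as supervised even if rewards or auto-generated labels are also available.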

The reinforcement learning loop specifically looks different from the other three — there's no fixed dataset, just a continuous cycle:

flowchart LR
    A[Agent] -->|chooses an action| E[Environment]
    E -->|new state| A
    E -->|reward or penalty| A
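In code, that cycle is just a loop. The sketch below uses a made-up toy environment (a five-position corridor with the goal at the far end) and an agent that acts at random, so nothing is learned yet. It only shows the shape of the interaction: action in, new state and reward out.

```python
import random

class Corridor:
    """Hypothetical toy environment: positions 0..4, goal at position 4."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        # action is -1 (left) or +1 (right); the walls clamp the position.
        self.state = max(0, min(4, self.state + action))
        reward = 1 if self.state == 4 else 0
        return self.state, reward

random.seed(0)
env = Corridor()
state, total_reward = env.state, 0
for t in range(20):
    action = random.choice([-1, 1])   # a random policy: no learning yet
    state, reward = env.step(action)  # environment answers with new state + reward
    total_reward += reward
    print(f"t={t:2d} action={action:+d} -> state={state} reward={reward}")
```

A learning agent would replace the `random.choice` line with a policy that improves as rewards come in. The bandit in this article's hands-on section is the special case where there is only one state.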

And here's the actual plan for today's hands-on section — one dataset, viewed through two different lenses, plus a separate environment for the third:

flowchart LR
    Iris[Iris flower dataset] --> WL[Species labels visible] --> Sup2[Supervised: train a classifier]
    Iris --> WoL[Species labels hidden] --> Uns2[Unsupervised: run clustering]
    Bandit[Simulated slot machines] --> RL2[Reinforcement: learn through trial and reward]

Hands-On Example

We're going to look at the same flower-measurement dataset — the Iris dataset, 150 flowers measured by petal and sepal length and width — through two completely different lenses, then switch problems entirely for the third.

With species labels visible, we'll train a classifier to predict species from measurements. This is supervised learning — the textbook case.
With species labels hidden, we'll run a clustering algorithm and see whether it can rediscover the species groupings on its own, using nothing but the measurements. This is unsupervised learning.
Switching problems entirely, we'll simulate a row of slot machines with unknown payout rates and build an agent that learns, purely through trial and reward, which one pays best. This is reinforcement learning — and notice it doesn't use the flower dataset at all. That's deliberate: RL isn't a different algorithm applied to the same kind of dataset, it's a fundamentally different kind of problem.

Environment Setup

python -m venv venv

# macOS / Linux
source venv/bin/activate
# Windows
venv\Scripts\activate

pip install numpy pandas scikit-learn

Complete Working Code

Every number quoted in the breakdown below came from actually running this script — nothing here is invented.

"""
day_02_types_of_ml.py
Three problem types, demonstrated honestly: a real supervised classifier,
a real unsupervised clustering run, and a reinforcement learning agent
built from scratch.
"""

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, adjusted_rand_score

iris = load_iris()
X, y = iris.data, iris.target

# --- 1. SUPERVISED LEARNING: species labels are visible during training ---
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
clf = LogisticRegression(max_iter=200)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
sup_accuracy = accuracy_score(y_test, preds)

print("=== SUPERVISED ===")
print(f"Test accuracy: {sup_accuracy:.3f}")

# --- 2. UNSUPERVISED LEARNING: species labels are never shown to the model ---
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
cluster_assignments = kmeans.fit_predict(X)  # note: y is never passed in here
ari = adjusted_rand_score(y, cluster_assignments)  # y used only to *score* afterward

print("\n=== UNSUPERVISED ===")
print(f"Adjusted Rand Index (cluster quality vs. true species): {ari:.3f}")

# --- 3. REINFORCEMENT LEARNING: a multi-armed bandit, learned from scratch ---
np.random.seed(42)
true_reward_probs = [0.20, 0.50, 0.35]   # unknown to the agent
n_arms = len(true_reward_probs)
n_rounds = 2000
epsilon = 0.1                            # 10% of the time, explore at random

estimated_values = np.zeros(n_arms)
pull_counts = np.zeros(n_arms)
total_reward = 0

for t in range(n_rounds):
    if np.random.rand() < epsilon:
        arm = np.random.randint(n_arms)          # explore
    else:
        arm = np.argmax(estimated_values)        # exploit the best-known arm

    reward = 1 if np.random.rand() < true_reward_probs[arm] else 0
    pull_counts[arm] += 1
    estimated_values[arm] += (reward - estimated_values[arm]) / pull_counts[arm]
    total_reward += reward

print("\n=== REINFORCEMENT LEARNING ===")
print(f"True reward probabilities:   {true_reward_probs}")
print(f"Learned value estimates:     {np.round(estimated_values, 3).tolist()}")
print(f"Times each arm was pulled:   {pull_counts.astype(int).tolist()}")
print(f"Total reward: {total_reward} / {n_rounds} rounds")

Running this prints:

=== SUPERVISED ===
Test accuracy: 0.933

=== UNSUPERVISED ===
Adjusted Rand Index (cluster quality vs. true species): 0.730

=== REINFORCEMENT LEARNING ===
True reward probabilities:   [0.2, 0.5, 0.35]
Learned value estimates:     [0.184, 0.491, 0.314]
Times each arm was pulled:   [234, 1696, 70]
Total reward: 898 / 2000 rounds

Code Breakdown

The supervised block. train_test_split(..., stratify=y) keeps the species ratio consistent in both splits. LogisticRegression().fit(X_train, y_train) is the entire "training" step — internally it's searching for parameters that minimize prediction error on the labeled training examples. Result: 93.3% accuracy on flowers the model never trained on.

The unsupervised block. Notice kmeans.fit_predict(X) never receives y — the true species labels are completely invisible to the algorithm during fitting. We only use y afterward, in adjusted_rand_score, purely to grade how well the clusters the algorithm found happen to line up with species. An Adjusted Rand Index of 0.73 (1.0 would be a perfect match, 0.0 would be random) tells us k-means found real structure in the measurements — but imperfectly, because it was never trying to match species in the first place. It was only ever trying to minimize within-cluster distance. That gap between "found real structure" and "matched our human categories exactly" is the central, honest limitation of unsupervised learning.
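A small, hand-built example shows why the script scores clusters with `adjusted_rand_score` rather than `accuracy_score`. The nine labels below are invented for illustration: the grouping is identical to the true one, but the cluster IDs are numbered differently, which is exactly what happens with k-means because its cluster numbers are arbitrary.

```python
from sklearn.metrics import accuracy_score, adjusted_rand_score

true_species = [0, 0, 0, 1, 1, 1, 2, 2, 2]
# Identical grouping, but the cluster IDs happen to be numbered differently.
cluster_ids = [2, 2, 2, 0, 0, 0, 1, 1, 1]

acc = accuracy_score(true_species, cluster_ids)
ari = adjusted_rand_score(true_species, cluster_ids)
print("accuracy:", acc)
print("ARI:     ", ari)
```

Accuracy comes out as 0.0 because no ID matches position by position, while the ARI is 1.0 because it only asks whether the same points ended up grouped together.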

The reinforcement learning block. np.argmax(estimated_values) is the "exploit" move — always picking the arm that currently looks best. The np.random.rand() < epsilon check is the "explore" move, firing 10% of the time regardless of what looks best so far. The update line is the incremental average formula from Core Concepts, applied live. Look at the final pull counts: [234, 1696, 70]. The agent pulled arm 1 (true payout rate 0.50, the actual best arm) over 1,600 times out of 2,000 — and its learned estimate for that arm, 0.491, lands almost exactly on the true probability, 0.50, despite the agent never being told what that true probability was.

Common Mistakes

Assuming every ML problem needs labels. If you find yourself trying to manufacture "correct answers" for a segmentation or grouping problem, stop — that's very often a sign you actually have an unsupervised problem.
Judging clustering by "accuracy" against labels it never saw. An ARI of 0.73 isn't a failing grade. The algorithm was never optimizing to match your labels — it was optimizing geometric closeness. Don't grade a model on a goal it was never given.
Treating reinforcement learning like supervised learning. There's no "correct action" given to an RL agent at each step — only a reward after the fact. Beginners often expect RL training to look like a labeled dataset, then get confused when it doesn't.
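Forgetting to explore. Set epsilon = 0 in the code above and the agent locks onto whichever arm pays out first, even if that arm isn't actually the best one. With every estimate starting at zero, np.argmax breaks the tie by picking arm 0, the worst of the three, and never leaves it: a textbook case of premature exploitation.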
Confusing self-supervised with unsupervised. Both skip human labels, but self-supervised learning still trains on input-output pairs — the labels are just generated automatically from the data. This distinction comes up constantly in interviews about LLM pretraining.
Picking the number of clusters and treating it as "discovered." n_clusters=3 was a choice we made because we happened to know there were 3 species. In a real unsupervised problem, you don't get to peek at the answer — choosing k honestly is its own (important) skill, covered later in this series.
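The "forgetting to explore" mistake is easy to reproduce. The sketch below wraps the article's bandit loop in a hypothetical helper, `run_bandit`, and runs it with and without exploration. It uses `np.random.default_rng` rather than `np.random.seed`, so its numbers will not match the output printed earlier.

```python
import numpy as np

def run_bandit(epsilon, true_reward_probs, n_rounds=2000, seed=42):
    """Epsilon-greedy bandit; returns how many times each arm was pulled."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_reward_probs)
    estimated_values = np.zeros(n_arms)
    pull_counts = np.zeros(n_arms, dtype=int)
    for _ in range(n_rounds):
        if rng.random() < epsilon:
            arm = int(rng.integers(n_arms))          # explore
        else:
            arm = int(np.argmax(estimated_values))   # exploit (ties go to the lowest index)
        reward = 1 if rng.random() < true_reward_probs[arm] else 0
        pull_counts[arm] += 1
        estimated_values[arm] += (reward - estimated_values[arm]) / pull_counts[arm]
    return pull_counts

probs = [0.20, 0.50, 0.35]
greedy_counts = run_bandit(epsilon=0.0, true_reward_probs=probs)
exploring_counts = run_bandit(epsilon=0.1, true_reward_probs=probs)
print("epsilon=0.0 pulls per arm:", greedy_counts.tolist())
print("epsilon=0.1 pulls per arm:", exploring_counts.tolist())
```

With epsilon at zero, the agent spends all 2,000 pulls on arm 0: its estimate becomes positive after the first payout, the other two estimates stay at zero because they are never tried, and `np.argmax` has no reason to switch.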

Best Practices

Diagnose the problem type before touching an algorithm. The decision framework in the Visual Explanations section above takes about ten seconds to run through — and ten seconds spent there can save you weeks spent building the wrong kind of system.
Pair quantitative and qualitative checks for unsupervised work. When ground truth happens to exist (like species labels here), use it to sanity-check your clusters — but don't treat it as the objective the algorithm was solving for.
Tune exploration deliberately in RL, don't default to a fixed epsilon forever. Many production bandit systems decay epsilon over time: explore heavily early, exploit more as confidence grows.
Start with the simplest version of each problem type before reaching for its deep-learning cousin. Logistic regression, k-means, and a tabular bandit solve a surprising number of real business problems before you ever need a neural network.
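As one illustration of the decaying-epsilon advice, here is a minimal exponential schedule. The function name and the `start`, `floor`, and `decay` values are assumptions chosen for the example, not recommended settings; production systems pick their own schedule.

```python
def decayed_epsilon(t, start=1.0, floor=0.01, decay=0.995):
    """Hypothetical exponential schedule: start high, shrink every round,
    and never drop below `floor` so some exploration always remains."""
    return max(floor, start * decay ** t)

for t in [0, 100, 500, 1000, 1999]:
    print(f"round {t:4d}: epsilon = {decayed_epsilon(t):.3f}")
```

To try it in the bandit loop, you would replace the fixed `epsilon` in the explore check with `decayed_epsilon(t)`. Keeping a nonzero floor matters when payout rates can drift over time, because an agent that stops exploring entirely can never notice the change.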

Production Perspective

Supervised learning in production is the overwhelming majority of deployed ML — fraud scoring, spam filtering, ranking, demand forecasting. The real bottleneck is almost never the algorithm; it's the labeling pipeline. Someone (or some process) has to keep producing correct answers, continuously, as the underlying behavior shifts — and label quality monitoring is its own ongoing job.

Unsupervised learning in production has a much harder evaluation story. There's no ground truth in production to check against, so teams typically pair the model's output with periodic human review (does this customer segmentation still make business sense?) and watch for cluster drift as behavior changes over time, rather than relying on a single accuracy number the way supervised systems do.

Reinforcement learning in production is rare, and for a good reason: a real RL agent learns by acting, which means early in training it will make bad decisions on real users or real systems. Most companies avoid this by training in a simulator first, or — far more commonly — using a lighter-weight cousin of RL called a contextual bandit, which makes one-shot decisions (which ad to show, which headline to test) instead of long sequences of decisions, and is dramatically cheaper and safer to run live.

The bottleneck moves, it never disappears. Supervised learning's bottleneck is labeling. Unsupervised learning's bottleneck is evaluation. Reinforcement learning's bottleneck is safe exploration. Knowing which bottleneck you're signing up for, before you commit to a problem type, is the difference between an engineer who ships and one who's still arguing about algorithms in week three.

Cost shows up differently in each. Supervised learning's biggest cost is usually human labeling. Unsupervised learning skips that cost but spends it instead on evaluation effort. Reinforcement learning's biggest cost is often the simulator or the "cost of exploring badly" on real traffic.

Real-World Applications

Supervised — credit scoring, medical image classifiers, the spam filter from Day 1, house price prediction.
Unsupervised — customer segmentation for marketing teams, anomaly detection in network security logs, topic discovery across large document collections.
Reinforcement learning — game-playing agents (AlphaGo, Atari-playing systems), robotics control, datacenter cooling optimization, and — closer to most engineers' daily reality — contextual bandits powering ad and content ranking.
Self-supervised — pretraining for today's large language models and vision models, which we'll cover directly in Phase 6.

Interview Questions

1. What's the core difference between supervised and unsupervised learning, in terms of what's given to the algorithm? Supervised learning is given labeled input-output pairs and learns a mapping between them. Unsupervised learning is given only inputs and must find structure — like groupings — on its own.

2. How would you decide whether a business problem needs supervised or unsupervised learning? Ask whether you have, or can realistically obtain, correct-answer labels for your training examples. If yes, and the goal is to predict that label for new data, it's supervised. If you're instead trying to discover natural groupings or structure with no predefined answer, it's unsupervised.

3. What is the exploration-exploitation tradeoff, and why does it matter? It's the tension between trying new, uncertain actions to gather more information (exploration) and using the action currently believed to be best (exploitation). Too much exploration wastes opportunities on known-bad options; too little risks settling on a mediocre option because you never tested the alternatives enough to find something better.

4. Why can't you measure "accuracy" for a clustering model the same way you do for a classifier? There's no ground-truth label the clustering algorithm was trying to match — it was optimizing a geometric objective like within-cluster distance. Metrics like the Adjusted Rand Index can compare clusters to known categories when those happen to exist, but that's a diagnostic check, not the model's actual training objective.

5. How is self-supervised learning different from unsupervised learning, given that neither uses human-provided labels? Self-supervised learning still trains on input-output pairs — it just generates the labels automatically from the data itself (for example, predicting a hidden word from its context). Unsupervised learning has no labels of any kind, generated or human-provided; it's purely about discovering structure.

6. Give a real example where reinforcement learning would be a poor choice, even though it's technically applicable. Any setting where bad early decisions are costly or irreversible and no good simulator exists — for example, directly RL-training a pricing policy on live customers with no sandbox first, where a long stretch of "exploration" could mean real financial losses or reputational damage before the agent learns anything useful.

7. What does a reward signal need to satisfy for an RL problem to be well-posed? It needs to be measurable, attributable to the agent's actions (even if delayed), and aligned with the actual long-term goal — a poorly designed reward (one that's easy to maximize without achieving the real objective) is a classic, hard-to-debug RL failure mode.

8. How do contextual bandits relate to full reinforcement learning, and why are they more common in production? A contextual bandit makes a single decision per round based on the current context, with no concept of a longer multi-step sequence affecting future state. Full RL handles sequential decisions where actions affect future situations. Bandits are simpler to reason about, faster to validate, and lower-risk to run on live traffic, which is why they're far more common in production recommendation and ad systems than full RL.

Self-Assessment

Can you explain, in your own words, why you can't grade a clustering model's "accuracy" the same way you grade a classifier?
Can you implement the exploration-exploitation logic from this article's bandit, from scratch, without looking at the code again?
Can you name a real-world problem at your own company or project that's secretly unsupervised, even though someone is currently trying to solve it with labels?
Can you explain why self-supervised learning isn't simply "unsupervised learning with extra steps"?

Portfolio Challenges

Beginner Challenge

Re-run the supervised vs. unsupervised comparison from this article using the Wine or Breast Cancer dataset (both load instantly via sklearn.datasets) instead of Iris. Report the classifier's test accuracy and the clustering's Adjusted Rand Index, and write two sentences on whether clustering found the "natural" groups as cleanly as it did for Iris.

Intermediate Challenge

Extend the bandit to 5 arms with reward probabilities of your choosing, and run the experiment three times with epsilon = 0.01, 0.1, and 0.5. Plot or tabulate total reward for each setting, and explain in your own words why the middle value usually wins.

Advanced Challenge

Implement a simple contextual bandit: give each round a random binary "context" (0 or 1), and make each arm's true reward probability depend on that context (so the best arm differs depending on context). Adapt the agent so it learns a separate value estimate per (context, arm) pair, and compare its total reward against a context-blind bandit run on the same data.

Advanced Insights

The four categories in this article aren't fully separate in modern practice; they're increasingly combined. The most consequential example right now is RLHF (Reinforcement Learning from Human Feedback), the technique used to align large language models with human preferences. It's literally reinforcement learning, with an agent (the language model), actions (generated text), and a reward signal (a learned model of human preference), applied on top of a model that was originally trained with self-supervised learning. The reward model itself is trained with supervised learning on human preference labels, so today's most capable AI systems are built by stacking several of the learning types from this article on top of each other, in sequence.

It's also worth knowing that semi-supervised learning — training with a small amount of labeled data plus a large amount of unlabeled data — exists as a practical middle ground, common in domains where labeling is expensive but raw data is abundant (medical imaging is a frequent example). We'll touch on it again once we reach feature engineering and model evaluation in Phase 3.

Key Takeaways

The four ML types are distinguished by what's given to the algorithm, not by which algorithm or library is used: labeled pairs (supervised), unlabeled data alone (unsupervised), an environment with rewards (reinforcement), or self-generated labels from raw data (self-supervised).
You cannot evaluate unsupervised learning the way you evaluate supervised learning — there's no ground truth the algorithm was optimizing toward.
The exploration-exploitation tradeoff is the defining challenge of reinforcement learning, and it shows up the moment you set an exploration rate to zero.
Self-supervised learning is not the same as unsupervised learning — it still has labels, just machine-generated ones.
Production reality differs sharply across types: supervised learning's bottleneck is labeling, unsupervised learning's bottleneck is evaluation, and reinforcement learning's bottleneck is safe exploration.

What's Next

Day 3 moves one level deeper: Features and Labels — what actually makes a feature "good," why label quality usually matters more than algorithm choice, and how to spot a feature that's secretly leaking the answer to your model before you ever ship it.

References

Beginner

Google: Machine Learning Crash Course — covers all four learning types this article introduces, with interactive exercises.
scikit-learn: User Guide — official documentation for the LogisticRegression and KMeans implementations used in this article's code.

Intermediate

Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An Introduction to Statistical Learning — the authors provide the full text free at statlearning.com, with strong coverage of the statistical foundations behind supervised learning.

Professional

Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (O'Reilly) — a thorough, code-first treatment of everything in this article and well beyond it. Available through O'Reilly and major booksellers.

Advanced

Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction (2nd edition, MIT Press) — the field's canonical text, and the direct source for the bandit algorithm built in this article. The authors host the full text on Sutton's official book page: incompleteideas.net/book/the-book-2nd.html.

Mental Model

Picture four different ways of learning your way around a new city:

Supervised learning — someone hands you a fully labeled map, with every street name written on it, and you study it until you can navigate confidently.
Unsupervised learning — you wander the city with no map and no labels at all, but you start noticing patterns anyway: "these blocks feel like downtown, these feel residential." You're grouping things by similarity, with nobody telling you the group names.
Reinforcement learning — you have no map and no labels, but every turn you take either gets you closer to your destination or further away, and a "closer/further" signal is all you have to adjust your route over many trips.
Self-supervised learning — you don't have a map, but you do have city photos. You cover up random street signs in your own photos and quiz yourself on guessing them from the surrounding buildings and landmarks — turning the city itself into both the question and the answer key.

History in 60 Seconds

1957 — Frank Rosenblatt's Perceptron: the first trainable model that learns a decision boundary from labeled examples — the ancestor of supervised learning as we know it.
1959 — Arthur Samuel's checkers program improves itself through self-play. Samuel's paper on it popularizes the term "machine learning" and previews reinforcement learning decades early.
1982 — Teuvo Kohonen's self-organizing maps give unsupervised learning one of its first serious neural approaches to finding structure in unlabeled data.
1989 — Chris Watkins formalizes Q-learning. Its incremental update (nudge an estimate toward each newly observed reward) is the same idea behind the simpler bandit update you'll implement by hand later in this article.
1992 — Gerald Tesauro's TD-Gammon learns backgammon at a strong human level using reinforcement learning, decades before AlphaGo.
2012 — AlexNet wins ImageNet, cementing supervised deep learning as the dominant approach to perception tasks for the next several years.
2016 — AlphaGo, combining deep learning with reinforcement learning, beats Lee Sedol at Go — a result many researchers expected to take another decade.
2018 onward — BERT and GPT popularize self-supervised pretraining: the model derives its own labels from raw text, which removes the human-labeling bottleneck from the pretraining stage almost entirely.

Key Terminology Table

Term	Meaning
Supervised learning	Learning a mapping from inputs to known, human-provided correct outputs
Unsupervised learning	Finding structure or groupings in data with no provided labels
Reinforcement learning	Learning a strategy by acting in an environment and receiving rewards or penalties
Self-supervised learning	Generating training labels automatically from the data itself, with no human labeling
Agent	The decision-maker in a reinforcement learning problem
Environment	Everything the agent interacts with and receives feedback from
Reward signal	The numeric feedback an RL agent uses to judge whether an action was good
Policy	The agent's strategy for choosing actions, given what it currently knows
Cluster	A group of data points an unsupervised algorithm judges to be similar to each other
Exploration vs. exploitation	The tension between trying new actions to learn more, and using the best action found so far

Core Concepts

Supervised Learning: Learning From Answered Examples

Two flavors exist depending on what kind of answer you're predicting:

Classification — the answer is a category (spam/not spam, which species of flower).
Regression — the answer is a number on a continuous scale (a price, a temperature).
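
A minimal sketch of the two flavors, assuming a toy 1-nearest-neighbor lookup and made-up petal measurements. The helper `nearest_answer` and every number here are illustrative and do not come from the article's script. The lookup code is identical in both cases; only the kind of answer stored with each example changes.

```python
def nearest_answer(examples, query):
    """Return the answer attached to the training input closest to `query`."""
    _, answer = min(examples, key=lambda pair: abs(pair[0] - query))
    return answer

# Classification: each petal length is paired with a category.
species_by_petal_length = [(1.4, "setosa"), (4.5, "versicolor"), (5.8, "virginica")]
predicted_species = nearest_answer(species_by_petal_length, 4.9)

# Regression: each petal length is paired with a number on a continuous scale.
petal_width_by_petal_length = [(1.4, 0.2), (4.5, 1.5), (5.8, 2.2)]
predicted_width = nearest_answer(petal_width_by_petal_length, 4.9)

print(predicted_species)  # a category
print(predicted_width)    # a number
```

Both calls are supervised learning, because both rely on answers that were supplied alongside the inputs.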

Unsupervised Learning: Learning From Unlabeled Examples

Reinforcement Learning: Learning From Consequences

The learning rule we'll implement is almost embarrassingly simple. After every pull of arm a, update your estimate of that arm's value:

new_estimate = old_estimate + (reward - old_estimate) / times_pulled
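
To see why this rule works, here is a small check that the incremental update lands on the plain average of every reward seen so far, without storing the reward history. The reward list is hypothetical.

```python
rewards = [1, 0, 0, 1, 1, 0, 1]  # hypothetical rewards from pulling one arm

estimate = 0.0
times_pulled = 0
for reward in rewards:
    times_pulled += 1
    # Move the estimate toward the new reward by a step of 1 / times_pulled.
    estimate += (reward - estimate) / times_pulled

batch_average = sum(rewards) / len(rewards)
print(estimate, batch_average)  # the two values agree up to floating-point error
```

The step size shrinks as the arm is pulled more often, so early rewards move the estimate a lot and later ones only fine-tune it.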

Self-Supervised Learning: Manufacturing Your Own Labels
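
As a rough sketch of what "manufacturing your own labels" can look like, the function below hides one word at a time and keeps the hidden word as the label. The function name, the example sentence, and the choice to mask every position in turn are assumptions made for illustration; real pretraining pipelines typically mask a random subset of tokens rather than whole words one by one.

```python
def make_masked_pairs(sentence, mask_token="[MASK]"):
    """Turn one unlabeled sentence into (input, label) training pairs."""
    words = sentence.split()
    pairs = []
    for i, hidden_word in enumerate(words):
        masked = words[:i] + [mask_token] + words[i + 1:]
        pairs.append((" ".join(masked), hidden_word))
    return pairs

pairs = make_masked_pairs("the bakery is on elm street")
for model_input, label in pairs:
    print(f"{model_input!r} -> {label!r}")
```

No human wrote any of these labels, yet each pair has the same input-output shape a supervised model trains on.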

Visual Explanations

The fastest way to scan all four types at once:

Type	Labels?	Goal	Example
Supervised	Yes, human-provided	Predict a known answer	Spam detection
Unsupervised	No	Discover structure	Customer segmentation
Reinforcement	No — a reward signal instead	Learn a strategy	Game-playing agents
Self-supervised	Auto-generated from the data	Learn a representation	LLM pretraining

Here's the shape of what's "given" versus what's "learned" for each type:

flowchart TD
    Sup[Supervised Learning] --> SupQ["Given: inputs + correct answers<br/>Learn: a mapping from input to answer"]
    Uns[Unsupervised Learning] --> UnsQ["Given: inputs only<br/>Learn: hidden structure or grouping"]
    Rein[Reinforcement Learning] --> ReiQ["Given: an environment + reward signal<br/>Learn: a strategy that maximizes reward"]
    Self[Self-Supervised Learning] --> SelQ["Given: raw unlabeled data<br/>Learn: by predicting parts of the data from other parts"]

If you're facing a new problem and aren't sure which type it is, walk through this:

flowchart TD
    Q1{Do you have human-provided<br/>correct answers for examples?} -->|Yes| Sup3[Supervised Learning]
    Q1 -->|No| Q2{Can you generate labels<br/>automatically from the raw data?}
    Q2 -->|Yes| Self3[Self-Supervised Learning]
    Q2 -->|No| Q3{Do you have an environment<br/>that returns rewards over time?}
    Q3 -->|Yes| Rein3[Reinforcement Learning]
    Q3 -->|No, just raw data,<br/>looking for structure| Uns3[Unsupervised Learning]
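
The same three questions can be written as a small function. This is only a sketch of the flowchart's logic; the function and parameter names are made up for illustration.

```python
def diagnose_problem_type(has_human_labels, can_generate_labels, has_reward_environment):
    """Walk the three questions in order and return the matching learning type."""
    if has_human_labels:
        return "supervised"
    if can_generate_labels:
        return "self-supervised"
    if has_reward_environment:
        return "reinforcement"
    return "unsupervised"

# Spam filter with hand-labeled emails.
print(diagnose_problem_type(True, False, False))
# Customer segmentation with nothing but purchase records.
print(diagnose_problem_type(False, False, False))
```

The order of the checks matters: a problem with human labels is treated as supervised even if a reward signal is also available.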

The reinforcement learning loop specifically looks different from the other three — there's no fixed dataset, just a continuous cycle:

flowchart LR
    A[Agent] -->|chooses an action| E[Environment]
    E -->|new state| A
    E -->|reward or penalty| A
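
A minimal sketch of that cycle in code, assuming a hypothetical five-cell corridor as the environment and an agent that picks moves at random. The `Corridor` class, the reward of 1 at the goal, and the 1000-step cap are all illustrative choices. The point is the shape of the loop: an action goes in, a new state and a reward come back.

```python
import random

class Corridor:
    """Hypothetical environment: positions 0..4, with the goal at position 4."""

    def __init__(self):
        self.position = 0

    def step(self, action):
        # action is -1 (left) or +1 (right); the walls clamp the position.
        self.position = max(0, min(4, self.position + action))
        done = self.position == 4
        reward = 1 if done else 0
        return self.position, reward, done

def random_policy(state, rng):
    """A policy that ignores the state and moves left or right at random."""
    return rng.choice([-1, 1])

rng = random.Random(0)
env = Corridor()
state, total_reward, steps, done = env.position, 0, 0, False

while not done and steps < 1000:
    action = random_policy(state, rng)      # agent chooses an action
    state, reward, done = env.step(action)  # environment returns state and reward
    total_reward += reward
    steps += 1

print(f"Finished at state {state} after {steps} steps, total reward {total_reward}")
```

A learning agent would replace `random_policy` with something that uses the rewards it has seen to choose better actions over time.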

And here's the actual plan for today's hands-on section — one dataset, viewed through two different lenses, plus a separate environment for the third:

flowchart LR
    Iris[Iris flower dataset] --> WL[Species labels visible] --> Sup2[Supervised: train a classifier]
    Iris --> WoL[Species labels hidden] --> Uns2[Unsupervised: run clustering]
    Bandit[Simulated slot machines] --> RL2[Reinforcement: learn through trial and reward]

Hands-On Example

With species labels visible, we'll train a classifier to predict species from measurements. This is supervised learning — the textbook case.
With species labels hidden, we'll run a clustering algorithm and see whether it can rediscover the species groupings on its own, using nothing but the measurements. This is unsupervised learning.
Switching problems entirely, we'll simulate a row of slot machines with unknown payout rates and build an agent that learns, purely through trial and reward, which one pays best. This is reinforcement learning, and notice it doesn't use the flower dataset at all. That's deliberate: RL isn't a different algorithm applied to the same kind of dataset; it's a fundamentally different kind of problem.

Environment Setup

python -m venv venv

# macOS / Linux
source venv/bin/activate
# Windows
venv\Scripts\activate

pip install numpy pandas scikit-learn

Complete Working Code

Every number quoted in the breakdown below came from actually running this script — nothing here is invented.

"""
day_02_types_of_ml.py
Three problem types, demonstrated honestly: a real supervised classifier,
a real unsupervised clustering run, and a reinforcement learning agent
built from scratch.
"""

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, adjusted_rand_score

iris = load_iris()
X, y = iris.data, iris.target

# --- 1. SUPERVISED LEARNING: species labels are visible during training ---
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
clf = LogisticRegression(max_iter=200)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
sup_accuracy = accuracy_score(y_test, preds)

print("=== SUPERVISED ===")
print(f"Test accuracy: {sup_accuracy:.3f}")

# --- 2. UNSUPERVISED LEARNING: species labels are never shown to the model ---
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
cluster_assignments = kmeans.fit_predict(X)  # note: y is never passed in here
ari = adjusted_rand_score(y, cluster_assignments)  # y used only to *score* afterward

print("\n=== UNSUPERVISED ===")
print(f"Adjusted Rand Index (cluster quality vs. true species): {ari:.3f}")

# --- 3. REINFORCEMENT LEARNING: a multi-armed bandit, learned from scratch ---
np.random.seed(42)
true_reward_probs = [0.20, 0.50, 0.35]   # unknown to the agent
n_arms = len(true_reward_probs)
n_rounds = 2000
epsilon = 0.1                            # 10% of the time, explore at random

estimated_values = np.zeros(n_arms)
pull_counts = np.zeros(n_arms)
total_reward = 0

for t in range(n_rounds):
    if np.random.rand() < epsilon:
        arm = np.random.randint(n_arms)          # explore
    else:
        arm = np.argmax(estimated_values)        # exploit the best-known arm

    reward = 1 if np.random.rand() < true_reward_probs[arm] else 0
    pull_counts[arm] += 1
    estimated_values[arm] += (reward - estimated_values[arm]) / pull_counts[arm]
    total_reward += reward

print("\n=== REINFORCEMENT LEARNING ===")
print(f"True reward probabilities:   {true_reward_probs}")
print(f"Learned value estimates:     {np.round(estimated_values, 3).tolist()}")
print(f"Times each arm was pulled:   {pull_counts.astype(int).tolist()}")
print(f"Total reward: {total_reward} / {n_rounds} rounds")

Running this prints:

=== SUPERVISED ===
Test accuracy: 0.933

=== UNSUPERVISED ===
Adjusted Rand Index (cluster quality vs. true species): 0.730

=== REINFORCEMENT LEARNING ===
True reward probabilities:   [0.2, 0.5, 0.35]
Learned value estimates:     [0.184, 0.491, 0.314]
Times each arm was pulled:   [234, 1696, 70]
Total reward: 898 / 2000 rounds
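
To isolate what exploration contributes, here is a standard-library re-creation of the same bandit loop, run once with no exploration and once with epsilon = 0.1. It is a sketch, not the article's script: it uses Python's `random` module instead of NumPy, so its numbers will not match the output above, and `run_bandit` is a name made up for this example.

```python
import random

def run_bandit(epsilon, true_reward_probs, n_rounds, seed):
    """Epsilon-greedy bandit with sample-average value estimates."""
    rng = random.Random(seed)
    n_arms = len(true_reward_probs)
    estimates = [0.0] * n_arms
    pulls = [0] * n_arms
    total_reward = 0
    for _ in range(n_rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore
        else:
            # max() returns the first maximal element, matching np.argmax,
            # which also breaks ties toward the lowest index.
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = 1 if rng.random() < true_reward_probs[arm] else 0
        pulls[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]
        total_reward += reward
    return pulls, total_reward

probs = [0.20, 0.50, 0.35]
greedy_pulls, greedy_reward = run_bandit(0.0, probs, 2000, seed=42)
explore_pulls, explore_reward = run_bandit(0.1, probs, 2000, seed=42)

print("epsilon = 0.0 pulls:", greedy_pulls, "reward:", greedy_reward)
print("epsilon = 0.1 pulls:", explore_pulls, "reward:", explore_reward)
```

With epsilon = 0.0 the pull counts come out as [2000, 0, 0] no matter what the seed is. Every estimate starts at zero, ties go to arm 0, and arm 0's average can never drop below the untouched zeros of the other arms, so the agent never tries anything else.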

Code Breakdown

Common Mistakes

Assuming every ML problem needs labels. If you find yourself trying to manufacture "correct answers" for a segmentation or grouping problem, stop — that's very often a sign you actually have an unsupervised problem.
Judging clustering by "accuracy" against labels it never saw. An ARI of 0.73 isn't a failing grade. The algorithm was never optimizing to match your labels — it was optimizing geometric closeness. Don't grade a model on a goal it was never given.
Treating reinforcement learning like supervised learning. There's no "correct action" given to an RL agent at each step — only a reward after the fact. Beginners often expect RL training to look like a labeled dataset, then get confused when it doesn't.
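Forgetting to explore. Set epsilon = 0 in the code above and the agent never leaves arm 0: every estimate starts at zero, np.argmax breaks ties toward the first index, and arm 0's running average can never fall below the untouched zeros of the other arms. Arm 0 happens to be the worst of the three here, which makes this a textbook case of premature exploitation.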
Confusing self-supervised with unsupervised. Both skip human labels, but self-supervised learning still trains on input-output pairs — the labels are just generated automatically from the data. This distinction comes up constantly in interviews about LLM pretraining.
Picking the number of clusters and treating it as "discovered." n_clusters=3 was a choice we made because we happened to know there were 3 species. In a real unsupervised problem, you don't get to peek at the answer — choosing k honestly is its own (important) skill, covered later in this series.
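
One concrete reason plain accuracy misleads for clustering: cluster IDs are arbitrary names, so cluster 2 may correspond to species 0. The sketch below uses made-up label lists to compare naive position-by-position agreement with the agreement under the best possible renaming of clusters. The `agreement` helper is illustrative only.

```python
from itertools import permutations

true_species = [0, 0, 0, 1, 1, 1, 2, 2, 2]
cluster_ids = [2, 2, 2, 0, 0, 1, 1, 1, 1]  # hypothetical clustering output

def agreement(labels_a, labels_b):
    """Fraction of positions where the two label lists match."""
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

# Naive comparison treats cluster ID 2 as if it meant species 2.
naive = agreement(true_species, cluster_ids)

# Try every renaming of the three cluster IDs and keep the best match.
best = max(
    agreement(true_species, [mapping[c] for c in cluster_ids])
    for mapping in permutations(range(3))
)

print(f"naive agreement: {naive:.3f}")
print(f"best agreement after renaming clusters: {best:.3f}")
```

The grouping is nearly right (eight of nine points), yet the naive score makes it look almost random. A score like the adjusted Rand index used in the script above does not depend on how the clusters happen to be numbered, which is why it is the fairer check.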

Best Practices

Diagnose the problem type before touching an algorithm. The decision framework in the Visual Explanations section above takes about ten seconds to run through — and ten seconds spent there can save you weeks spent building the wrong kind of system.
Pair quantitative and qualitative checks for unsupervised work. When ground truth happens to exist (like species labels here), use it to sanity-check your clusters, and also look at the clusters directly: their sizes, their centers, a few members of each. Don't treat the ground truth as the objective the algorithm was solving for.
Tune exploration deliberately in RL; don't default to a fixed epsilon forever. Many production bandit systems decay epsilon over time: explore heavily early, exploit more as confidence grows.
Start with the simplest version of each problem type before reaching for its deep-learning cousin. Logistic regression, k-means, and a tabular bandit solve a surprising number of real business problems before you ever need a neural network.
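
One possible shape for a decaying exploration rate, sketched under assumptions: the exponential form, the starting value, the floor, and the decay constant below are illustrative defaults, not recommendations from the article. Other schedules exist and may suit a given system better.

```python
def decayed_epsilon(round_index, start=1.0, floor=0.01, decay=0.995):
    """Exponentially shrink epsilon from `start`, never going below `floor`."""
    return max(floor, start * decay ** round_index)

for t in (0, 100, 500, 2000):
    print(t, round(decayed_epsilon(t), 4))
```

In a bandit loop, this value would replace the fixed epsilon on each round. The floor keeps a small amount of exploration alive so the agent can still notice if payout rates drift.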

Production Perspective

The bottleneck moves; it never disappears. Supervised learning's bottleneck is labeling. Unsupervised learning's bottleneck is evaluation. Reinforcement learning's bottleneck is safe exploration. Knowing which bottleneck you're signing up for, before you commit to a problem type, is the difference between an engineer who ships and one who's still arguing about algorithms in week three.

Real-World Applications

Supervised — credit scoring, medical image classifiers, the spam filter from Day 1, house price prediction.
Unsupervised — customer segmentation for marketing teams, anomaly detection in network security logs, topic discovery across large document collections.
Reinforcement learning — game-playing agents (AlphaGo, Atari-playing systems), robotics control, datacenter cooling optimization, and — closer to most engineers' daily reality — contextual bandits powering ad and content ranking.
Self-supervised — pretraining for today's large language models and vision models, which we'll cover directly in Phase 6.

Interview Questions

Self-Assessment

Can you explain, in your own words, why you can't grade a clustering model's "accuracy" the same way you grade a classifier?
Can you implement the exploration-exploitation logic from this article's bandit, from scratch, without looking at the code again?
Can you name a real-world problem at your own company or project that's secretly unsupervised, even though someone is currently trying to solve it with labels?
Can you explain why self-supervised learning isn't simply "unsupervised learning with extra steps"?

Portfolio Challenges

Beginner Challenge

Intermediate Challenge

Advanced Challenge

Advanced Insights

