The Wozniak Coffee Test: Why AI Still Can’t Make Coffee in Your Kitchen#
Fabricio Ceolin
Independent Researcher
Abstract#
Steve Wozniak proposed a deceptively simple AI benchmark: walk into an unfamiliar house, find the coffee, locate the coffee maker, and brew a good cup. Unlike the Turing Test, this demands embodied intelligence—operating in novel environments with incomplete information while executing multi-step physical tasks. This article proposes a neurosymbolic architecture inspired by Kahneman’s dual-process theory and recent advances in game-playing agents. We describe a Blending Module that dynamically arbitrates between neural perception (System 1) and symbolic planning (System 2), enabling robots to explore unknown kitchens, form hypotheses about object locations, and adapt plans when assumptions fail. Using The Edge Agent (TEA) framework, we demonstrate how Prolog-based world modeling combined with LLM-driven perception could finally enable AI to make coffee in your kitchen.
Keywords: Wozniak Coffee Test, Embodied AI, Dual-Process Theory, Neurosymbolic Robotics, Task Planning
1. Introduction#
Steve Wozniak once proposed a deceptively simple benchmark for artificial intelligence: An AI would be truly intelligent when it could walk into an unfamiliar house, find the coffee, locate the coffee maker, and brew a good cup of coffee.
This “Wozniak Coffee Test” has remained unsolved for years — not because we lack powerful models, but because it exposes fundamental gaps in how AI systems reason about the world. Unlike the Turing Test, which measures conversational ability, Wozniak’s test demands embodied intelligence: the ability to operate in novel environments with incomplete information, combine prior knowledge with real-time perception, and execute multi-step physical tasks.
This article proposes a neurosymbolic architecture for solving the Wozniak Coffee Test, inspired by Daniel Kahneman’s dual-process theory and recent advances in neurosymbolic game-playing agents. We’ll explore how combining neural perception (System 1) with symbolic reasoning (System 2) through a dynamic Blending Module could finally enable AI to make coffee in your kitchen.
2. The Challenge: Why Can’t AI Make Coffee?#
Let’s decompose the Wozniak Coffee Test into its fundamental challenges:
1. Absence of Initial Conditions#
The AI enters an unknown environment with no prior mapping:
- Where is the kitchen?
- Where is the coffee stored?
- What type of coffee maker is available?
- Where are the cups, water, filters?
2. Commonsense World Models#
The AI must have implicit models of:
- What a typical kitchen looks like
- Where coffee is usually stored (cabinet, counter, pantry)
- The functional relationships between objects (filter → coffee maker → water)
- The process of brewing coffee (steps, timing, completion criteria)
3. Object Recognition Under Uncertainty#
The AI must identify:
- Coffee (ground, beans, pods?) among similar-looking substances
- Coffee maker (drip, French press, espresso, Keurig?)
- Filters (paper, metal mesh, built-in?)
- Related tools (kettle, grinder, measuring scoop)
4. Functional Reasoning#
Understanding not just what objects are, but how they work:
- This knob controls the burner
- This button starts the brewing cycle
- This lever releases the water reservoir
- This light indicates “ready”
5. Completion Detection#
Perhaps the hardest part — knowing when the coffee is “done”:
- Drip coffee: Sound changes, dripping stops
- French press: Time-based (4 minutes)
- Espresso: Color and flow rate change

As Wozniak’s test implies, even an experienced brewer like “Dona Maria” checks before declaring it ready.
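Each of these criteria reduces to a predicate over whatever the robot can sense. Here is a minimal Python sketch of that idea; the sensor fields, thresholds, and the `BrewSensors` type are illustrative assumptions, not an existing robot API.

```python
# Hypothetical sketch: the per-method completion criteria above as predicates
# over simulated sensor readings. Field names and thresholds are made up.
from dataclasses import dataclass

@dataclass
class BrewSensors:
    drip_rate_ml_s: float    # flow out of the brew basket
    carafe_level_ml: float   # liquid collected so far
    elapsed_s: float         # time since brewing started
    flow_stopped: bool       # espresso shot flow ended

COMPLETION_CHECKS = {
    # Drip: dripping has (almost) stopped and something actually brewed
    "drip": lambda s: s.drip_rate_ml_s < 0.1 and s.carafe_level_ml > 0,
    # French press: purely time based (about 4 minutes of steeping)
    "french_press": lambda s: s.elapsed_s >= 240,
    # Espresso: flow has stopped and a minimal shot volume was extracted
    "espresso": lambda s: s.flow_stopped and s.carafe_level_ml >= 30,
}

def brewing_complete(method: str, sensors: BrewSensors) -> bool:
    check = COMPLETION_CHECKS.get(method)
    return bool(check and check(sensors))

# Example: a drip brew that has stopped dripping with 600 ml in the carafe
print(brewing_complete("drip", BrewSensors(0.0, 600, 310, False)))  # True
```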
flowchart TB
subgraph Challenge["Wozniak Test Challenges"]
A[Unknown Environment] --> B[Find Kitchen]
B --> C[Locate Coffee]
C --> D[Identify Coffee Maker Type]
D --> E[Find Accessories]
E --> F[Execute Brewing Process]
F --> G[Detect Completion]
end
subgraph Gaps["Current AI Gaps"]
H[No Embodied Reasoning]
I[Weak Object Permanence]
J[Limited Functional Understanding]
K[No Uncertainty Quantification]
end
Challenge -.-> Gaps
3. The Inspiration: Kahneman’s Dual-Process Theory#
Daniel Kahneman’s research on human cognition describes two distinct modes of thinking:
System 1: Fast, Intuitive, Automatic#
- Processes sensory input instantly
- Pattern matching from experience
- Emotional reactions and reflexes
- “Hot” cognition — immediate responses
Example: Seeing a stove burner glowing red and instantly knowing “hot, don’t touch.”
System 2: Slow, Deliberate, Logical#
- Sequential reasoning and planning
- Mathematical calculation
- Conscious decision-making
- “Cold” cognition — analytical processing
Example: Calculating whether there’s enough coffee for 6 cups given the bag’s remaining weight.
The Key Insight#
Humans don’t choose between systems — we blend them dynamically based on context:
| Situation | System 1 Weight | System 2 Weight |
|---|---|---|
| Hot oil splatters | 95% | 5% |
| Following a new recipe | 20% | 80% |
| Routine morning coffee | 70% | 30% |
| Coffee maker malfunction | 30% | 70% |
This dynamic blending is what current AI lacks — and what we need to solve the Wozniak Test.
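As a rough illustration, the table can be read as a context-to-weights lookup. A learned meta-policy (Section 4) would replace the hard-coded numbers in the sketch below, which simply restate the rows above.

```python
# Toy illustration only: the blending table as a hard-coded lookup.
SITUATION_WEIGHTS = {
    "hot_oil_splatters":        {"system1": 0.95, "system2": 0.05},
    "following_new_recipe":     {"system1": 0.20, "system2": 0.80},
    "routine_morning_coffee":   {"system1": 0.70, "system2": 0.30},
    "coffee_maker_malfunction": {"system1": 0.30, "system2": 0.70},
}

def blend_weights(situation: str) -> dict:
    # Fall back to an even split for situations outside the table
    return SITUATION_WEIGHTS.get(situation, {"system1": 0.5, "system2": 0.5})

print(blend_weights("hot_oil_splatters"))  # {'system1': 0.95, 'system2': 0.05}
```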
4. The Shindo Architecture: Neurosymbolic Game Agents#
Recent work by Daiki Shindo on neurosymbolic agents for Atari games demonstrates a practical implementation of System 1/System 2 blending:
Architecture Components#
flowchart LR
subgraph Input["Sensory Input"]
P[Pixel Data]
O[Object Representations]
end
subgraph S1["System 1 (Neural)"]
CNN[CNN Policy]
end
subgraph S2["System 2 (Symbolic)"]
LP[Logical Policy]
end
subgraph Blend["Blending Module"]
BM[Meta-Policy]
end
subgraph Output["Action"]
A[Combined Action]
end
P --> CNN
O --> LP
CNN --> BM
LP --> BM
BM --> A
style S1 fill:#FFB6C1
style S2 fill:#90EE90
style Blend fill:#87CEEB
System 1: Convolutional Neural Network#
- Processes raw pixels
- Learns reactive policies (dodge enemies, time jumps)
- Fast inference (~1 ms)
- Handles novel visual situations through generalization
System 2: Differentiable Logic#
- Operates on object-centric representations (coordinates, velocities)
- Implements explicit reasoning (count divers, plan routes)
- Slower but precise
- Handles compositional goals
The Blending Module#
The innovation is the meta-policy that dynamically allocates weight:
# Pseudocode for blending
def blend(state, s1_action, s2_action):
    # Meta-policy estimates the optimal blend ratio
    alpha = blending_network(state)  # 0.0 to 1.0
    # Weighted combination of the two proposed actions
    action = alpha * s1_action + (1 - alpha) * s2_action
    return action
Results in Games#
| Game | Challenge | System 1 Role | System 2 Role | Result |
|---|---|---|---|---|
| Kangaroo | Save baby at top | Punch blocking monkeys | Plan route up ladders | Solves sparse reward |
| Seaquest | Rescue divers | Avoid enemies, collect divers | Count divers (6), surface for air | Optimal surfacing |
The key finding: joint training of all components (CNN, logic policy, blending module) via end-to-end gradient descent produces emergent behaviors that neither system alone could achieve.
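To make “end-to-end” concrete, here is a hedged PyTorch sketch of the idea rather than Shindo’s actual code: plain MLPs stand in for the CNN and the differentiable logic policy, and the dimensions and dummy loss are assumptions. Because the blended logits are a convex combination weighted by alpha, one gradient step updates System 1, System 2, and the blending module together.

```python
import torch
import torch.nn as nn

class BlendedPolicy(nn.Module):
    def __init__(self, n_actions: int, pixel_dim: int, object_dim: int):
        super().__init__()
        self.system1 = nn.Sequential(nn.Linear(pixel_dim, 64), nn.ReLU(),
                                     nn.Linear(64, n_actions))      # stand-in for the CNN policy
        self.system2 = nn.Sequential(nn.Linear(object_dim, 64), nn.ReLU(),
                                     nn.Linear(64, n_actions))      # stand-in for the logic policy
        self.meta = nn.Sequential(nn.Linear(pixel_dim + object_dim, 32), nn.ReLU(),
                                  nn.Linear(32, 1), nn.Sigmoid())   # blending module

    def forward(self, pixels, objects):
        alpha = self.meta(torch.cat([pixels, objects], dim=-1))     # blend weight in (0, 1)
        logits = alpha * self.system1(pixels) + (1 - alpha) * self.system2(objects)
        return torch.log_softmax(logits, dim=-1), alpha

policy = BlendedPolicy(n_actions=6, pixel_dim=128, object_dim=16)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# One dummy policy-gradient-style step: a single loss updates all three parts.
log_probs, alpha = policy(torch.randn(4, 128), torch.randn(4, 16))
loss = -(log_probs[:, 0] * torch.ones(4)).mean()   # placeholder "advantage" of 1
optimizer.zero_grad()
loss.backward()
optimizer.step()
```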
5. Proposed Architecture: Coffee Agent#
We now apply these insights to the Wozniak Coffee Test. Here’s a neurosymbolic architecture that could solve it:
High-Level Design#
flowchart TB
subgraph Perception["Perception Layer"]
V[Vision Model<br/>Object Detection]
S[Spatial Model<br/>Scene Graph]
T[Tactile/Force<br/>Sensors]
end
subgraph S1["System 1: Neural Policies"]
NAV[Navigation Policy]
MAN[Manipulation Policy]
SAF[Safety Policy]
end
subgraph S2["System 2: Symbolic Reasoning"]
WM[World Model<br/>Prolog KB]
PL[Planner<br/>HTN/STRIPS]
MON[Monitor<br/>Completion Detection]
end
subgraph BM["Blending Module"]
META[Context-Aware<br/>Meta-Policy]
PRIOR[Prior Knowledge<br/>Injection]
end
subgraph Action["Motor Control"]
ACT[Action Execution]
end
V --> S1
V --> S2
S --> S2
T --> S1
S1 --> META
S2 --> META
PRIOR --> META
META --> ACT
style S1 fill:#FFB6C1
style S2 fill:#90EE90
style BM fill:#87CEEB
Component Details#
Perception Layer#
Vision Model: Object detection and recognition
- Pre-trained on household objects (ImageNet, COCO)
- Fine-tuned on kitchen-specific items
- Outputs: bounding boxes, object classes, confidence scores

Spatial Model: Scene understanding
- Constructs scene graphs (object relationships)
- Tracks object permanence (coffee is in the cabinet even when the door is closed)
- Outputs: spatial predicates for System 2

Tactile Sensors: Physical feedback
- Force sensing for manipulation
- Temperature sensing for safety
- Outputs: continuous signals for System 1
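As an example of the “spatial predicates for System 2” hand-off, here is a small sketch that turns detector output into Prolog-style facts. The detection format, the 0.5 confidence cutoff, and the containment heuristic are all assumptions made for illustration.

```python
from typing import Dict, List

def to_predicates(detections: List[Dict]) -> List[str]:
    """Map {label, confidence, box=(x1, y1, x2, y2)} detections to facts."""
    facts = []
    for d in detections:
        if d["confidence"] < 0.5:
            continue                      # drop low-confidence detections
        facts.append(f"object({d['label']}).")
        for other in detections:
            if other is d or other["confidence"] < 0.5:
                continue
            # crude 2-D containment check: d's box lies inside other's box
            (ax1, ay1, ax2, ay2), (bx1, by1, bx2, by2) = d["box"], other["box"]
            if bx1 <= ax1 and by1 <= ay1 and ax2 <= bx2 and ay2 <= by2:
                facts.append(f"inside({d['label']}, {other['label']}).")
    return facts

dets = [
    {"label": "coffee_bag", "confidence": 0.91, "box": (120, 80, 180, 160)},
    {"label": "cabinet",    "confidence": 0.88, "box": (100, 40, 400, 300)},
]
print("\n".join(to_predicates(dets)))
# object(coffee_bag). / inside(coffee_bag, cabinet). / object(cabinet).
```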
System 1: Neural Policies#
Navigation Policy: Move through the environment
- Input: RGB-D image, current pose
- Output: velocity commands
- Training: imitation learning from human demonstrations

Manipulation Policy: Interact with objects
- Input: object pose, gripper state
- Output: end-effector trajectory
- Training: reinforcement learning with shaped rewards

Safety Policy: Prevent harm
- Input: full sensor state
- Output: override signals (stop, retract, alert)
- Training: safety constraints, adversarial scenarios
System 2: Symbolic Reasoning#
World Model (Prolog Knowledge Base):
% Commonsense knowledge about kitchens
typical_location(coffee, [cabinet, counter, pantry]).
typical_location(coffee_maker, [counter]).
typical_location(cups, [cabinet, dish_rack]).
typical_location(water_source, [sink, refrigerator]).

% Functional relationships
requires(coffee_maker, [coffee, water, filter, cup]).
requires(drip_coffee, [ground_coffee]).
requires(pod_coffee, [coffee_pod]).

% Process models
brewing_complete(drip) :-
    not(dripping),
    carafe_level(L), L > 0.

brewing_complete(french_press) :-
    elapsed_time(T), T >= 240.   % 4 minutes

brewing_complete(espresso) :-
    flow_stopped,
    shot_volume(V), V >= 30.     % 30 ml minimum
Planner (Hierarchical Task Network):
# HTN for coffee brewing
task: make_coffee
subtasks:
  - find_kitchen:
      methods: [visual_search, ask_human, explore_systematically]
  - locate_coffee_maker:
      methods: [check_counter, check_cabinets, ask_human]
  - identify_coffee_maker_type:
      methods: [visual_classification, read_labels, functional_test]
  - gather_supplies:
      depends_on: [coffee_maker_type]
      subtasks: [find_coffee, find_filter, find_cup, get_water]
  - execute_brewing:
      depends_on: [coffee_maker_type, supplies_gathered]
  - monitor_completion:
      methods: [sensor_based, time_based, visual_inspection]
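A toy decomposition loop makes the HTN idea concrete: compound tasks expand through a method, primitive tasks become plan steps. The sketch below mirrors the task names above but is not the TEA planner.

```python
# Minimal, assumption-laden sketch of HTN decomposition for the task network
# above: compound tasks expand via their (single) method, primitive tasks are
# emitted directly as plan steps.
METHODS = {
    "make_coffee": [["find_kitchen", "locate_coffee_maker",
                     "identify_coffee_maker_type", "gather_supplies",
                     "execute_brewing", "monitor_completion"]],
    "gather_supplies": [["find_coffee", "find_filter", "find_cup", "get_water"]],
}

def decompose(task: str, plan=None) -> list:
    """Depth-first expansion of a compound task into primitive actions."""
    plan = [] if plan is None else plan
    if task not in METHODS:            # primitive task: execute as-is
        plan.append(task)
        return plan
    for subtask in METHODS[task][0]:   # take the first (only) method here
        decompose(subtask, plan)
    return plan

print(decompose("make_coffee"))
# ['find_kitchen', 'locate_coffee_maker', 'identify_coffee_maker_type',
#  'find_coffee', 'find_filter', 'find_cup', 'get_water',
#  'execute_brewing', 'monitor_completion']
```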
Monitor: Completion detection
- Tracks brewing state via sensors and vision
- Applies completion criteria from the knowledge base
- Handles uncertainty (“check before declaring done”)
Blending Module#
Meta-Policy Network:
class BlendingModule(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.context_encoder = TransformerEncoder(...)
        self.blend_head = MLP(hidden_dim, 3)  # [nav, manip, safety] weights

    def forward(self, state, s1_actions, s2_plan):
        # Encode full context
        context = self.context_encoder(state)
        # Compute blend weights per action type
        weights = torch.softmax(self.blend_head(context), dim=-1)
        # Apply prior knowledge constraints
        if state.danger_detected:
            weights[2] = 0.9  # Override to safety
        return self.blend(s1_actions, s2_plan, weights)
Prior Knowledge Injection:
Users can inject commonsense rules:
blending_rules:
  - condition: "danger_nearby OR hot_surface_detected"
    action: "prioritize_safety_policy"
    weight_override: {safety: 0.95}
  - condition: "navigating_unknown_space"
    action: "blend_exploration"
    weight_override: {navigation: 0.7, planning: 0.3}
  - condition: "manipulating_fragile_object"
    action: "prioritize_precision"
    weight_override: {manipulation: 0.4, planning: 0.6}
  - condition: "coffee_brewing_in_progress"
    action: "monitor_mode"
    weight_override: {monitoring: 0.8, safety: 0.2}
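One way such rules could be applied, sketched in Python with OR-conditions reduced to simple flag checks (a stand-in for a real rule engine; weight renormalization is omitted):

```python
# Hedged sketch of prior-knowledge injection: user rules (mirroring the YAML
# above) override the learned weights when their condition fires.
RULES = [
    {"condition": ["danger_nearby", "hot_surface_detected"],   # OR semantics
     "override": {"safety": 0.95}},
    {"condition": ["navigating_unknown_space"],
     "override": {"navigation": 0.7, "planning": 0.3}},
    {"condition": ["coffee_brewing_in_progress"],
     "override": {"monitoring": 0.8, "safety": 0.2}},
]

def apply_priors(state_flags: set, learned_weights: dict) -> dict:
    """Return learned weights, patched by the first matching rule's override."""
    for rule in RULES:
        if any(flag in state_flags for flag in rule["condition"]):
            return {**learned_weights, **rule["override"]}
    return learned_weights

learned = {"navigation": 0.5, "planning": 0.3, "safety": 0.1, "monitoring": 0.1}
print(apply_priors({"hot_surface_detected"}, learned))
# safety jumps to 0.95 regardless of what the meta-policy proposed
```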
6. Implementation with The Edge Agent#
We’ve implemented two runnable agents that demonstrate these concepts. Unlike conceptual architectures, these actually execute and produce results.
Agent 1: System 1/System 2 Blending Demo#
This agent demonstrates Kahneman’s dual-process theory in action — a delivery robot that dynamically blends reactive (System 1) and deliberative (System 2) decision-making based on danger level.
File: examples/wozniak-test/system1-system2-blending.yaml
Key Components:
- System 1 Node (Lua): Fast reactive responses to obstacles
- System 2 Node (Prolog): Deliberate route planning with constraints
- Blending Module (Lua): Context-aware weight allocation
# Core blending logic (simplified)
- name: blend_systems
  language: lua
  run: |
    local danger = state.danger_level or 0.2
    local s1_weight, s2_weight

    -- Dynamic weight allocation based on context
    if danger > 0.7 then
      -- Emergency: 90% reactive, 10% planning
      s1_weight, s2_weight = 0.9, 0.1
    elseif danger > 0.4 then
      -- Moderate: 60% reactive, 40% planning
      s1_weight, s2_weight = 0.6, 0.4
    else
      -- Safe: 20% reactive, 80% planning
      s1_weight, s2_weight = 0.2, 0.8
    end

    return {blend_weights = {system1 = s1_weight, system2 = s2_weight}}
Agent 2: Coffee Agent Simulation#
This agent simulates the full Wozniak Coffee Test using symbolic reasoning. It demonstrates:
- Commonsense Knowledge Base (Prolog): Typical kitchen layouts, item locations
- Uncertainty-Aware Search (Lua): Probabilistic item finding
- Process Execution (Lua): Step-by-step brewing with failure handling
- Dona Maria Verification (Prolog): Completion confidence checking
File: examples/wozniak-test/coffee-agent-simulation.yaml
name: coffee-agent-simulation
description: Simulates the Wozniak Coffee Test using symbolic reasoning

state_schema:
  kitchen_type: str
  coffee_maker_type: str
  world_model: dict
  search_results: dict
  brewing_state: dict
  verification: dict
  final_result: str

nodes:
  # Prolog: Commonsense knowledge base for kitchens
  - name: build_world_model
    language: prolog
    run: |
      % Typical locations for coffee-related items
      typical_location(coffee, [cabinet, counter, pantry]).
      typical_location(coffee_maker, [counter]).
      typical_location(cups, [cabinet, dish_rack]).

      % Coffee maker requirements
      requires(drip, [ground_coffee, filter, water, cup]).
      requires(pod, [coffee_pod, water, cup]).
      requires(french_press, [coarse_coffee, hot_water, cup]).

      % Process steps by coffee maker type
      process_steps(drip, [add_filter, add_coffee, add_water, press_start, wait, pour]).
      process_steps(pod, [insert_pod, place_cup, press_button, wait]).

      state(kitchen_type, KitchenType),
      state(coffee_maker_type, CoffeeMakerType),
      (requires(CoffeeMakerType, Reqs) -> true ; Reqs = []),
      (process_steps(CoffeeMakerType, Steps) -> true ; Steps = []),
      return(world_model, _{
        kitchen_type: KitchenType,
        required_items: Reqs,
        process_steps: Steps
      }).

  # Lua: Search for items using commonsense-guided exploration
  - name: search_items
    language: lua
    run: |
      local world = state.world_model or {}
      local required = world.required_items or {}
      -- Simulate finding items based on typical locations
      -- (Full implementation in examples/wozniak-test/)

  # Prolog: The "Dona Maria Principle" - verify before declaring done
  - name: verify_completion
    language: prolog
    run: |
      state(brewing_state, BrewState),
      (get_dict(status, BrewState, completed) ->
        Passed = true, Confidence = 0.95
      ;
        Passed = false, Confidence = 0.0
      ),
      return(verification, _{passed: Passed, confidence: Confidence}).

edges:
  - from: __start__
    to: build_world_model
  - from: build_world_model
    to: search_items
  # ... (see full file for complete edges)
The full implementations are available in the repository under examples/wozniak-test/.
7. Try It Yourself#
Prerequisites#
# Install The Edge Agent (Python version for full Prolog support)
cd the_edge_agent/python
pip install -e .
Example 1: System 1/System 2 Blending#
Test how the blending weights change based on danger level:
Low Danger (System 2 Dominant):
python -m the_edge_agent.cli run examples/wozniak-test/system1-system2-blending.yaml \
--input '{"scenario": "delivery", "danger_level": 0.2, "distance_to_goal": 500}'
Output:
{
"blend_weights": {"system1": 0.2, "system2": 0.8},
"dominant_system": "System 2 (Deliberative)",
"action": "execute_plan",
"reasoning": "Following optimal plan"
}
High Danger (System 1 Dominant):
python -m the_edge_agent.cli run examples/wozniak-test/system1-system2-blending.yaml \
--input '{"scenario": "delivery", "danger_level": 0.8, "distance_to_goal": 500}'
Output:
{
"blend_weights": {"system1": 0.9, "system2": 0.1},
"dominant_system": "System 1 (Reactive)",
"action": "evade",
"direction": "swerve_right",
"reasoning": "Obstacle detected at close range - reflex avoidance"
}
Example 2: Coffee Agent Simulation#
Successful Brew (Modern Kitchen):
python -m the_edge_agent.cli run examples/wozniak-test/coffee-agent-simulation.yaml \
--input '{"kitchen_type": "modern", "coffee_maker_type": "drip"}'
Output:
{
"summary": {
"kitchen_analyzed": "modern",
"coffee_maker_used": "drip",
"items_found": true,
"brewing_completed": true,
"verified": true,
"confidence": 0.95,
"wozniak_test": "PASSED"
},
"final_result": "SUCCESS: Coffee ready!\nConfidence: 95%\nVerification: High confidence - ready to serve"
}
Failed Brew (Minimalist Kitchen):
python -m the_edge_agent.cli run examples/wozniak-test/coffee-agent-simulation.yaml \
--input '{"kitchen_type": "minimalist", "coffee_maker_type": "espresso"}'
Output:
{
"summary": {
"kitchen_analyzed": "minimalist",
"items_found": false,
"wozniak_test": "NEEDS_IMPROVEMENT"
},
"final_result": "FAILED: Could not complete coffee\nReason: Missing items: cup"
}
This demonstrates realistic failure modes — in a minimalist kitchen with lower item-finding probability, the agent may fail to locate all necessary items, just as a human might struggle in an unfamiliar, sparsely organized kitchen.
8. The Ambulance Driver Analogy#
Think of this architecture like an ambulance driver:
| Component | Ambulance Driver | Coffee Agent |
|---|---|---|
| System 1 | Reflexes: braking, swerving, throttle control | Reactive: obstacle avoidance, safety responses |
| System 2 | Route planning: fastest path to hospital | Task planning: HTN decomposition, search strategies |
| Blending | Context switch: traffic → reflexes; open road → planning | Context switch: danger → safety; unknown → exploration |
| Prior Knowledge | “Hospitals have emergency entrances at rear” | “Coffee is usually in upper cabinets” |
| Completion | “Arrived when parked at ER and patient handed off” | “Done when cup has coffee and brewing stopped” |
flowchart LR
subgraph Traffic["Heavy Traffic"]
direction TB
T1[System 1: 90%]
T2[System 2: 10%]
end
subgraph Highway["Open Highway"]
direction TB
H1[System 1: 30%]
H2[System 2: 70%]
end
subgraph Unknown["Unknown Area"]
direction TB
U1[System 1: 50%]
U2[System 2: 50%]
end
A((Start)) --> Traffic
Traffic --> Highway
Highway --> Unknown
Unknown --> B((Hospital))
style T1 fill:#FFB6C1
style H2 fill:#90EE90
style U1 fill:#FFB6C1
style U2 fill:#90EE90
9. Key Innovations Required#
To fully solve the Wozniak Test, several research advances are needed:
1. Robust Object-Centric Perception#
Current limitation: Shindo’s system requires “high-quality object representations.”
Solution needed:
- Zero-shot object detection for novel items
- Functional affordance recognition (“this knob can be turned”)
- Multi-modal fusion (vision + tactile + audio)
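Multi-modal fusion, for instance, can be prototyped as combining per-modality evidence for a single hypothesis. The sketch below uses a naive-Bayes style product of likelihood ratios; the numbers and modality entries are invented for the example.

```python
from math import prod

def fuse(likelihood_ratios: dict, prior: float = 0.5) -> float:
    """Naive-Bayes fusion of P(evidence | coffee) / P(evidence | not coffee) ratios."""
    ratio = prod(likelihood_ratios.values()) * (prior / (1 - prior))
    return ratio / (1 + ratio)

evidence = {
    "vision":  3.0,   # looks like coffee grounds (brown, granular)
    "tactile": 1.2,   # texture is only weakly informative
    "audio":   2.5,   # rattles like beans when the jar is shaken
}
print(round(fuse(evidence), 3))  # 0.9: likely coffee, but worth a final check
```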
2. Commonsense Reasoning at Scale#
Current limitation: Hand-coded knowledge bases don’t scale.
Solution needed:
- Extract commonsense from LLMs into structured KBs
- Learn spatial priors from embodied experience
- Transfer knowledge across environments
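The first item could start as something as simple as prompting an LLM for typical storage locations and parsing the answer into the same `typical_location/2` facts the Prolog world model already uses. The sketch below shows only the parsing step; the prompt, the model call, and the answer format are assumptions.

```python
import re

def answer_to_fact(item: str, llm_answer: str) -> str:
    """Turn a comma-separated location answer into a typical_location/2 fact."""
    locations = [re.sub(r"[^a-z_]", "", part.strip().lower().replace(" ", "_"))
                 for part in llm_answer.split(",") if part.strip()]
    return f"typical_location({item}, [{', '.join(locations)}])."

# Hypothetical LLM answer to: "Where is coffee usually stored in a kitchen?"
print(answer_to_fact("coffee", "cabinet, counter, pantry"))
# typical_location(coffee, [cabinet, counter, pantry]).
```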
3. Deep Planning Under Uncertainty#
Current limitation: No multi-step lookahead with unknown states.
Solution needed:
- Monte Carlo Tree Search with neural value estimates
- Robust planning that handles partial observability
- Recovery strategies when plans fail
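A small piece of this is sketched below: a belief over where the coffee might be, updated after each failed look, with the “plan” (which location to check next) recomputed from the current belief and a fallback when the belief is exhausted. The probabilities and locations are illustrative only.

```python
# Sketch of greedy replanning under partial observability.
def normalize(belief: dict) -> dict:
    total = sum(belief.values())
    return {k: v / total for k, v in belief.items()}

def search_with_replanning(belief: dict, world: dict) -> list:
    """Always look in the currently most probable location; replan after failure."""
    log, belief = [], dict(belief)
    while belief:
        location = max(belief, key=belief.get)     # replan from current belief
        log.append(f"look in {location} (P={belief[location]:.2f})")
        if world.get(location):                    # found it
            log.append(f"found coffee in {location}")
            return log
        belief.pop(location)                       # rule this location out ...
        belief = normalize(belief)                 # ... and renormalize
    log.append("coffee not found: ask a human")    # recovery strategy
    return log

prior = {"cabinet": 0.5, "counter": 0.3, "pantry": 0.2}
print("\n".join(search_with_replanning(prior, {"pantry": True})))
```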
4. Continuous Learning#
Current limitation: Static policies don’t adapt.
Solution needed:
- Online adaptation to new environments
- Few-shot learning of new appliances
- Human feedback integration
10. Conclusion#
The Wozniak Coffee Test remains unsolved because it requires the integration of capabilities that current AI systems handle in isolation:
| Capability | Current State | Required for Coffee Test |
|---|---|---|
| Object Recognition | Good on known objects | Robust on novel kitchen items |
| Spatial Reasoning | Limited | Full 3D scene understanding |
| Task Planning | Works in simulation | Real-world uncertainty handling |
| Manipulation | Lab demonstrations | Reliable in unknown environments |
| Commonsense | Implicit in LLMs | Explicit, actionable knowledge |
| System Blending | Research prototypes | Production-ready meta-policies |
The neurosymbolic architecture presented here — combining neural perception and reactive policies (System 1) with symbolic reasoning and planning (System 2) through a dynamic Blending Module — provides a blueprint for solving this challenge.
The key insight from Kahneman’s work is that intelligence is not choosing between fast and slow thinking — it’s knowing when to use each. Current AI systems are either all System 1 (deep learning) or all System 2 (classical planning). The Wozniak Test demands their integration.
When an AI can finally make coffee in Dona Maria’s kitchen — finding the coffee in her specific cabinet, using her particular coffee maker, and knowing when it’s ready by the same subtle cues she uses — we’ll have achieved something profound: not just artificial intelligence, but artificial wisdom.
11. References#
Kahneman, D. (2011). Thinking, Fast and Slow. Farrar, Straus and Giroux.
Shindo, D., et al. (2023). “Neurosymbolic Reinforcement Learning with Hybrid Action Spaces.”
Garcez, A., & Lamb, L. (2023). “Neurosymbolic AI: The 3rd Wave.” Artificial Intelligence Review.
The Edge Agent (TEA): fabceolin/the_edge_agent
Wozniak Coffee Test: Multiple interviews and tech conference discussions.
Appendix: Why “Dona Maria” Checks#
The observation that even experienced humans “check before declaring done” reveals something profound about completion detection:
- Sensory uncertainty: Visual, auditory, and olfactory cues all contribute
- Process variation: The same coffee maker behaves differently each time
- Quality judgment: “Done” isn’t just “stopped” — it’s “good coffee”
- Error recovery: Checking allows catching problems early
Any robust coffee-making AI must incorporate this “final verification” step — acknowledging that completion is a judgment, not just a state transition.
% The Dona Maria Principle
truly_complete(Task) :-
    mechanically_complete(Task),
    verified_by_inspection(Task),
    quality_acceptable(Task).

% Never skip the check
make_coffee_done :-
    brewing_stopped,
    \+ verified_complete,    % Haven't checked yet
    !,
    verify_completion.       % Do the "Dona Maria check"

make_coffee_done :-
    brewing_stopped,
    verified_complete,
    quality_check_passed.