HAA Rubric Guide — Day 2 — Teacher AI Literacy Assessment


Day 2 · Dimension 2 · ~6 hours

Extension Operation Capability

Designing and Evaluating Extensions — Hands-On Operation

D2 measures what you can do with AI. Prompt engineering logs must show that you deliberately externalised your pedagogical intent, not that you reached a result by trial and error. Each iteration must annotate what was wrong and why; this metacognitive annotation is the primary evidence. Output evaluation must be subject-specific.

Level 1–2

Revisions note “too long” or “off-topic” — no reference to pedagogical intent. Evaluations flag errors without connecting to student learning. Customised artefact is a generic template.

Level 3

Each revision explains what pedagogical function failed. Evaluations name the specific misconception the error would generate. Artefact references class ZPD and tested failure mode.

Level 4

Prompt log shows strategic theory of change. Evaluation identifies framing bias rooted in the subject’s disciplinary history.

4 TASKS · What you will produce on Day 2

01 Prompt Engineering Log

LOG-A · ≥8 iterations

Pick one real teaching challenge from your current unit. Across at least eight iterations, work with an AI tool to produce something that would actually help you teach that thing. For each iteration you record:

Pedagogical intent — one sentence, stated before you write the prompt. Not “better examples” — “examples that force a Year 9 student to distinguish between proportional and linear relationships by varying the initial value.”
Full prompt text. Every word, including the system prompt or persona. No paraphrasing.
AI output summary. What the AI produced — not your reaction yet.
Evaluation against the intent. Did this output serve the intent? Where did it fail? Be specific about which student and which misconception.
Revision rationale. Before you write the next prompt, state which pedagogical function was not served and what you are changing to serve it.

The fifth field, the revision rationale, is the most important. Teachers at Level 3 write rationales as a coherent story of diagnosing and adjusting. Teachers at Level 2 leave rationales like “too long” or “off-topic” that tell us nothing about pedagogical function. If you find yourself writing “too generic,” ask yourself: generic in what way, and at what cost to learning? Then write that instead.
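If you keep your log electronically, the five fields above can be sketched as a small record with a completeness check. This is a minimal illustration, not part of the assessment: the names `LogIteration`, `check_log`, and `FORMAT_ONLY_RATIONALES` are hypothetical, and the phrase list only covers the examples quoted in this guide. It can flag a missing field or a bare “too long”, but it cannot judge whether a rationale is pedagogically sound.

```python
from dataclasses import dataclass, fields

# Phrases this guide quotes as Level 2 rationales; illustrative, not exhaustive.
FORMAT_ONLY_RATIONALES = {"too long", "off-topic", "too generic"}

MIN_ITERATIONS = 8  # the task asks for at least eight iterations


@dataclass
class LogIteration:
    """One iteration of the log (hypothetical field names)."""
    pedagogical_intent: str   # one sentence, written before the prompt
    full_prompt: str          # every word, including system prompt or persona
    output_summary: str       # what the AI produced, not your reaction
    evaluation: str           # did the output serve the intent, and where did it fail?
    revision_rationale: str   # which pedagogical function failed, and what changes


def check_log(log: list[LogIteration]) -> list[str]:
    """Return a list of structural problems; an empty list means none were found."""
    problems = []
    if len(log) < MIN_ITERATIONS:
        problems.append(f"only {len(log)} iterations; at least {MIN_ITERATIONS} required")
    for number, entry in enumerate(log, start=1):
        # Every one of the five fields must contain some text.
        for f in fields(entry):
            if not getattr(entry, f.name).strip():
                problems.append(f"iteration {number}: '{f.name}' is empty")
        # A rationale that is only a format complaint names no pedagogical function.
        rationale = entry.revision_rationale.strip().strip(".").lower()
        if rationale in FORMAT_ONLY_RATIONALES:
            problems.append(
                f"iteration {number}: rationale '{rationale}' names no pedagogical function"
            )
    return problems
```

Calling `check_log(my_log)` before submission returns plain-language messages you can act on; passing the check says nothing about the level your log would be awarded.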

Evaluation criteria: D2.1 Extension Design Capability

02 Three Output Evaluation Reports

LOG-D · 3 reports

For each of three AI-generated materials relevant to your subject, write an evaluation that:

Identifies at least one error. Factual (hallucination), bias (implicit framing), or overconfidence (an answer that should have been hedged). Across your three reports, at least one must address a subtle error — not the obvious ones the tool itself would catch.
Explains why the error matters in your subject. “This is wrong” is not an evaluation. “This is wrong, and a student who believed it would form a misconception that undermines the next three lessons” is an evaluation. Trace the learning consequence.
Writes a corrected version. Short. Show what right looks like.

The subtle-error report is the hardest and the most important. Subtle errors are the ones you’ll miss in a hurry, which means they’re the ones students will absorb and carry forward.
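The structure of the three reports can be sketched the same way. The example below is an assumption-laden illustration: `IdentifiedError`, `EvaluationReport`, and `check_reports` are hypothetical names, and the `is_subtle` flag simply records your own judgment, since no script can decide whether an error is subtle to a non-specialist.

```python
from dataclasses import dataclass

# The three error categories named in this task.
ERROR_TYPES = {"factual", "bias", "overconfidence"}


@dataclass
class IdentifiedError:
    error_type: str            # one of ERROR_TYPES
    description: str           # what the AI got wrong
    learning_consequence: str  # the misconception a student would form
    is_subtle: bool = False    # the teacher's own judgment, not computed


@dataclass
class EvaluationReport:
    material: str                  # which AI-generated material was evaluated
    errors: list[IdentifiedError]
    corrected_version: str         # short: what right looks like


def check_reports(reports: list[EvaluationReport]) -> list[str]:
    """Return structural problems with a set of reports; empty means none found."""
    problems = []
    if len(reports) != 3:
        problems.append(f"expected 3 reports, found {len(reports)}")
    for number, report in enumerate(reports, start=1):
        if not report.errors:
            problems.append(f"report {number}: no error identified")
        for error in report.errors:
            if error.error_type not in ERROR_TYPES:
                problems.append(f"report {number}: unknown error type '{error.error_type}'")
            if not error.learning_consequence.strip():
                problems.append(f"report {number}: learning consequence not traced")
        if not report.corrected_version.strip():
            problems.append(f"report {number}: corrected version missing")
    # Across the whole set, at least one error must be marked subtle.
    if not any(e.is_subtle for r in reports for e in r.errors):
        problems.append("no report addresses a subtle error")
    return problems
```

The final check looks across all reports rather than inside each one, mirroring the requirement that the subtle error appears at least once in the set, not in every report.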

Evaluation criteria: D2.2 Extension Evaluation Capability

03 Customised Extension Artefact

Prompt template / system prompt / task chain

One thing you build with AI, not just ask from AI — something another teacher could pick up and reuse in your subject and get a reliable result. Your artefact must include:

Pedagogical rationale — what this artefact is for in a student’s learning journey, not what it does mechanically.
Target HAA type — A, B, or C, and why. A B-type artefact that’s accidentally a C-type is a problem worth knowing about.
Boundary conditions — situations where this artefact should not be used.
At least one tested failure mode. Actually use your artefact until it breaks. Document how it broke. A customised artefact without a documented failure mode is a Level 2 artefact no matter how polished it looks — we cannot trust an extension we haven’t seen fail yet.
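The four required components, and the rule that an untested artefact stays at Level 2, can be sketched as a short check. The names `Artefact`, `missing_components`, and `level_ceiling` are hypothetical, and treating the rule as a ceiling on a 1 to 4 scale is one reading of the sentence above, not an official scoring procedure.

```python
from dataclasses import dataclass, field

HAA_TYPES = {"A", "B", "C"}  # the extension types from the Day 1 taxonomy


@dataclass
class Artefact:
    pedagogical_rationale: str   # what it is for in the student's learning journey
    target_haa_type: str         # "A", "B", or "C"
    boundary_conditions: list[str] = field(default_factory=list)   # when not to use it
    tested_failure_modes: list[str] = field(default_factory=list)  # how it actually broke


def missing_components(artefact: Artefact) -> list[str]:
    """List the required components that are absent from the artefact."""
    missing = []
    if not artefact.pedagogical_rationale.strip():
        missing.append("pedagogical rationale")
    if artefact.target_haa_type not in HAA_TYPES:
        missing.append("target HAA type (A, B, or C)")
    if not artefact.boundary_conditions:
        missing.append("boundary conditions")
    if not artefact.tested_failure_modes:
        missing.append("tested failure mode")
    return missing


def level_ceiling(artefact: Artefact) -> int:
    """Assumed reading: without a documented failure mode the artefact cannot exceed Level 2."""
    return 4 if artefact.tested_failure_modes else 2
```

An artefact with every field filled in still returns a ceiling of 2 from `level_ceiling` until at least one failure mode is recorded, however polished the rest looks.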

Evaluation criteria: D2.3 Extension Customisation Capability

04 Tool Selection Rationale

Head-to-head comparison

Compare two AI tools head-to-head on one lesson activity in your unit. You must test both, not read reviews. Recommend one for that specific activity and state why.

Use your A/B/C taxonomy from Day 1 to structure the comparison: are the two tools offering the same extension type with different quality, or offering different extension types entirely? That reframing matters — a lot of tool-selection decisions go sideways because teachers compare tools as if they were all A-type, when the real difference is that one scaffolds and one amplifies.

Evaluation criteria: D2.3 Extension Customisation Capability

3 SUB-COMPETENCIES · Evaluation criteria for Day 2

D2.1 Extension Design Capability · Core for A-type

Primary evidence: Phase 2 · Prompt Engineering Log (≥8 iterations)
Key question: Does each revision annotate what pedagogical function failed and why?

 1 — Nascent
Uses AI through trial-and-error with vague prompts; accepts first outputs without iteration. Cannot explain rationale for prompt choices. Prompt design disconnected from pedagogical objectives.
 2 — Developing
Applies basic prompt elements (role, task, format) but inconsistently. Some iteration but without explicit rationale. Prompt design aligns with pedagogical goals only occasionally, and then by accident. Does not adapt across extension types.
 3 — Proficient
Systematically designs prompts that precisely externalise pedagogical intent; each decision traceable to a learning objective. Deliberate, annotated refinement. Adapts strategy across A-, B-, and C-type modes; treats prompt engineering as cognitive externalisation.
Anchor: Every revision includes explicit rationale; final prompt annotated to a learning objective.
 4 — Advanced
Develops reusable prompt templates colleagues can adapt; teaches prompt design as a pedagogical skill. Diagnoses failures by tracing them to ambiguities in pedagogical intent. Creates subject-specific prompt design guides grounded in learning science.

p2t1 · Prompt Engineering Log · LOG-A

≥8 annotated iterations. For each: state the pedagogical intent, full prompt, AI output summary, evaluation against the intent, and explicit revision rationale before the next attempt.

Pedagogical intent stated before the prompt, not retrofitted after output
Revision rationale names a pedagogical failure, not a format failure
Arc shows genuine learning — later iterations are qualitatively different
At Level 3+, references your specific class’s needs in the evaluation

D2.2 Extension Evaluation Capability · Universal, esp. C-type

Primary evidence: Phase 2 · Three Output Evaluation Reports
Key question: Are identified errors subject-specific, and are their learning consequences named?

 1 — Nascent
Accepts AI outputs with minimal evaluation; uses fluency or length as proxy for quality. Cannot identify hallucinations, bias, or culturally loaded assumptions without prompting.
 2 — Developing
Identifies obvious factual errors and common bias types when prompted. Inconsistently evaluates whether output extends teaching judgment vs. provides convenience. Recognises echo-chamber risk conceptually but cannot identify it in own interactions.
 3 — Proficient
Systematically evaluates outputs for accuracy, bias, overconfidence, and educational appropriateness. Reliably distinguishes genuine cognitive extension from echo-chamber confirmation. Maintains consistent standards across domains; documents rationale.
Anchor: ≥1 subtle error identified; all errors connected to likely student misconceptions.
 4 — Advanced
Develops subject-specific evaluation frameworks. Identifies subtle bias forms (framing effects, cultural assumptions, representation gaps) and traces them to training data. Uses findings to improve prompt design and train students.

p2t2 · Three Output Evaluation Reports · LOG-D

Identify ≥1 hallucination, bias, or overconfident claim per AI-generated material. Explain why the error matters in this subject; write a corrected version. At least one evaluation must address a subtle error.

Errors identified with subject-specific impact analysis
The “subtle error” is genuinely subtle — not obvious to a non-specialist
Corrected versions show the teacher knows the content well enough to fix it

D2.3 Extension Customisation Capability · Core for B/C-type

Primary evidence: Phase 2 · Customised Extension Artefact
Key question: Is the artefact grounded in this class’s ZPD, or is it a generic template?

 1 — Nascent
Uses AI tools in default configurations; makes no adjustments for teaching context. Cannot distinguish between tool defaults and configurable parameters.
 2 — Developing
Makes basic customisations (adjusting tone, adding context) in an ad-hoc way. Not systematically linked to pedagogical goals; limited to modifying existing templates.
 3 — Proficient
Configures AI tools systematically for specific teaching contexts; creates reusable templates grounded in learning objectives. Articulates why each customisation serves a pedagogical goal. Customisations demonstrably improve alignment with intended outcomes.
Anchor: Artefact references class’s prior knowledge; includes tested and documented failure mode.
 4 — Advanced
Designs multi-step AI workflows for complex pedagogical tasks. Evaluates trade-offs between customisation depth and generalisability; shares frameworks with colleagues. Contributes to institutional standards.

p2t3 · Customised Extension Artefact

Prompt template, system prompt, or task chain for a B- or C-type activity. Include pedagogical rationale, target HAA type, boundary conditions, and ≥1 tested failure mode.

Designed for a specific class, not “Year 10 Science” in general
Boundary conditions are operationally concrete
The tested failure mode is documented with actual AI output

p2t4 · Tool Selection Rationale

Compare two AI tools for one unit activity using the A/B/C taxonomy. Evidence of direct testing of both tools required.

Both tools were actually tested (not reviewed from documentation alone)
Comparison uses A/B/C taxonomy, not feature lists
Selection justified by fit to the specific activity and class
