The real AI landscape — beyond Claude and ChatGPT
Most professionals have used a handful of AI tools without understanding what distinguishes them at a mechanism level. This week maps the full AI ecosystem — models, copilots, agents, and the stack beneath them — and builds the evaluation discipline to make defensible recommendations in your professional context.
Pre-reading — Co-Intelligence ch. 1–3
Your reading for this week
"What assumptions do you currently hold about what AI can and cannot do? Where do you think colleagues are most getting AI wrong?"
Answer now, before opening the book. Aim for 150–300 words. Be specific — name actual tools, actual tasks, actual mistakes you have observed. Your answer is stored and you will return to it in Activity 3.5.
"How does Mollick's framing of AI as a 'co-intelligence' challenge or confirm your prior assumptions? What is one specific thing you will do differently this week based on chapters 1–3?"
After completing chapters 1–3, return here and answer this. Not assessed — a thinking prompt to consolidate what you have read before instruction begins.
Your mental model — before instruction begins
The three-level AI tool taxonomy
Most professionals speak about AI tools by brand name — ChatGPT, Copilot, Gemini, Claude. The brand is not the meaningful distinction. What matters is the mechanism beneath the brand: how the tool works, what data it can access, and what structural dependencies it carries. This activity gives you a consistent three-level framework for making those distinctions in your own work and in conversations with colleagues who have not yet made them.
Level 1 — Model mechanism: what the model is built to do
At the foundation, AI tools differ in what their underlying model is optimised for. Four mechanism types are in common professional use.
Text-generation models
Text-generation models are the most familiar. A language model accepts prompts, roles, and context and produces text or structured text-based output. OpenAI describes prompting as the way one "programs" such a model. The primary professional uses are drafting, summarising, classifying, and explanatory writing.
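For readers who want to see the mechanism rather than the brand, here is a minimal sketch of what "programming" a text-generation model with prompts and roles looks like at the API level. It uses OpenAI's Python client; the model name, the system instruction, and the `notes` variable are illustrative placeholders, not a recommendation.

```python
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

notes = "(paste or load the source material here)"  # placeholder input

# The model is "programmed" by the roles and context it receives:
# the system message sets behaviour, the user message carries the task and the material.
response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name; substitute whatever your organisation has approved
    messages=[
        {"role": "system", "content": "You are a concise assistant drafting executive summaries."},
        {"role": "user", "content": "Summarise the following notes in three bullet points:\n\n" + notes},
    ],
)
print(response.choices[0].message.content)
```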
Reasoning models
Reasoning models are a distinct class, optimised for complex problems that require multiple steps of inference. OpenAI explicitly separates reasoning models from standard chat models and notes that they respond differently to prompting. They often require less instruction and more patience, as the model works through a problem before responding. Professional uses include complex analysis, multi-step planning, and demanding logical reasoning tasks.
Multimodal models
Multimodal models process inputs beyond text, including images, audio, documents, and in some cases video, within a single interaction. The practical professional relevance is primarily for tasks involving visual materials such as charts, photographs, or scanned documents.
Tool-using and agentic models
Tool-using and agentic models are capable of calling external tools such as web search, code execution, APIs, and file systems, and chaining those actions across multiple steps without a human directing each one. These are the models closest to what is commonly described as "AI agents."
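The distinguishing feature is the loop: the model chooses an action, an external tool executes it, and the result feeds the next decision. The sketch below is a deliberately minimal, vendor-neutral illustration of that loop, assuming stand-in functions; `call_model` and the single stubbed tool are placeholders for a real provider API and real tool integrations.

```python
# Minimal sketch of the agentic loop: the model plans, a tool executes, the result feeds the next step.
# call_model and TOOLS are stand-ins, not any specific provider's API.

def call_model(conversation: list[str]) -> dict:
    """Stand-in for a provider call; a real agent sends the conversation to a model API."""
    return {"tool": "web_search", "input": conversation[0], "done": True}

TOOLS = {
    "web_search": lambda query: f"(search results for: {query})",  # placeholder tool implementation
}

def run_agent(task: str, max_steps: int = 5) -> list[str]:
    conversation = [task]
    for _ in range(max_steps):
        step = call_model(conversation)              # the model decides which tool to call next
        result = TOOLS[step["tool"]](step["input"])  # the tool runs outside the model
        conversation.append(result)                  # the result becomes context for the next step
        if step["done"]:                             # the model, not a human, decides when to stop
            break
    return conversation

print(run_agent("Find the current guidance on accountability obligations."))
```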
Level 2 — Application form: how the model is delivered to you
The same underlying model mechanism can be delivered in fundamentally different ways. The delivery form changes what the tool can and cannot do in a professional context.
General chat models
General chat models are accessed through a public interface such as Claude.ai, ChatGPT, or Gemini, with no connection to your organisation's data, systems, or permissions. The model knows only what you put into the conversation. General chat is highly capable for tasks that can be fully brought to it: drafting from your own knowledge, working through a problem you describe, or generating options from a brief you write.
Enterprise copilots
Enterprise copilots are the same or similar model mechanisms deployed inside your organisation's permission boundary. Microsoft 365 Copilot is the clearest current example. It operates inside the Microsoft 365 service boundary, data access is scoped to the signed-in user's permissions, and the organisation's security, compliance, and privacy policies continue to apply. An enterprise copilot can synthesise a set of documents you have access to without you manually pasting each one.
Level 3 — Provider architecture: the stack beneath the tool
Beneath the model and the application form sits a structural question that most professionals do not consider until it creates a problem. Which provider stack does this tool sit on, and what does that mean for your organisation's data, vendor relationships, and governance obligations?
Stratechery's analysis of the AI platform landscape identifies three structural positions that major providers occupy.
Integrated stacks (Google)
Integrated stacks are provider architectures in which the AI model, the cloud infrastructure, and the enterprise applications are all first-party. When an organisation uses Gemini inside Google Workspace, the model, the data storage, and the application layer sit within a single provider relationship. The advantage is coherence. The dependency is vendor lock-in.
Middle-layer stacks (Microsoft)
Middle-layer stacks sit between an integrated and a modular position. Microsoft builds its own enterprise applications and cloud infrastructure but relies on a close partnership with OpenAI for its frontier model capability. The enterprise application and cloud relationship is with Microsoft; the model relationship runs through OpenAI. Governance and data-residency obligations sit primarily with Microsoft, but the model capability is not exclusively Microsoft's.
Modular and marketplace stacks (AWS Bedrock)
Modular and marketplace stacks are provider architectures that offer model choice rather than a single proprietary model. An organisation using AWS Bedrock can access models from Anthropic, Meta, Mistral, and others through a single API, with cloud infrastructure managed by AWS. The advantage is model flexibility and reduced lock-in at the model layer. The trade-off is higher architectural complexity.
How to use this taxonomy
This is an analytic framework, not a fixed product catalogue. Individual tools sit across these categories in ways that blur the lines: a reasoning model can be delivered as an enterprise copilot; an integrated stack can offer a modular API. The taxonomy is a thinking tool. Its value is that it forces mechanism-level reasoning rather than brand-level association.
The most important discipline in applying the taxonomy is always asking the second question: where does this tool fall short? Every category has a structural limit, not just a capability gap. Knowing the limit is what makes the taxonomy professionally useful, and it is what the cases in Unit 2 are designed to test.
Taxonomy knowledge check
These questions test your understanding of the three-level taxonomy from Activity 1.3 — model mechanism, application form, and provider architecture. The questions focus on mechanisms and limits, not brand names. There is no score. Read the explanation after each question before moving on.
How to evaluate AI tools — methodology before judgment
The most common mistake professionals make when evaluating AI tools is treating a single impressive output as evidence of stable capability. It is not. AI model output is non-deterministic: the same prompt, run twice, can produce meaningfully different results. Model behaviour changes between versions. Capability varies across task types in ways that cannot be predicted from brand prestige or general reputation. What looks like a reliable tool in a demonstration may fail on your specific task under your specific conditions.
This has a direct methodological implication. Evaluation must be disciplined, task-specific, and iterative. The three sections below describe what that means in practice.
Discipline 1 — Define your success criteria before you test
The most common evaluation failure is running a test without having stated in advance what a good result looks like. Without a prior definition, you will unconsciously judge outputs against whichever tool impressed you first, or whichever output confirms your prior assumptions.
OpenAI's evaluation guidance is direct on this point: define specific and measurable success criteria. What does a good output look like for this task? What are the minimum quality thresholds? What would a mediocre output look like, and how would you distinguish it from a good one?
For professional tasks, success criteria typically have at least three components. Accuracy: is the output factually correct and grounded in the inputs? Completeness: does it cover what the task requires? Usefulness: would you actually use this output, or would you need to substantially rewrite it before it was fit for purpose?
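One way to make those criteria concrete before testing is to write them down as a rubric that every tool output will be graded against. A minimal sketch follows, assuming a 1–5 human scoring scale; the wording of each criterion and the pass threshold are assumptions you would replace with your own.

```python
# Illustrative rubric, written before any tool is tested (criteria wording and threshold are examples).
RUBRIC = {
    "accuracy": "Every claim is supported by the source documents; no invented figures or citations.",
    "completeness": "All source documents are represented; no recurring theme is missing.",
    "usefulness": "The output could be used with light editing only, not a substantial rewrite.",
}
PASS_THRESHOLD = 4  # minimum score per criterion on a 1-5 scale

def passes(scores: dict[str, int]) -> bool:
    """A tool output passes only if a human reviewer scores every criterion at or above the threshold."""
    return all(scores.get(criterion, 0) >= PASS_THRESHOLD for criterion in RUBRIC)
```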
Discipline 2 — Hold conditions constant across tools
Comparing two tools without holding the test conditions constant is not a comparison. It is an impression. To produce a meaningful comparison, the same task, the same inputs, and the same evaluation rubric must apply to every tool you are testing.
In practice, this means writing the prompt once, running it in each tool, and grading the outputs against the same criteria. Do not adjust the prompt for each tool based on what you think that tool responds well to. If you adjust the prompt, you are testing your prompting skill, not the tool's capability.
For tasks involving documents, grounding conditions must also be constant. A general chat model that receives a pasted excerpt is not comparable to an enterprise copilot that accesses the full document set through permissions. If the grounding conditions differ, you are comparing application forms rather than model mechanisms. That is a different and also useful question, but it must be labelled correctly.
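As a concrete illustration of holding conditions constant, the sketch below runs one prompt, unchanged, through each tool under test and collects the outputs for grading against the same rubric. The runner functions are placeholders; in practice each would call the tool you are evaluating, with identical grounding.

```python
# Illustrative comparison harness: one prompt, identical for every tool, graded later with one rubric.

PROMPT = "Summarise the following stakeholder notes into five key themes:\n\n{source_material}"

def run_general_chat(prompt: str) -> str:
    return "(output from the general chat model)"      # placeholder runner

def run_enterprise_copilot(prompt: str) -> str:
    return "(output from the enterprise copilot)"      # placeholder runner

TOOLS_UNDER_TEST = {
    "general_chat": run_general_chat,
    "enterprise_copilot": run_enterprise_copilot,
}

def run_comparison(source_material: str) -> dict[str, str]:
    """Same task, same inputs, same prompt; only the tool changes."""
    prompt = PROMPT.format(source_material=source_material)
    return {name: run(prompt) for name, run in TOOLS_UNDER_TEST.items()}
```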
Discipline 3 — Treat evaluation as iterative, not a one-time verdict
A tool evaluation is not a product review. The right answer today may be wrong in six months when the model is updated, when your organisation's data governance changes, or when the task type shifts. Both OpenAI and Anthropic explicitly frame evaluation in their guidance as an ongoing practice rather than a one-time verdict.
The practical implication for your personal tool map is this: each entry should note what you last tested and what would cause you to revise the recommendation. The entry is not permanent. If the constraint that drove your recommendation changes, the recommendation should be revisited.
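A tool map entry can be as simple as a structured record. The sketch below shows one possible shape; the field names and the example values are illustrative, not a prescribed template.

```python
from dataclasses import dataclass
from datetime import date

# One possible shape for a tool map entry (field names and example values are illustrative).
@dataclass
class ToolMapEntry:
    task: str                # the specific task, not a task category
    context: str             # the organisational constraints that apply
    tool_category: str       # e.g. "enterprise copilot"
    known_failure_mode: str  # where this choice falls short
    governance_step: str     # the review step before the output is used
    last_tested: date        # evaluation is iterative, not a one-time verdict
    revisit_when: str        # the change that would force a re-evaluation

entry = ToolMapEntry(
    task="Synthesise quarterly interview notes into an executive brief",
    context="Internal documents, standard permissions, no regulated personal data",
    tool_category="Enterprise copilot",
    known_failure_mode="Misses themes in documents outside the user's permission scope",
    governance_step="Named reviewer signs off before the brief reaches the executive",
    last_tested=date(2025, 1, 15),
    revisit_when="Model version update or a change to data-classification policy",
)
```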
What the AI providers say about evaluating their own models
Anthropic's documentation on evaluation recommends constructing task-specific evaluations that mirror real-world task distributions, not generic prompts designed to show the model at its best. This includes building automated grading where the task output can be scored, and combining automated metrics with human judgment where quality is interpretive.
OpenAI's evaluation guidance aligns on the same core point: the evaluation must be grounded in your specific task. Generic benchmarks tell you how a model performs on standardised tests. They do not tell you how the model will perform on synthesising your board papers under your compliance constraints.
The convergence between Anthropic and OpenAI's evaluation guidance is itself worth noting. The methodological discipline they describe is not marketing. It is the acknowledgment that their own models are not uniformly capable, and that the responsibility for discovering where a tool fails on your specific task sits with you.
ANZ — compliance-adjacent synthesis
The situation
ANZ has publicly distinguished between work where AI is appropriate, such as synthesis and interpretation, and work where it is not: compliance output where the tolerance for error is zero. That distinction is not a preference — it is a governance position.
A senior analyst at ANZ needs to synthesise 12 board papers on customer remediation outcomes into a single executive summary. The papers were produced over 18 months across three remediation streams. The synthesis is not itself a compliance document — but it feeds directly into a briefing used by executives making decisions about remediation resourcing and regulatory reporting. If the summary misrepresents the pattern across the 12 papers, the downstream decisions are wrong.
The analyst has access to three tool options: a general chat model (Claude or ChatGPT accessed through a browser); M365 Copilot, deployed with standard permission settings — it can access documents the analyst has permissions to read; and a reasoning model variant available through the same general chat interface.
What you need to work out
On application form
The task involves 12 documents. Can you paste 12 board papers into a general chat session? What is the practical limit — and what happens to synthesis quality when you can only work with excerpts rather than the full set? The enterprise copilot can access the documents directly through permissions. What does that change about the output?
On model mechanism
Is this a task that benefits from a reasoning model, or is it primarily a synthesis and extraction task that a text-generation model handles adequately? The papers are long but the task is not primarily logical deduction — it is pattern recognition and summarisation across a structured document set.
On governance
ANZ's public position says AI should not produce zero-error compliance output without human review. This task is adjacent to compliance, not directly in it. Where does the human review step sit? Who sees the AI-assisted summary before it reaches the executive?
"Which tool category would you recommend for this task, and what single governance step would you build in before the output reaches the executive?"
Write your answer before moving to the next activity. Your response is stored and displayed in Discussion 2.5.
Medibank — health data, most constrained environment
The situation
Medibank's operating environment changed permanently in October 2022. The breach — in which data on approximately 9.7 million current and former customers was accessed and subsequently released publicly — resulted in OAIC civil penalty action, APRA FAR scrutiny, and sustained regulatory focus on how the company handles sensitive customer data. This is a live constraint, not a theoretical one.
Health data is the highest sensitivity tier in Australian privacy law. The specific data categories Medibank holds — diagnoses, treatment histories, claims records, care pathways — are personally sensitive in ways that are irreversible if mishandled.
A product manager at Medibank needs to brief the executive team on an AI-assisted sentiment analysis across 40,000 post-claim health survey responses. The surveys capture how customers felt about their claims experience — wait times, communication quality, outcome satisfaction. The data is health-adjacent: it does not contain diagnoses or treatment records, but it is linked to health events and is held under the same privacy obligations as primary health data.
The product manager has three options: a general chat model (accessed through a browser, no organisational data boundary); M365 Copilot, which Medibank has in early deployment — partial rollout, not yet fully configured for this data category; or neither, until appropriate controls are in place.
What you need to work out
On data classification
The survey text does not contain diagnoses. Does that change the privacy classification? Under Australian privacy law, data linked to a health event is treated as health information regardless of whether it contains clinical content.
On the general chat model
A general chat model accessed through a browser has no organisational data boundary. Inputting 40,000 survey responses — or even a representative sample — means transmitting health-adjacent data to a provider infrastructure outside Medibank's control. What is the legal position under the Privacy Act and Medibank's current regulatory obligations?
On the enterprise copilot
M365 Copilot is in partial rollout. It has not yet been configured for this data category. Does "partially deployed" mean it is safe to use for this task, or does it mean its data governance controls have not been validated for health-adjacent data?
On centaur logic
If an AI system does process these survey responses, where must the human judgment remain? Who reviews the output, and what specifically are they reviewing?
"Advise the product manager. Three options are available: general chat model, enterprise copilot (partially deployed), or neither until appropriate controls are in place. Which option do you recommend, and on what grounds?"
Connect your answer to the specific privacy constraint, not just the general principle. Write your answer before moving to the next activity.
Accenture — enterprise advisory under commercial sensitivity
The situation
You are a manager in Accenture's financial services practice, based in Sydney. Your team has been engaged by a Singapore-headquartered private bank to advise on an AI readiness assessment across their retail and wealth management operations. The client operates in Singapore under MAS oversight, with affiliated entities in Hong Kong under HKMA supervision, and a subsidiary under APRA's prudential framework in Australia.
Your team spent three weeks in discovery: structured interviews with the client's heads of technology, operations, risk, compliance, and three business-unit leaders. You now have 18 sets of detailed interview notes — a mix of direct quotes, paraphrased positions, and field observations — totalling roughly 40,000 words. The engagement partner needs a 1,500-word client-ready synthesis identifying key themes, internal disagreements, and critical dependencies that will shape the AI roadmap.
Three constraints shape what you can and cannot do.
Client data confidentiality. The interview notes contain the client's internal strategic positions, regulatory concerns, and candid assessments of their organisation's readiness. This is confidential client information. Accenture's professional obligations and the engagement letter restrict what can be transmitted outside the engagement boundary.
Cross-border delivery complexity. Your team is in Sydney. The client's data governance is anchored in Singapore under MAS. The synthesis document will be reviewed by the engagement partner in Singapore before going to the client. MAS has specific expectations about where financial institution data is processed and stored.
Stack architecture. Accenture's internal environment gives your team access to M365 Copilot on the standard Accenture tenant. The client's environment runs on Google Workspace. There is no data connection between the two, and the client has not granted your team access to their Google environment.
What you need to work out
On confidentiality and general chat models
A general chat model accessed through a browser would transmit the interview notes — and the client's confidential strategic positions — to a provider infrastructure outside the engagement boundary. Is there a way to use a general chat model that does not carry this problem? What would that look like for a 40,000-word synthesis task?
On the enterprise copilot
Accenture's M365 Copilot deployment operates within Accenture's Microsoft tenant. If the interview notes are stored there, Copilot can process them within the organisational data boundary. Does that satisfy the confidentiality constraint? Does it resolve the MAS data sovereignty question, given that Accenture's Microsoft tenant is hosted across multiple Azure regions? This is a question you would need to verify with your engagement risk team, not assume.
On the no-data-input option
One approach is to use a general chat model for structure and prose, without inputting the interview notes directly — the analyst reads the notes and inputs a synthesised description of themes. What does this cost in terms of analytical depth? Is that acceptable given the stakes of the synthesis?
On stack architecture
The client runs on Google. You run on Microsoft. There is no shared AI environment. The stack decision is yours alone.
"Write a recommendation of three to five sentences specifying: (1) which tool category you would use for the synthesis; (2) what specific constraint drove that choice — not a general principle, but the specific constraint in this engagement; (3) what governance step you would take before the synthesis document leaves your team."
More than one combination of answers is defensible. Write your answer before moving to the next activity.
Judgment discussion — across all three cases
Co-intelligence — centaur, cyborg, and the governance implications
The way organisations talk about AI tends toward a binary framing: AI will do this task, or humans will. That framing is both wrong and unhelpful. What Ethan Mollick calls "co-intelligence" describes a more accurate picture of how capable AI is actually being used in professional settings: humans and AI working together, with the quality of that collaboration determined by how deliberately the division of labour is designed.
This activity covers the two work modes Mollick identifies, the conditions that determine which mode applies, and four governance implications that arise when AI is deployed in an organisational rather than a purely individual context.
Centaur work: when the division of labour is explicit
In centaur work, the division between human and AI is designed in advance and held explicitly. The human decides what scope of task goes to the AI, the AI executes within that bounded scope, and the human reviews the output before it is used or shared. The term comes from the mythological figure — half human, half horse — a literal division of two different kinds of capability.
The clearest professional examples of centaur work are in high-governance contexts, where the division of labour is not a stylistic choice but a structural requirement. In a compliance-adjacent synthesis task, the AI might read and summarise a document set; the human must review and sign off before the output reaches the executive. In a legal drafting context, the AI might produce a first draft; a lawyer must review it before it constitutes advice. In these contexts, centaur mode is non-optional because the accountability structure does not change simply because AI was used to produce the output.
Cyborg work: when collaboration is interwoven
In cyborg work, the task boundary is not fixed in advance. Human and AI iterate together in real time: the human steers, the AI responds, and each exchange shapes the next. The labour is genuinely interwoven rather than divided into separate phases.
Mollick's essay "I, Cyborg" frames this mode as the natural one for many knowledge tasks that do not carry the compliance, accountability, or high-stakes failure conditions that make centaur logic mandatory. A senior executive drafting a board communication, a consultant developing a framework for a client, a researcher synthesising literature: in each of these cases, the AI is a continuous collaborator rather than a worker handed a defined scope.
When the environment determines the mode, not the task
This is the question that Activity 3.2 is designed to surface: does the same task produce the same work mode recommendation when you move it across four different organisational contexts? The answer should be no, and understanding why is the point of the capstone.
Verification: why fluent output is not reliable output
Both centaur and cyborg work carry a verification obligation that is routinely underestimated. AI models produce outputs with consistent fluency and apparent confidence regardless of whether the content is accurate. There is no reliable signal within the output itself that distinguishes a correct answer from a plausible-sounding incorrect one.
The practical governance implication is that verification cannot be reserved for cases where the output seems wrong. It must be built into the workflow as a structural step. OpenAI's evaluation guidance notes that human judgment is needed even for tasks with automated scoring, because metrics do not capture everything that matters. Mollick makes the same point: the skill of using AI well includes knowing which outputs to verify and how to verify them.
Data access governance: a first-class design issue
When an enterprise copilot is deployed, it operates on data that the signed-in user has permission to access. This sounds like a technical safeguard, and it is. But it is also a governance design question that must be addressed before deployment, not discovered after.
Microsoft's governance documentation for Microsoft 365 Copilot makes the stakes explicit. It recommends data loss prevention controls, sensitivity labels, prompt and response auditing, data security posture management reviews, and retention and deletion choices aligned to regulatory compliance. This is not a list of optional features. It is recognition that an AI system able to access everything a user can access will also surface information that was technically accessible but practically siloed.
Agent security: a distinct governance layer
Tool-using and agentic AI systems carry a governance burden that general chat models and enterprise copilots do not. Anthropic's tool-use documentation describes the technical reality: some tools execute in the client environment, others on provider infrastructure. The surface area for error and misuse is larger than in a standard chat session.
The practical governance implication is not that agentic tools should be avoided, but that they should be deployed with three disciplines in place. First, least-privilege design: the agent should have access only to what it needs for the specific task. Second, restricted tool scope: limit what external services the agent can call. Third, content review: any pathway where the agent processes external content should be reviewed for injection risk. Agentic systems that can read from and write to external systems are security-sensitive applications, not chat assistants with extra features.
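Those three disciplines can be expressed as a deployment policy rather than left as intentions. The sketch below is illustrative only: the field names are not from any vendor's SDK, and the allowlisted tools and data scope are placeholder values.

```python
# Illustrative agent deployment policy expressing the three disciplines above
# (field names and values are placeholders, not any vendor's configuration schema).
AGENT_POLICY = {
    "allowed_tools": ["document_search", "summarise"],  # restricted tool scope: an explicit allowlist
    "data_scope": ["projects/ai-readiness/notes/"],     # least privilege: only the task's own folder
    "review_external_content": True,                    # screen fetched content for injection risk
    "write_access": False,                              # read-only until a human approves writes
}

def tool_call_permitted(tool_name: str, policy: dict = AGENT_POLICY) -> bool:
    """Reject any tool call that falls outside the allowlist."""
    return tool_name in policy["allowed_tools"]
```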
Ecosystem architecture: tool choice as a dependency decision
The final organisational implication of co-intelligence framing returns to the stack architecture introduced in Activity 1.3. When an organisation chooses which AI tool to deploy at scale, it is not only choosing a product. It is making a dependency decision that will constrain future choices.
Stratechery's analysis of the AI platform landscape identifies the procurement implication directly. Lock-in, data gravity, and provider dependency are real variables. An organisation that standardises on Google's integrated stack has made a set of choices about where its data lives, which model serves its applications, and which vendor it renegotiates with when pricing or capability changes.
Mollick's framing is sometimes read as motivational. That is a misreading. Co-intelligence is an operating model. Once understood as such, work design, evaluation, verification, data access, security, and procurement all fall within the same governance frame. Decisions about how humans and AI divide labour have organisational consequences that persist beyond the individual interaction. Activity 3.2 is designed to make those consequences concrete.
Capstone — same task, four organisational contexts
The task — identical across all four contexts
You are asked to synthesise 20 stakeholder interview notes into a 500-word executive brief. The notes are from structured interviews covering AI readiness and strategic priorities. The brief goes directly to the executive team.
The task is the same in all four cases. The only thing that changes is the organisational context.
Context 1 — ANZ
You are a senior analyst at ANZ. The interview notes are from internal stakeholders and are stored in SharePoint under standard ANZ document permissions. The brief goes to the Chief Risk Officer and two executive committee members as part of a quarterly AI governance review.
ANZ operates under APRA FAR and has publicly stated that AI should not produce zero-error compliance output without human review. The AI governance review directly informs compliance posture decisions.
Context 2 — Medibank
You are a senior strategist at Medibank. The interview notes are from internal clinical and operational leaders. Several interviewees referenced specific patient cohorts and operational incidents involving health-sensitive context. The notes are stored in the document management system.
Medibank operates under APRA FAR and OAIC oversight, is under active regulatory scrutiny following the 2022 breach, and has not completed its M365 Copilot rollout. The brief goes to the Chief Strategy Officer and is expected to feed into a board paper.
Context 3 — Accenture
You are a manager in Accenture's financial services practice. The interview notes are from a client engagement — a Singapore private bank — and contain the client's confidential strategic positions. You have access to M365 Copilot within Accenture's tenant. The brief is client-deliverable: it goes to the engagement partner first, then to the client's CEO.
The client operates under MAS. Accenture's engagement confidentiality obligations apply. The data sovereignty question has not yet been cleared with the engagement risk team.
Context 4 — XpertiseNow
XpertiseNow is a consulting marketplace and expert-network platform where organisations connect with independent consultants for project-based engagements. It is run by a lean team and does not operate under APRA, MAS, OAIC, or comparable prudential oversight. Its revenue model depends on the quality of its expert matching and the trust of both clients and consultants.
You are a senior member of the XpertiseNow team. The interview notes are from consultants and clients who participated in a research project on how AI is affecting the market for expert advice. Participants consented to their views being used for internal strategy purposes. The notes contain commercially sensitive assessments — views on competitors, pricing tolerance, emerging substitution concerns — but no regulated personal data. XpertiseNow's technology environment is lightweight: Google Workspace, no enterprise AI deployment.
Your deliverable
Complete the table below. Your answers are stored and shared with the tutor before Discussion 3.4 begins. Be specific in each cell — not just "enterprise copilot" but the specific tool and why it fits this context's constraints.
| Context | Tool category | Governance step | Work mode |
|---|---|---|---|
| ANZ | |||
| Medibank | |||
| Accenture | |||
| XpertiseNow |
"In which context were you least confident? Name the specific thing you do not yet know — about the tool, the constraint, or the governance — that would change your answer if you knew it."
Personal tool map
Your tool map is not a list of favourite products. It is a tested decision artifact: a record of which tool category you would use for a specific task, in a specific context, and why — along with the known failure mode and the governance step you would build in. You have worked through three cases and a capstone. Now you apply the same logic to your own professional context.

Your answers are stored and shared with the tutor, who reads them before Discussion 3.4 opens. There are no correct answers. There are complete answers and incomplete ones. A complete answer connects the task to the constraint, the constraint to the tool choice, and the tool choice to the governance step.
Tool map discussion
What changed — and what remains uncertain
This is the last thing you do in Week 1 and the first thing referenced in Week 2. It has two purposes: consolidating what changed this week, and naming what remains uncertain so it becomes a structured input to Week 2 rather than a vague gap.
Your pre-reading answer from Activity 1.1 is displayed below alongside your mental model from Discussion 1.2 — you can see your before and after directly.
Your initial mental model — a benchmark you will return to across the course
Your open items — the things you flagged as untested become Week 2 targets