Drawn from the AI Research Corpus, May 2026
This synthesis pulls together findings from the corpus that bear on AI in education, broadly construed — including direct studies of classroom use, empirical work on cognition and knowledge work with AI, institutional policy documents, and research on AI's reasoning limits insofar as it informs what we should teach learners about these systems. About fifty papers and essays in the corpus were judged relevant; the synthesis foregrounds the strongest empirical work and the most-cited normative arguments, with a comprehensive treatment in the second half.
I. Executive Summary
The core empirical findings
1. Heavy AI use during learning has measurable cognitive costs. Kosmyna and colleagues at MIT Media Lab (Your Brain on ChatGPT, 2025) conducted an EEG study of essay-writing in which 83% of ChatGPT-assisted writers could not produce a single correct quote from the essay they had just written, compared with 11% of search-engine and pen-and-paper writers; neural connectivity scaled inversely with external support, and reported sense of ownership over the writing dropped sharply in the ChatGPT condition. A Microsoft–CMU survey of 319 knowledge workers by Lee, Sarkar, Tankelevitch and colleagues (The Impact of Generative AI on Critical Thinking, CHI 2025) showed self-reported "much less" or "less" cognitive effort across every Bloom-taxonomy activity once AI was used. An MIT–OpenAI longitudinal trial led by Fang and colleagues (How AI and Human Behaviors Shape Psychosocial Effects of Chatbot Use, 2025) found that across all chatbot modalities and conversation types — including non-personal, task-oriented uses — higher daily usage correlated with higher loneliness, greater emotional dependence on the AI, lower socialization, and more problematic use.
2. Confidence calibration is the critical cognitive variable. Across multiple studies, self-confidence is protective and AI-confidence is corrosive. Lee and colleagues found that workers with higher confidence in AI expended less critical-thinking effort, while workers with higher confidence in their own abilities expended more. Schwarcz, Das, Kang and McDonnell (Thinking Like a Lawyer in the Age of Generative AI, 2025) found that lawyers using AI rated it "equally helpful" across six tasks despite the objective benefit varying enormously between tasks (and reaching zero on an NDA-drafting task) — they were unable to detect when AI had actually helped them. Shaw and Nave at Wharton (Thinking — Fast, Slow, and Artificial, 2026), using a modified Cognitive Reflection Test with nearly 1,400 participants, document a behavioral pattern they call cognitive surrender: users consult AI on more than half of trials, accuracy rises 25 points when the AI is right and falls 15 points when it errs, and participants with lower analytic disposition and higher trust in AI are most prone to it.
3. The "leveling" effect is replicated across domains. AI consistently helps weaker performers more than stronger ones, and in some conditions actively hurts the strongest. Bednar, Cleveland, Erbsen and Schwarcz (Artificial Intelligence and Human Legal Reasoning, 2026), in a randomized trial with Minnesota law students, found that AI revision improved performance for students below the median by an average of nearly two points while reducing the highest-scoring students' work by eight points; Choi, Monahan and Schwarcz (cited in several of the legal-education pieces) report drops of up to twenty percentile points among the strongest students once AI was permitted on an exam. The same pattern shows up in the broader knowledge-work literature (Dell'Acqua and colleagues), and it is consistent with Madden's framing (The Dignity of Legal Education vs Artificial Intelligence in Australian Law Schools, 2025) of AI as compressing the upper end of the performance distribution.
4. Sequencing matters more than the binary of "AI yes / AI no." The same Bednar trial that documented the leveling effect on revision found no skill atrophy from prior AI exposure: students who used AI on a synthesis task performed equally well on a closed-book comprehension test and better on a no-AI application task — but only because the AI-assisted synthesis produced stronger notes to work from. The mediating mechanism is scaffolding, not direct skill transfer. This finding partly reconciles the contrast between Kosmyna's "cognitive debt" warning (in essay-writing, where AI produces the artifact) and the legal-education finding (where AI produces interim work the human then re-uses).
5. Adoption is far outrunning pedagogy. OpenAI's joint analysis with Harvard and Duke (Chatterji et al., How People Use ChatGPT, 2025), based on more than a million sampled ChatGPT conversations, finds that 10.2% of all user messages — and 36% of all "Practical Guidance" messages — are explicit requests for tutoring or teaching, making tutoring one of the largest single uses of the world's most-used AI product. Walsh, writing in New York Magazine (Everyone Is Cheating Their Way Through College, 2025), cites a January 2023 survey in which roughly 90% of college students had used ChatGPT for homework within two months of its launch; a UK study that slipped fully AI-generated work past professors through fake student profiles found that 97% of it went undetected. By the time of Shroff's reporting in The Atlantic (Is Schoolwork Optional Now?, 2026), agentic tools are completing whole online courses end-to-end, including watching the lectures and submitting the quizzes.
6. AI is structurally bad at the kinds of reasoning students must learn to detect. A cluster of independent studies establishes that large language models, including the newer "reasoning" variants, fail in characteristic ways that students need to be taught to recognize. Apple's Shojaee, Mirzadeh and colleagues (The Illusion of Thinking, 2025) show that reasoning models reduce their inference effort precisely as problem complexity rises, collapsing to zero accuracy past a threshold and unable to follow an algorithm even when one is supplied. Lewis and Mitchell (Evaluating the Robustness of Analogical Reasoning in LLMs, 2024) show that LLM analogical reasoning drops sharply under permuted alphabets or symbol substitutions that humans handle without difficulty, with strong order-effect biases. Fu and colleagues' AbsenceBench (Language Models Can't Tell What's Missing, 2025) shows that the same models which find a "needle in a haystack" at near-perfect accuracy cannot reliably notice what is missing from a document — a profound deficiency for grading, code review, or any "is anything wrong here?" task. Ding and Li (Generative AI lacks the human creativity to achieve scientific discovery from scratch, Scientific Reports 2025) show in a Nobel-prize-discovery replication that ChatGPT-4 generated fewer than half the hypotheses humans did, never revised them in light of disconfirming evidence, and exhibited "no aha moment." Felin and Holweg (Theory Is All You Need, Strategy Science 2024) argue that this is structural: LLMs do "backward-looking imitation," not forward-looking theorizing — for them, "truth — if it happens to emerge — is a byproduct of statistical patterns."
7. Working effectively with AI is itself a teachable skill, distinct from subject-matter ability. Riedl and Weidmann (Quantifying Human–AI Synergy, 2025), using item-response theory on data from 667 humans paired with two different language models on math, physics, and moral-reasoning problems, find that "joint ability" is a statistically distinct construct from "solo ability," and that the strongest predictor of how well a user gets AI to help them is Theory of Mind — the user's ability to model what the AI knows and how to clarify for it. This trait predicts joint performance but not solo performance, and it varies meaningfully not just between users but within a user across questions.
What the corpus collectively recommends for educators
There is a striking convergence across very different authors — Stazi (Towards a University of Hybrid Intelligence, 2026) from a European policy perspective, Burnett (Will the Humanities Survive Artificial Intelligence?, 2025) from the Princeton humanities, the AI-Ready Lawyer competency framework (2025), the EUI's institutional guidelines (Guidelines for the Responsible Use of AI for Research, 2024), Lemley's IP-final design at Stanford (Driscoll, The AI Curriculum, Stanford Lawyer 2026), Lande's dispute-resolution faculty handbook (How I Learned to Stop Worrying and Love the Bot, 2025), O'Rourke writing on creative writing pedagogy (The Seductions of A.I. for the Writer's Mind, 2025) — on a single prescriptive vision:
- Sequencing. Build unaided baseline skill first; introduce AI second. Stanford's 1L Legal Research and Writing program does its first semester in a "closed universe without AI tools" before the spring semester adds AI to simulate summer-associate work; the EUI urges instructors to redesign assessments around "critical analysis over memorization"; Stazi makes "Slow AI" — AI used to slow thinking down, in a Socratic dialogue, rather than to speed it up — the central pedagogical principle.
- Verification as the new assessed skill. The most-cited specific assignment design is the one that makes evaluating AI output the thing being graded: Mark Lemley asks his Stanford IP students to grade a set of AI-generated answers; the AI-Ready Lawyer framework names "critical oversight" as one of five pillars and insists it must be exercised "simultaneously" with any AI use; Lande recommends requiring students to submit their AI chat transcripts as part of "show your work."
- Abandoning AI detection. The European University Institute explicitly refuses to use detection software, citing its inadequacy and false-positive risk; the 97% undetected rate in the UK study cited by Walsh reinforces that detection is not a viable enforcement strategy. The institutional response that has emerged is disclosure-plus-redesign rather than policing.
- Teaching about AI's failure modes, not just its capabilities. The reasoning-limits literature implies a concrete pedagogical agenda: students need to know about brittleness under variation, the prediction-versus-explanation distinction, ML "leakage" in scientific use (Narayanan and Kapoor), the prevalence of training-data memorization masquerading as reasoning, and the categorical inability of these systems to notice absence or to disconfirm their own hypotheses.
- Training collaborative-cognitive skills explicitly. Riedl and Weidmann's Theory-of-Mind finding implies that prompt-design exercises should foreground perspective-taking — what does the AI need to know in order to help me? — as a distinct, trainable competency.
The most important tensions and open questions
- Is there a deep skill-atrophy effect? Kosmyna's EEG and quoting data say yes within a four-month essay-writing study. Bednar and colleagues' legal RCT, over a similar window, says no — when AI is used for scaffolding rather than for producing the artifact. Different tasks, different sequencing, different findings. This is the most consequential open empirical question.
- Equity. Both Breen and Madden warn that AI may polarize educational outcomes — well-resourced students get Burnett's transformative New Yorker assignment; under-resourced students get Shroff's Einstein-bot automation. Nothing in the corpus offers strong evidence on the size of this gap.
- Vendor entanglement with curricula. The corpus documents an active pipeline from AI companies into classrooms (Anthropic's "Claude Builder Clubs," OpenAI's $100 Codex credits to students, OpenAI's Columbia partnership, ChatGPT Plus free for finals). No paper rigorously studies the effects.
- Are the older studies already obsolete? Several of the strongest empirical results predate the now-default "reasoning" models, and the Apple reasoning-limits paper suggests the new models fail in different ways than the standard LLMs studied earlier. Replication on current systems is overdue.
II. Comprehensive Review
A. What happens cognitively when a learner uses AI?
The empirical anchor here is Kosmyna and colleagues' MIT Media Lab study (Your Brain on ChatGPT, 2025), an EEG investigation in which 54 participants wrote SAT-style essays across three sessions in one of three conditions — ChatGPT, search engine, or pen-and-paper — followed by an optional fourth "swap" session for eighteen of them. The neural data showed strong, widespread brain connectivity in the unaided writers, intermediate connectivity in the search-engine group, and the weakest overall coupling in the ChatGPT group across the alpha, beta, theta, and delta bands. The behavioral correlates were striking: 83% of LLM users in the first session could not produce a correct quote from the essay they had just written, against 11% in the comparison groups; on a second quoting measure, none of the eighteen ChatGPT users produced a correct quote, against fifteen and sixteen out of eighteen in the comparison groups. Sense of ownership over the writing collapsed in the LLM condition. Asymmetric carryover appeared in the fourth session: participants who used the LLM in sessions one through three and then went solo showed weaker connectivity and biased "LLM-specific" vocabulary; participants who began unaided and only added the LLM at the end showed higher memory recall and a network-wide spike in directed connectivity. The authors frame the phenomenon as cognitive debt — deferred mental effort yielding "diminished critical inquiry, increased vulnerability to manipulation, decreased creativity" — and argue for "an educational model that delays AI integration until learners have engaged in sufficient self-driven cognitive effort." The study is a preprint and the interpretation of EEG connectivity findings is contested, but the behavioral pattern (memory, ownership, output homogeneity) is unusually clean.
The Microsoft–CMU survey by Lee, Sarkar, Tankelevitch and colleagues (The Impact of Generative AI on Critical Thinking, CHI 2025) complements the lab finding at population scale. The team surveyed 319 knowledge workers across diverse occupations and collected 936 first-person task examples, framing critical thinking through Bloom's taxonomy. They report two central results. First, self-reported "much less" or "less" cognitive effort with AI was common across every Bloom category — 72% for Knowledge, 79% for Comprehension, 69% for Application, 72% for Analysis, 76% for Synthesis, and 55% for Evaluation. Second, and more diagnostically, which kind of confidence the worker held was decisive: higher confidence in GenAI predicted less critical thinking, while higher self-confidence in one's own ability predicted more critical thinking. Their qualitative analysis suggests that AI does not simply suppress critical thinking — it reshapes it, shifting the worker's role from production to oversight: information gathering becomes information verification, problem-solving becomes AI-response integration, task execution becomes "task stewardship." Time pressure, low task stakes, and unfamiliar domains all suppressed critical engagement, with users defaulting to acceptance.
Fang and colleagues' four-week MIT–OpenAI randomized trial (How AI and Human Behaviors Shape Psychosocial Effects of Chatbot Use, 2025) — 981 participants, more than 300,000 messages, a 2 × 3 design crossing chatbot modality (text / neutral voice / engaging voice) with conversation type (open-ended / non-personal / personal) — found that across all modalities and conversation types, higher daily usage correlated with higher loneliness, higher emotional dependence on the AI, more problematic use, and lower socialization. Notably for educational settings, the "non-personal" conversation condition — the closest analogue to "AI as study tool" — was associated with greater dependence in heavy users. The text modality, often imagined as the most innocuous, elicited the most emotional self-disclosure and the worst psychosocial outcomes once time spent was controlled for. Voice initially appeared to reduce loneliness but eroded that benefit in heavy users. The authors' framing — "social snacking" rather than relationship substitution — implies that even the academic use cases carry psychosocial dosage effects.
Three further studies extend the cognitive picture. Salvi, Horta Ribeiro, Gallotti and West's Nature Human Behaviour paper on the conversational persuasiveness of GPT-4 (2025; n = 900) found that GPT-4 given a brief sociodemographic profile of its opponent won 64% of debates against humans and produced an 81% increase in the odds of post-debate agreement, while non-personalized GPT-4 was statistically on par with humans. Humans given the same demographic data did not use it effectively. The result has educational implications running in two directions — AI's personalizability is a genuine asset for tutoring, and a vulnerability for media-literacy curricula that must now contend with a partner that can out-argue students. Shaw and Nave's cognitive surrender studies (Thinking — Fast, Slow, and Artificial, 2026), with 1,372 participants across three preregistered experiments using a modified Cognitive Reflection Test, document a 25-point accuracy gain when the AI was right and a 15-point loss when it erred; confidence rose even after AI errors, and neither time pressure nor incentives extinguished the surrender pattern. And the OpenAI–Harvard–Duke analysis of more than a million ChatGPT conversations by Chatterji and colleagues (How People Use ChatGPT, 2025) shows that "Asking" messages (decision support) are not only more common than "Doing" messages (output generation) but consistently rated higher in quality — the technology's actual economic value is mostly in cognitive scaffolding, not output replacement, a pattern more compatible with tutoring than with ghostwriting.
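A note on reading that persuasion statistic: the 81% figure is an increase in odds, not in probability, and the two are easy to conflate. The conversion below is purely illustrative (the 30% baseline is an assumed figure, not one reported in the paper) and only shows what an odds ratio of roughly 1.81 means in probability terms.

```python
def apply_odds_ratio(p_baseline: float, odds_ratio: float) -> float:
    """Convert a baseline probability to the probability implied by an odds ratio."""
    odds = p_baseline / (1 - p_baseline)   # probability -> odds
    new_odds = odds * odds_ratio           # an "81% increase in odds" multiplies odds by 1.81
    return new_odds / (1 + new_odds)       # odds -> probability

# Assumed (not reported) baseline: if 30% of participants shifted toward their human
# opponent's position, the same odds multiplied by 1.81 imply roughly 44%.
print(round(apply_odds_ratio(0.30, 1.81), 2))  # 0.44
```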
B. What kind of learning task is helped, hurt, or unchanged?
The Bednar, Cleveland, Erbsen and Schwarcz randomized trial (Artificial Intelligence and Human Legal Reasoning, 2026) with roughly a hundred Minnesota law students is the most informative single study in the corpus for understanding task-by-task effects, because its preregistered design walked the same students through four sequential legal tasks: synthesis (AI permitted for the treatment group), comprehension (closed-book multiple choice for everyone, no AI), application (memo writing for everyone, no AI), and revision (AI permitted for everyone).
The results were:
- On the synthesis task with AI, the AI-exposed group's quality scores rose 60% above the control group's, an effect size of 1.20 standard deviations — large enough to move the average AI user from the twenty-fifth percentile to the seventy-first (a quick check of that conversion appears in the sketch after this list).
- On the comprehension task without AI, there was no detectable difference between groups (3.88 vs 3.86, p = .935). Prior AI exposure did not impair recall of the underlying legal material.
- On the application task without AI, the AI-exposed group again outperformed the control by 24%, a result that reversed the team's preregistered hypothesis of skill atrophy. The mechanism was indirect: controlling for the quality of the earlier synthesis task, AI exposure no longer predicted application performance. Better synthesis notes produced better downstream application work.
- On the revision task with AI for everyone, a clear leveling effect appeared: participants below the application-task mean improved (the lowest scorer gained 1.91 points); participants above the mean regressed (the highest scorer lost 8.08 points). Strong drafts can be flattened by AI editing.
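For readers who want to check the percentile arithmetic behind the synthesis-task effect size above, a back-of-envelope conversion under an assumed normal score distribution reproduces the shift from the 25th percentile to roughly the 70th to 71st. This is only a sanity check (the paper's own calculation may differ in its details), and it assumes SciPy is available.

```python
from scipy.stats import norm

effect_sd = 1.20              # reported effect size, in standard deviations
start_z = norm.ppf(0.25)      # z-score of the 25th percentile, about -0.67
end_percentile = norm.cdf(start_z + effect_sd) * 100
print(round(end_percentile))  # ~70, in line with the 25th-to-71st-percentile framing
```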
These findings sit alongside the Schwarcz, Das, Kang and McDonnell synthesis essay "Thinking Like a Lawyer in the Age of Generative AI" (2025), which collects earlier RCT evidence (the six-task Schwarcz study and the Choi–Monahan–Schwarcz exam study) showing the same leveling pattern at larger scales, plus an important metacognitive finding: lawyers consistently rated AI "equally helpful" across all six tasks even when the objective benefit ranged from substantial to zero, and they underrated o1-preview precisely where it most improved their work. The LawFlow workflow-simulation work by Das and colleagues (cited in that essay) adds the observation that AI-generated task plans are "exhaustive, uniformly structured, treating each step as equally significant" — exactly the opposite of how experienced lawyers actually work, which is adaptive, recursive, and weighted by uncertainty. The Schwarcz paper argues that the major bottleneck to AI adoption in law is not hallucination (RAG-enhanced tools showed no excess hallucinations on the assignment where they delivered zero quality gain) but novices' inability to evaluate AI output in unfamiliar terrain.
The cross-cutting implication is that sequencing matters more than blanket bans or blanket permissions. AI used to produce scaffolding the human then works from appears to help downstream unaided performance. AI used to produce the final artifact, or to revise already-strong work, can hurt. The contrast with Kosmyna's essay-writing study is not necessarily a contradiction: essay-writing is exactly the task where the AI produces the final artifact, while the Bednar synthesis-then-application sequence places the human's effortful work after the AI's contribution.
C. AI's reasoning limits and what students need to know about them
A substantial portion of the corpus is not directly about education but about what LLMs can and cannot reason about. Read as inputs to AI literacy, these papers describe a curriculum of things every learner should be taught to recognize.
The Apple team — Shojaee, Mirzadeh and colleagues (The Illusion of Thinking, 2025) — tested reasoning models including Claude-3.7-Sonnet-Thinking, DeepSeek-R1, and o3-mini on the Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World puzzles. They identified three regimes: at low complexity, non-reasoning models outperformed the reasoning variants and used fewer tokens; at medium complexity, the reasoning models gained an edge; at high complexity, both collapsed to zero accuracy past a threshold. Most diagnostically, the reasoning models' inference effort fell as problem complexity rose past that threshold — they "gave up" while still having tokens available. Even when the team supplied the algorithm in the prompt, performance did not improve: the models could not execute a deterministic procedure step-by-step at scale. Performance on River Crossing with N = 3 was worse than Tower of Hanoi with N = 5, despite the latter requiring 31 moves and the former 11 — likely because instances of River Crossing with N > 2 are scarce in training data. The paper's title — "The Illusion of Thinking" — names the diagnosis precisely: visible chain-of-thought is not evidence of reasoning, and these models' effort scales with difficulty in ways quite unlike human reasoning, differences students need to be taught to notice.
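To make concrete what "execute a deterministic procedure" means here: the optimal Tower of Hanoi solution is a three-line recursion whose move count is 2^N - 1, so the N = 5 instance mentioned above takes exactly 31 moves. The sketch below is only an illustration of that procedure, not the Apple team's prompts or evaluation code.

```python
def hanoi(n, src="A", aux="B", dst="C", moves=None):
    """Generate the optimal Tower of Hanoi move sequence recursively."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, src, dst, aux, moves)   # park the n-1 smaller disks on the spare peg
    moves.append((src, dst))             # move the largest disk to its destination
    hanoi(n - 1, aux, src, dst, moves)   # stack the n-1 smaller disks back on top
    return moves

print(len(hanoi(5)))  # 31 moves (2**5 - 1), versus the 11-move River Crossing instance with N = 3
```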
Lewis and Mitchell's robustness study of analogical reasoning (Evaluating the Robustness of Analogical Reasoning in LLMs, 2024) shows the same pattern in a different domain. Humans scored 0.754 on letter-string analogies and were essentially unaffected by permuted alphabets or symbol substitutions; GPT-4 scored 0.452 in baseline conditions and dropped sharply under those same variations. On digit-matrix problems, GPT-4 scored 0.477 to humans' 0.771 — and lost roughly 33 points just from moving the blank cell from bottom-right to elsewhere. On story analogies, GPT-4 was 100% correct when the correct answer was listed first but only 72% when it was second, an order bias humans did not exhibit. The authors' conclusion — that LLM analogical performance reflects "narrow, non-transferable procedures for task solving" rather than general reasoning — implies that students should be taught to probe AI by varying problems in superficial ways and looking for the brittleness.
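The "probe by varying the problem" suggestion is easy to operationalize. The sketch below is an illustration in the spirit of Lewis and Mitchell's letter-string materials rather than their released code: it builds a successorship analogy over a shuffled alphabet, the kind of superficial variation humans handle easily and the models reportedly do not.

```python
import random
import string

def make_permuted_analogy(seed=0):
    """Build a letter-string analogy ("abc -> abd, ijk -> ?") over a shuffled alphabet."""
    rng = random.Random(seed)
    alphabet = list(string.ascii_lowercase)
    rng.shuffle(alphabet)                     # successorship now follows this new ordering
    succ = {c: alphabet[i + 1] for i, c in enumerate(alphabet[:-1])}
    src, tgt = alphabet[0:3], alphabet[8:11]  # two three-letter strings in the new order
    src_changed = src[:2] + [succ[src[2]]]    # "increment" the final letter of the source
    answer = "".join(tgt[:2] + [succ[tgt[2]]])
    prompt = (f"Permuted alphabet: {' '.join(alphabet)}\n"
              f"If {''.join(src)} changes to {''.join(src_changed)}, "
              f"what does {''.join(tgt)} change to?")
    return prompt, answer

prompt, answer = make_permuted_analogy(seed=1)
print(prompt)
print("expected:", answer)
```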
Fu, Shrivastava, Moore, West, Tan and Holtzman's AbsenceBench (Language Models Can't Tell What's Missing, 2025) tackles a complementary failure mode. The same models which solve "needle in a haystack" problems at near-perfect accuracy across million-token contexts cannot reliably tell what is missing from a document. Claude-3.7-Sonnet-Thinking, the top scorer, reaches only 69.6% F1 averaged across poetry, numerical sequences, and GitHub pull requests at a context of just five thousand tokens. The average drop from "find the inserted line" to "find the deleted line" across models was 56.9 F1 points. Inserting an explicit placeholder (<missing line>) where content was omitted recovered an average of 35.7 points — confirming the mechanism: attention can lock onto present information but not its absence. The authors note the implication for "LLM-as-Judge" applications: if the model cannot notice missing content, students should not trust it to find gaps in their work.
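The scoring setup behind those F1 numbers is simple to state. The function below is a minimal sketch of omission-detection scoring as the task is described, not the benchmark's released harness: the model's guesses about which lines were deleted are compared against the true set difference between the original and the shown document.

```python
def omission_f1(original_lines, shown_lines, predicted_missing):
    """F1 score for guessing which lines were silently deleted from a document."""
    truly_missing = set(original_lines) - set(shown_lines)
    predicted = set(predicted_missing)
    tp = len(predicted & truly_missing)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truly_missing) if truly_missing else 1.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

original = ["roses are red", "violets are blue", "sugar is sweet", "and so are you"]
shown = ["roses are red", "sugar is sweet", "and so are you"]  # one line silently removed
print(omission_f1(original, shown, ["violets are blue"]))      # 1.0: the omission was caught
print(omission_f1(original, shown, ["sugar is sweet"]))        # 0.0: a present line was flagged instead
```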
Ding and Li's Scientific Reports paper (Generative AI lacks the human creativity to achieve scientific discovery from scratch, 2025) takes the question to scientific discovery itself by asking ChatGPT-4 to replicate the Monod–Jacob lac-operon discovery in a semi-automated genetics lab, using human think-aloud protocols from Okada and Simon (1997) as a baseline. ChatGPT-4 scored 1 on a discovery rubric where humans scored 1.67. It generated 5 hypotheses to humans' 14, of 2 distinct types to humans' 7.78, performed 4 key experiments to humans' 9.67, and never proposed an alternative hypothesis or revised one after disconfirmation. It proposed all twelve experiments at once rather than sequentially. There was no "aha moment." The authors' explanation — that for an LLM trained on a corpus dominated by post-discovery descriptions, the threshold for surprise is much higher than for humans — generalizes: LLMs collapse the hypothesis space toward consensus, exactly the opposite of what genuine discovery requires. Felin and Holweg's Strategy Science paper (Theory Is All You Need, 2024) provides the conceptual frame: an LLM trained on all pre-1633 texts would have "overwhelmingly" supported Ptolemy because that view dominated the training corpus; truth, in an LLM, "if it happens to emerge — is a byproduct of statistical patterns," not of correspondence. Their core teaching claim is that human cognition is forward-looking and theory-driven while LLMs are backward-looking and imitative, and the two are categorically different operations — not points on a continuum.
The science-of-AI-in-science cluster — Messeri and Crockett in Nature on "illusions of understanding in scientific research" (2024), Narayanan and Kapoor on overreliance on AI-driven modeling (Nature Comment, 2025), Roscher and colleagues' explainable-ML framework (Explainable Machine Learning for Scientific Insights and Discoveries, 2020) — extends the same warnings to research training. Messeri and Crockett identify three illusions: an illusion of explanatory depth (using an AI predictive model inflates the user's sense of having understood the phenomenon), an illusion of exploratory breadth (researchers believe they are searching the hypothesis space when they are really searching the AI-testable subset), and an illusion of objectivity (treating AI as standpoint-free when it embeds the standpoints of its training data and developers). Crucially for educators, they cite evidence that students using AI on test questions overestimate their own knowledge — exactly the metacognitive failure Schwarcz documents in lawyers, transposed to the classroom. Narayanan and Kapoor add a quantitative anchor: AI use in research has roughly quadrupled across twenty fields between 2012 and 2022 (with growth above 500% in linguistics, philosophy, and medicine), and a systematic review of "AI diagnoses COVID-19 from chest X-rays" studies found that of hundreds of papers, only 62 met basic quality standards and even many of those had merely learned to distinguish adults from children because positives were adults and negatives were children. They call leakage — the AI analogue of training/test contamination — "teaching to the test, or, worse, giving the answers away before the exam." Their pedagogical recommendation is direct: "Courses on quantitative methods should train researchers in machine learning alongside statistics, and ensure that common pitfalls and mitigations are studied."
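Leakage is concrete enough to demonstrate in a few lines. The sketch below is a generic illustration of the pitfall, not taken from any study in the corpus, and it assumes NumPy and scikit-learn are available: selecting "informative" features using all of the labels before splitting yields above-chance test accuracy on pure noise, while doing the selection inside the training split removes the illusion.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2000))   # pure noise features
y = rng.integers(0, 2, size=200)   # labels carry no real signal

# LEAKY: select features using *all* labels, then split.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
Xtr, Xte, ytr, yte = train_test_split(X_leaky, y, random_state=0)
leaky_acc = LogisticRegression(max_iter=1000).fit(Xtr, ytr).score(Xte, yte)

# CLEAN: split first, select features using training labels only.
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
selector = SelectKBest(f_classif, k=20).fit(Xtr, ytr)
clean_acc = LogisticRegression(max_iter=1000).fit(selector.transform(Xtr), ytr).score(
    selector.transform(Xte), yte)

print(f"leaky accuracy: {leaky_acc:.2f}")  # typically well above 0.5 despite pure noise
print(f"clean accuracy: {clean_acc:.2f}")  # back near chance
```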
Riedl and Weidmann's quantification of human–AI synergy (Quantifying Human–AI Synergy, 2025) offers the most optimistic counterweight in this cluster. Across 667 humans paired with either GPT-4o or Llama-3.1-8B on math, physics, and moral-reasoning multiple-choice problems, the average AI boost was 23 percentage points with the weaker Llama model and 29 with the stronger GPT-4o, with the large solo gap between the two models shrinking dramatically once paired with a human. More importantly for educators, the team uses item-response theory to show that "joint ability" and "solo ability" are statistically distinct constructs: the strongest predictor of joint performance is the user's Theory of Mind — their ability to model what the AI knows and how to clarify for it. ToM correlates significantly with joint performance (Spearman ρ = 0.17) but not with solo performance (ρ = 0.06). The trait varies meaningfully within a user across questions, suggesting it is not a fixed disposition but a moment-to-moment skill. The implication for curricula is that prompting and AI collaboration should be taught as a perspective-taking skill — what does the AI need to know to help me? — rather than as a recipe or a vocabulary.
D. Institutional and pedagogical responses
The corpus contains a striking convergence of prescriptions from very different authors. Stazi (Towards a University of Hybrid Intelligence, 2026), drawing on UNESCO's 2023 Guidance for Generative AI in Education and Research and the OECD's 2026 Digital Education Outlook, names what the OECD calls the "performance paradox" or "mirage of false mastery": students who delegate critical thinking to AI receive better short-term grades but perform worse than non-AI peers once AI access is withdrawn. His prescriptive frame, "Towards a University of Hybrid Intelligence," advances Ronald Beghetto's distinction between "Slow AI" — AI used as a Socratic dialogue partner to slow thinking — and "Fast AI" used as an oracle. He proposes three hybrid skills that universities should embed: metacognitive awareness (knowing when to use AI), creative co-creation, and evaluative judgment; he also advocates for "pedagogical sovereignty" — institutional EdGPTs trained on curriculum-aligned, bias-cleansed corpora. He cites Eurostat data showing that only 56% of Europeans aged 16–74 have basic digital skills, against an EU 2030 target of 80%, and frames the gap as a workforce-equity imperative.
The European University Institute's Guidelines for the Responsible Use of AI for Research (approved 2024) offers a concrete real-world template. Its three principles — literacy and self-awareness, individual responsibility, and disciplinary diversity — refuse a one-size-fits-all rule. Most notably, EUI explicitly does not use AI-detection software, citing inadequacy and false-positive risk; it relies on disclosure and individual responsibility backed by standard plagiarism remedies for breach. Permitted uses include ideation, lit-review prompting (with fact-checking), language polishing, and visualization; prohibited uses include AI autonomously writing "substantial or integral parts" of dissertations and uploading confidential data. Supervisors are explicitly told to rethink assessment toward "critical analysis over memorization" and not to use AI to evaluate student performance. The document acknowledges potential inclusion benefits for non-native speakers and dyslexic students and commits to equitable access.
In legal education specifically, the AI-Ready Lawyer competency framework (2025) — built on ABA Formal Opinion 512 (2024) and grounded in Model Rule 1.1's duty of competence — lays out five pillars with three proficiency levels each: AI fluency, AI-enhanced legal work, critical oversight, ethical AI governance, and professional evolution. It insists that the second and third pillars operate simultaneously: "A lawyer should never apply AI to legal work without simultaneously engaging in the oversight and verification that responsible use demands." Its recommendations for law schools include integrating AI literacy into the core curriculum (not as a standalone elective), running simulated exercises that include identifying errors in AI-generated work product, and making critical evaluation of AI output an assessed competency. Madden's parallel argument from Queen Mary (The Dignity of Legal Education vs Artificial Intelligence in Australian Law Schools, 2025), framed around the "dignity of legal education," extends the worry to skill stunting → unemployability → erosion of social conscience, and catalogs hallucination-in-court cases from Australia and abroad (Valu v Minister for Immigration, Re Dayal, Mata v Avianca, Zhang v Chen) as evidence that classroom misuse migrates directly into client harm.
The Stanford Lawyer profiles (Driscoll, The AI Curriculum, 2026) of the law school's curricular response provide a concrete case study. Mark Lemley's Intellectual Property final consists of giving students a set of questions with AI-generated answers and asking them to grade those answers — operationalizing "verification as the skill." Alicia Thesing's 1L Legal Research and Writing program runs the entire first semester in a "closed universe without AI tools" and only adds AI in the spring, mirroring summer-associate work; her guiding principle is "summer-ready." Mills Legal Clinic and Bernice Grant's Entrepreneurship Clinic explicitly foreground risks — including the now-canonical "cited cases that don't exist" failure — before students use AI with clients. The library hired two new AI-focused librarians, claimed to be the first of their kind in U.S. academic law libraries. A companion piece on the LiftLab (Driscoll, Law, Disrupted, Stanford Lawyer 2026) — Stanford's Legal Innovation through Frontier Technology Lab — describes the AI cross-examination simulator Atelier (winner of the Financial Times' 2025 legal innovation award) and reports a striking finding from Nyarko and Ma's study of twelve attorneys reviewing twelve contracts: perfect overlap on which contracts were "bad" but no overlap on why — suggesting AI cannot be trained on a single shared ground truth in mid- to high-level legal work.
Lande's "How I Learned to Stop Worrying and Love the Bot" (2025) is the most practical of the legal-pedagogy pieces. It cites an ABA Task Force survey reporting that 83% of responding law schools now offer students curricular opportunities to use (not merely learn about) AI — in legal writing, trial advocacy, drafting, analytics, and professional responsibility. A small University of Missouri survey he conducted with Jayne Woods found that students were most comfortable with AI but worried about inaccuracy, cheating accusations, privacy, environmental impact, and especially impeded development of legal skills — preferring integrated coursework, hands-on workshops, in-class demos, and tutorials over standalone training. Lande prescribes requiring students to submit AI chat transcripts as "show your work," integrating AI into simulations across all dispute-resolution roles, and using Rogers's diffusion-of-innovations theory to anticipate faculty adoption variance.
Conklin and Houston's empirical study of AI in legal scholarship (Measuring the Rapidly Increasing Use of AI in Legal Scholarship, 2025) — using the ChatGPT-idiosyncratic word "delve" as a tracer in Westlaw — found ~650 articles per year using the word in the decade through 2022, jumping to 724 in 2023 and 907 in 2024, a roughly 40% increase that they argue is implausibly large under any random-noise account. They explicitly caution against weaponizing the method against individual scholars (a single word is not plagiarism), but the normative worry — a "generic, monolithic voice" in legal scholarship, internalized by students who read those articles — is worth taking seriously.
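Their "implausible under random noise" claim is easy to sanity-check. The calculation below is a back-of-envelope version, not the authors' own analysis; it assumes annual counts would fluctuate like Poisson noise around the decade's roughly 650-per-year average, and it assumes SciPy is available.

```python
from scipy.stats import poisson

baseline_rate = 650   # approximate "delve" articles per year in the decade through 2022
observed_2024 = 907

# Probability of seeing 907 or more articles in a year if nothing changed but random noise.
p = poisson.sf(observed_2024 - 1, baseline_rate)
print(f"{p:.1e}")     # roughly 1e-21: far too small to be a chance fluctuation
```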
From the humanities side, three essays converge on a more transformative argument. Burnett's New Yorker piece "Will the Humanities Survive Artificial Intelligence?" (2025) reports that not a single hand went up when he asked thirty undergraduates across twelve majors whether they had used AI, diagnosing institutional incoherence rather than dishonesty. His own classroom experiment — assigning students to converse with a chatbot about the history of attention, edit to four pages, and submit — produced what he calls "the most profound experience of my teaching career," including a student who reported the dialogue "felt like an existential watershed... I don't think anyone has ever paid such pure attention to me and my thinking." His strongest prescriptive claim is that coercion has ended: "You can no longer make students do the reading or the writing. So what's left? Only this: give them work they want to do." Breen, a UC Santa Cruz historian (AI makes the humanities more important, but also a lot weirder, 2025), makes a complementary argument that humanistic skills have become central to AI development itself — pointing out that OpenAI's GPT-4o sycophancy fix was a prose system-prompt edit — and recommends that humanities educators build their own AI-based assignments rather than cede the ground to commercial products. O'Rourke (The Seductions of A.I. for the Writer's Mind, 2025), executive editor of The Yale Review and a Yale creative-writing professor, makes the case for pass/fail grading in writing classes, in-class writing labs without AI access, and explicit policies grounded in personal evidence that AI's seduction is not the cheating but the "mirroring" — when AI returns a sharp version of one's own thought — that hollows out felt authorship. She quotes a line ChatGPT once produced for her: "Style is the imprint of attention. Writing as a human act resists efficiency because it enacts care."
E. The classroom reality the institutions are responding to
The journalism in the corpus is most useful for setting the empirical context against which the institutional responses are being designed. Walsh's New York Magazine feature (Everyone Is Cheating Their Way Through College, 2025) pivots around Chungin "Roy" Lee — the Columbia student who self-reported AI wrote roughly 80% of every essay, built "Interview Coder" (an anti-LeetCode tool), and then raised $5.3 million for Cluely, a screen-listening agent advertised for cheating on "digital LSATs; digital GREs; all campus assignments, quizzes, and tests." His suspension after posting about the hearing has not slowed the company. The piece embeds data points worth retaining: a January 2023 survey finding roughly 90% of college students had used ChatGPT for homework within two months of launch; a June 2024 UK study slipping fully AI-generated work past professors in which 97% went undetected; the Microsoft–CMU critical-thinking study; AI-detector unreliability (Turnitin tuned for false negatives, ZeroGPT flagging Genesis as 93% AI); and the use of "Trojan-horse" prompts ("mention Finland," "mention Dua Lipa") that occasionally work and expose that some students do not even read their own submissions. Shroff's Atlantic piece (Is Schoolwork Optional Now?, 2026) extends the picture forward to 2026 and to agentic AI — describing the "Einstein bot" by Advait Paliwal that accepts a student's Canvas credentials and then watches the lectures, does the readings, takes the quizzes, and submits homework — and reports that the percentage of school-age students self-reporting AI homework use rose 14 percentage points in seven months. Both pieces are journalistic rather than empirical, but they document the institutional pressure that the EUI, Stanford, and AI-Ready Lawyer responses are designed to meet.
F. Tensions, open questions, and limitations
Three substantive tensions cut across the corpus. The first is whether AI causes durable skill atrophy. Kosmyna and colleagues' EEG evidence — particularly the asymmetric carryover effect, where students who used the LLM in the first three sessions continued to show weaker connectivity and biased vocabulary even when working unaided — points strongly toward atrophy. The Bednar legal RCT, over a similar time horizon, finds no atrophy and a positive indirect effect of AI exposure on later unaided performance. The reconciliation is probably about task type and sequencing — AI producing the final artifact (essay) vs. AI producing scaffolding the human then transforms (synthesis notes feeding a memo) — but the corpus does not contain a direct test of that hypothesis. The second tension is equity. Breen and Madden both argue that AI use will polarize educational outcomes — well-resourced students get Burnett-style transformative assignments, under-resourced students get Einstein-bot automation — and the journalism strongly implies the same. The corpus does not contain quantitative evidence on the size of that gap. The third tension is the entanglement between AI vendors and curricula: Walsh documents Columbia's OpenAI partnership and free ChatGPT Plus during finals; Shroff documents Anthropic's "Claude Builder Clubs" paying student ambassadors and OpenAI's $100 Codex credits for students. No paper studies the effects of this pipeline.
The corpus also has limitations the reader should keep in mind. The strongest single empirical study — Kosmyna et al. — remains a preprint with a small sample (n = 54) and contested EEG-connectivity methods. The Microsoft–CMU survey is self-report. The legal-education RCTs use small student samples at a single institution. The reasoning-limits literature was largely conducted on models that were state-of-the-art six to twenty-four months before this writing; the gap between studied systems and current frontier systems is widening. And the institutional documents — EUI guidelines, Stanford curriculum, AI-Ready Lawyer framework — describe what their authors propose, not what is reliably implemented at scale.
G. Synthesis: what the corpus tells educators
Pulling the threads together, four general claims are well-supported across the corpus.
First, AI use changes the cognitive structure of the learning task, often in ways the learner cannot detect. Confidence in AI rises faster than verification ability; self-assessed understanding inflates without underlying comprehension; effort drops across every Bloom category. This is documented at the EEG level (Kosmyna), at the population-survey level (Lee et al.), in the controlled lab (Shaw and Nave), and in the metacognitive ratings of working professionals (Schwarcz et al.).
Second, task design and sequencing are the most powerful pedagogical levers educators have. The same student can show no skill atrophy when AI is used for scaffolding (Bednar) and clear cognitive debt when AI is used to produce the artifact (Kosmyna). The leveling effect — AI helping the bottom and sometimes hurting the top — is replicated enough times to be treated as a robust empirical finding, and it has direct curricular implications: AI tools designed for "average" performance can flatten high-performance work.
Third, AI literacy is not just about prompting; it is fundamentally about understanding the failure modes of these systems. The reasoning-limits literature — Apple's complexity collapse, Lewis-Mitchell brittleness, AbsenceBench, Ding-Li discovery failures, Felin-Holweg's truth-as-frequency — describes a coherent profile of how LLMs fail. Students need to be taught this profile explicitly, not as a cautionary footnote but as the substantive content of a methods course.
Fourth, the institutional response that has emerged across very different settings is consistent: build unaided skill first, introduce AI second, make verification of AI output the new assessed skill, abandon detection enforcement in favor of disclosure and assessment redesign, and treat "working with AI" as a separable trainable competency centered on perspective-taking. This is documented at the policy level (Stazi, EUI), in the law-school competency framework, in the Stanford curriculum profiles, and in the dispute-resolution faculty literature (Lande). The convergence is striking enough that an institution adopting it would be drawing on a wide and varied evidence base.
What the corpus does not yet provide is durable longitudinal evidence — over more than four months, across more than one institution, in more than one task domain — of whether these institutional responses actually preserve the cognitive capacities they are designed to protect. That is the most pressing research gap.
Synthesis prepared from approximately fifty corpus items judged relevant under a broad reading of "AI in education" — directly educational papers and policy documents, empirical work on cognition and knowledge work with AI, and reasoning-limits research informing AI literacy.
Sources
Each entry names the file in this folder (after "— file:"). Where a paper has a stable public identifier (DOI, arXiv, SSRN, journal URL), it is given after "Public:".
Empirical work on cognition and knowledge work with AI
- Kosmyna, N., Hauptmann, E., Yuan, Y. T., Situ, J., Liao, X.-H., Beresnitzky, A. V., Braunstein, I., & Maes, P. (2025). Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task. arXiv:2506.08872v1. — file: 2506.08872v1.pdf Public: https://arxiv.org/abs/2506.08872
- Lee, H.-P. (Hank), Sarkar, A., Tankelevitch, L., Drosos, I., Rintel, S., Banks, R., & Wilson, N. (2025). The Impact of Generative AI on Critical Thinking: Self-Reported Reductions in Cognitive Effort and Confidence Effects From a Survey of Knowledge Workers. CHI '25 (ACM). — file: lee_2025_ai_critical_thinking_survey.pdf Public: https://doi.org/10.1145/3706598.3713778
- Fang, C. M., Liu, A. R., Danry, V., Lee, E., Chan, S. W. T., Pataranutaporn, P., Maes, P., Phang, J., Lampe, M., Ahmad, L., & Agarwal, S. (2025). How AI and Human Behaviors Shape Psychosocial Effects of Chatbot Use: A Longitudinal Randomized Controlled Study. arXiv:2503.17473v1. — file: 2503.17473v1.pdf Public: https://arxiv.org/abs/2503.17473
- Shaw, S. D., & Nave, G. (2026, working paper). Thinking—Fast, Slow, and Artificial: How AI is Reshaping Human Reasoning and the Rise of Cognitive Surrender. Wharton School, Univ. of Pennsylvania. — file: 6097646.pdf Public: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6097646
- Salvi, F., Horta Ribeiro, M., Gallotti, R., & West, R. (2025). On the conversational persuasiveness of GPT-4. Nature Human Behaviour. — file: 4472D97F-4BF1-426F-8BEF-6CA66275F2AB.pdf Public: https://doi.org/10.1038/s41562-025-02194-6
- Chatterji, A., Cunningham, T., Deming, D., Hitzig, Z., Ong, C., Shan, C., & Wadman, K. (2025). How People Use ChatGPT. OpenAI / Duke / Harvard working paper, Sept 15, 2025. — file: economic-research-chatgpt-usage-paper.pdf
AI use, task design, and sequencing in legal education
- Bednar, N., Cleveland, D., Erbsen, A., & Schwarcz, D. (2026, forthcoming). Artificial Intelligence and Human Legal Reasoning. University of Minnesota Law School working paper. — file: 6525800.pdf Public: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6525800
- Schwarcz, D., Das, D., Kang, D., & McDonnell, B. (2025). Thinking Like a Lawyer in the Age of Generative AI: Cognitive Limits on AI Adoption Among Lawyers. University of Minnesota Law School draft, May 19, 2025. — file: 5260645.pdf Public: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5260645
- Lande, J. (2025). How I Learned to Stop Worrying and Love the Bot: What I Learned About AI and What You Can Too. Univ. of Missouri Legal Studies Research Paper No. 2025-23. — file: 5254156.pdf Public: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5254156
- Conklin, M., & Houston, C. (2025). Measuring the Rapidly Increasing Use of Artificial Intelligence in Legal Scholarship. — file: 5190385.pdf Public: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5190385
- Madden, R. (2025). The Dignity of Legal Education vs Artificial Intelligence in Australian Law Schools. Queen Mary Law Research Paper No. 472/2025. — file: ssrn-5563519.pdf Public: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5563519
- The AI-Ready Lawyer: A Competency Framework for the AI Era. (Post-ABA Formal Opinion 512, 2024–2025.) — file: The AI-Ready Lawyer_A Competency Framework for the AI Era.pdf
Institutional and pedagogical responses
- Stazi, A. (2026). Towards a University of Hybrid Intelligence. Techno Polis Policy Brief n. 2/2026. — file: 6386218.pdf Public: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6386218
- European University Institute (EUI) Ethics Committee. (2024). EUI Guidelines for the Responsible Use of Artificial Intelligence for Research. Approved by Academic Council, 15 May 2024. — file: 2024.06-Ethics-Commitee-EUI-GENAI-DIGITAL.pdf
- Driscoll, S., with Schreiber, M. (2026, April 6). The AI Curriculum. Stanford Lawyer, Issue 113. — file: 3CD3F38E-2B84-4B96-B6F8-34A6FE316540-original.txt
- Driscoll, S. (2026, April 6). Law, Disrupted: How Powerful AI Tools Are Transforming Legal Practice and Education. Stanford Lawyer, Issue 113. — file: A7031DD4-526A-4A3A-9611-EEE2E630BF99-original.txt
- Burnett, D. G. (2025, April 26). Will the Humanities Survive Artificial Intelligence? The New Yorker. — file: A198370B-3D49-442F-B238-BEC35E4FE833-original.txt
- Breen, B. (2025, May 7). AI makes the humanities more important, but also a lot weirder: Historians are finally having their AI debate. (Substack newsletter.) — file: 1011DFBB-A3A8-4DAC-9C91-CCC437ABFC11-original.txt
- O'Rourke, M. (2025). The Seductions of A.I. for the Writer's Mind. The New York Times. — file: 0F44FF71-4038-441A-8E4D-4936E5533592-original.txt
Classroom realities (journalism)
- Walsh, J. D. (2025, May 5). Everyone Is Cheating Their Way Through College. New York Magazine / Intelligencer. — file: 44B761D7-6DEC-4947-ADDE-7A851B70CF66-original.txt
- Shroff, L. (2026, April 10). Is Schoolwork Optional Now? The Atlantic. — file: D3A894C4-877C-47B2-B469-64952A7381EB-original.txt
AI reasoning limits and AI literacy
- Shojaee, P., Mirzadeh, I., Alizadeh, K., Horton, M., Bengio, S., & Farajtabar, M. (2025). The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. Apple ML research preprint. — file: the-illusion-of-thinking.pdf
- Lewis, M., & Mitchell, M. (2024). Evaluating the Robustness of Analogical Reasoning in Large Language Models. arXiv:2411.14215v1. — file: 2411.14215v1.pdf Public: https://arxiv.org/abs/2411.14215
- Fu, H. Y., Shrivastava, A., Moore, J., West, P., Tan, C., & Holtzman, A. (2025). AbsenceBench: Language Models Can't Tell What's Missing. arXiv:2506.11440. — file: 2506.11440.pdf Public: https://arxiv.org/abs/2506.11440
- Felin, T., & Holweg, M. (2024). Theory Is All You Need: AI, Human Cognition, and Causal Reasoning. Strategy Science, 9(4), 346–371. — file: felin-holweg-2024-theory-is-all-you-need-ai-human-cognition-and-causal-reasoning.pdf Public: https://doi.org/10.1287/stsc.2024.0189
- Riedl, C., & Weidmann, B. (2025). Quantifying Human–AI Synergy. Preprint. — file: RiedlWeidmann2025-Human-AI-Synergy.pdf
AI and the production of scientific knowledge
- Messeri, L., & Crockett, M. J. (2024). Artificial intelligence and illusions of understanding in scientific research. Nature, 627, 49–58. — file: s41586-024-07146-0.pdf Public: https://doi.org/10.1038/s41586-024-07146-0
- Narayanan, A., & Kapoor, S. (2025). Why an overreliance on AI-driven modelling is bad for science. Nature, 640, 312–314 (Comment). — file: d41586-025-01067-2.pdf Public: https://doi.org/10.1038/d41586-025-01067-2
- Ding, A. W., & Li, S. (2025). Generative AI lacks the human creativity to achieve scientific discovery from scratch. Scientific Reports, 15, 9587. — file: s41598-025-93794-9.pdf Public: https://doi.org/10.1038/s41598-025-93794-9
- Roscher, R., Bohn, B., Duarte, M. F., & Garcke, J. (2020). Explainable Machine Learning for Scientific Insights and Discoveries. IEEE Access, 8, 42200–42216. — file: Explainable_Machine_Learning_for_Scientific_Insights_and_Discoveries.pdf Public: https://doi.org/10.1109/ACCESS.2020.2976199
This synthesis is also browsable as a filterable map — the evidence explorer lets you find the study behind any claim and follow it to the source.