AI budgets are climbing fast—average monthly AI spend is projected to jump 36% this year—yet only about half of organizations say they can confidently evaluate the ROI behind those costs. (CloudZero) At the same time, daily AI usage among desk workers has exploded (up more than 200% in six months in some surveys), which means every small “ask” to an AI model is now a real line item in your P&L. (Salesforce)
For CIOs and CTOs in finance, tech, and healthcare, the issue is no longer whether to invest in AI. It’s what you’re actually paying for—and how much of that token and infrastructure spend is quietly wasted on workloads that could (and should) be handled locally on the hardware you already own.
This piece is about that gap: how AI cloud costs creep up, what they look like in the real world, and why shifting non-research, everyday tasks to on-device AI—using CPU, GPU, and increasingly the NPU—should be a deliberate part of your 2026 planning.
The new AI line item: token + infra burn
Most AI conversations still focus on model quality, safety, and “transformation.” Far fewer get specific about the cost mechanics:
- Tokens: Every prompt and response is billed in tokens—essentially the words and characters your models chew through (a quick example of the per-call arithmetic follows this list).
- Infrastructure: You’re paying not just for model inference, but for the cloud platforms, vector databases, GPUs, observability, networking, and security layers that sit around it.
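As a rough illustration of the token side, the per-call arithmetic is trivial. The numbers below are placeholders (real prices vary by model, provider, and the prompt-vs-completion split), but they show the shape of the cost:

```python
# Back-of-envelope cost of one small "ask" at an illustrative premium rate.
# Assumed numbers: ~700 prompt tokens + ~300 completion tokens, $15 per million tokens.
PRICE_PER_MILLION_TOKENS = 15.00  # USD, placeholder blended rate

prompt_tokens = 700
completion_tokens = 300

cost_per_call = (prompt_tokens + completion_tokens) / 1_000_000 * PRICE_PER_MILLION_TOKENS
print(f"One summary/rewrite call: ${cost_per_call:.4f}")  # roughly 1.5 cents
```

A cent and a half reads like rounding error. It stops being rounding error when thousands of people do it dozens of times a day, which is exactly the math we’ll run later in this piece.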
Recent research on AI costs highlights three uncomfortable realities for large enterprises:
- Average AI spend per org is already in the mid–five figures per month, and is expected to rise by about 36% year over year. (CloudZero)
- Cloud-based AI tools take the biggest share of that budget, especially public cloud platforms and generative AI services. (CloudZero)
- Only 51% of companies feel confident they can actually calculate AI ROI, largely because of hidden cloud expenses and poor cost attribution. (APMdigest)
Layer that onto BCG’s finding that only about 5% of companies are realizing significant, scalable value from AI while 60% struggle to achieve material gains, and you get a pretty clear picture: spend is up, usage is up, and value is uneven. (BCG Global)
From what I’m hearing in year-end strategy reviews, CIOs aren’t questioning AI’s potential—they’re worried about drift: AI costs growing faster than discipline, especially when every low-stakes summary, rewrite, or translation defaults to an expensive cloud model.
Not every prompt deserves the cloud
It’s worth drawing a hard line between three categories of AI work:
- Research and high-value reasoning
  - Deep market or risk research
  - Complex modeling and simulation
  - Cross-domain analysis that genuinely benefits from cutting-edge cloud models and massive context windows
- Domain-specific, sensitive workloads
  - Anything touching PII, PHI, or regulated transactions
  - Use cases where the risk of data exfiltration or mishandling is non-negotiable
- Everyday “digital chores” (non-research tasks)
  - Summarizing documents, calls, and emails
  - Rewriting content for tone or clarity
  - Breaking complex material into simpler explainer content
  - Translating internal docs or messages
  - Cleaning up notes, tickets, and handoffs
That third bucket—digital chores—is where token burn quietly gets out of hand.
The irony is that technically, the market has already given us a better option. Hardware and model efficiency have advanced to the point that smaller, GPT-3.5-level systems can now run at a fraction of past costs, and open-weight models are within a couple of percentage points of closed models on many benchmarks. (Stanford HAI)
So while training frontier models still demands massive, specialized infrastructure, running capable models for everyday tasks no longer has to—especially if you’re willing to use the silicon you’re already purchasing: CPUs, GPUs, and NPUs on AI-class endpoints.
For a lot of non-research, non-public-data work, sending every prompt to the cloud is like using a private jet for last-mile delivery: impressive, but economically absurd.
I think you’ll be hearing the term ‘heterogeneous’ more and more as it applies to hardware, and specifically to silicon. Partners of ours like Intel are making great strides in configuring their silicon to meet customers’ needs by reducing the friction of deploying AI at scale. The term also has impact beyond the hardware itself: it describes how systems are aligned to do the right thing, at the right time. You can learn more about Intel’s approach in their article.
What AI cloud costs look like in healthcare, finserve, and tech
None of this is abstract. Here’s how cloud-centric AI costs tend to show up in the three sectors you and I live in.
Healthcare: Paying premium rates to clean up notes
Where cloud AI is being used:
- Clinical documentation and note-taking
- Care coordination messaging and handoffs
- Operational reporting and quality documentation
What it costs in practice:
- Thousands of daily prompts to summarize, rewrite, and structure clinician dictation
- Repeated calls on long EHR notes, discharge summaries, and referral letters—for formatting changes or minor edits
- All of it routed through premium cloud models, largely for convenience, not necessity
Where this is overkill:
- Intelligent dictation + revision workflows that never leave the four walls of a hospital network
- Routine transformations of already internal data (e.g., “shorten this note for the patient portal,” “clean up this handoff for the night team”)
Most of these tasks don’t require new public knowledge. They’re formatting, clarity, and compression jobs. They’re ideal candidates for local models that can run on clinician laptops or edge devices—keeping PHI closer to the point of care and cutting token and infra spend at the same time.
Financial services: Using frontier models to rewrite paragraphs
Where cloud AI is being used:
- Turning complex research and risk output into digestible summaries
- Drafting client communications, briefs, and talking points based on internal analysis
- Summarizing call notes, service tickets, and case histories
Where cloud is absolutely appropriate:
- Deep research against public market, macro, and regulatory data
- Complex modeling where you genuinely want the strongest possible general-purpose model
Where it’s clearly overkill:
- Polishing and shortening internal content that’s already been vetted
- Turning analyst reports into bullet points for a client email
- Breaking legal or regulatory text into “explain it to me like a client” summaries
In finserve, sensitivity to PII and sovereign data is already forcing careful scoping for cloud AI. What often slips through the cracks is the economics: paying frontier-model rates to do glorified word processing on content that never needed to leave your perimeter in the first place.
Tech companies: Burning tokens to move text between tools
Where cloud AI is being used:
- Internal documentation summaries
- Product specs → short briefs → release notes → customer-facing copy
- Support tickets and incident reports → postmortem drafts
- Developer-written content repackaged for other audiences
Where you don’t want this piece to go:
- Deep into dev tools. The goal here isn’t to critique Copilot or similar. It’s the layer around those tools.
Where overkill shows up:
- Taking content generated in dev or product tools and repeatedly summarizing, rewriting, and translating it as it moves into docs, marketing, sales enablement, and support
- Each transformation is small—“clean up this paragraph,” “turn this into a call script”—but they happen thousands of times a week
Here, cloud spend adds up not because a single query is expensive, but because volume multiplies small, unnecessary costs.
Some conservative napkin math: what 20% misallocation costs you
Let’s do the simplest possible math for a 50,000-employee organization. And don’t take my math as gospel: do your own and see how it plays out (there’s a short script after the figures below if you want to plug in different numbers).
Assumptions (all intentionally conservative):
- 60% of employees are active desk workers using AI regularly → 30,000 people
- Each uses AI for 20 small tasks per workday (summaries, rewrites, translations)
- Each task consumes roughly 1,000 tokens (prompt + completion)
- You’re using a premium cloud model at around $15 per million tokens
- 220 working days per year
- And only 20% of those prompts are “overkill” that could have been served by a local, on-device model
The math works out like this:
- 30,000 workers × 20 prompts/day = 600,000 prompts/day
- 600,000 prompts × 1,000 tokens = 600 million tokens/day
- 600 million tokens × $15 per million tokens ≈ $9,000/day in cloud inference for these chores
- Over ~220 working days, that’s about $2 million/year in cloud AI just for everyday tasks
- If only 20% of that volume is “overkill” that could have run locally, you’re quietly spending ~$400,000/year on the wrong tool for the wrong job
That’s one workload category, under conservative assumptions, in a single large enterprise. As daily AI usage grows (and the Slack Workforce Index suggests it’s growing very quickly), that overkill slice scales with it. (Salesforce)
And remember: this is only the direct per-token spend. It does not include:
- The cost of cloud GPUs and dedicated infra
- New observability and cost management tools
- Security reviews, policy work, and approvals
- Data engineering and integration efforts to wire AI into your existing stack
CloudZero’s analysis is blunt on this point: hidden cloud and maintenance costs are among the biggest reasons companies struggle to measure AI ROI at all. (APMdigest)
Architect for heterogeneity: start local on whatever silicon you have
None of this is an argument to abandon cloud AI. You need cloud models for:
- Research and high-value reasoning
- Large-scale training and fine-tuning
- Scenarios where you genuinely benefit from global context and cutting-edge capabilities
But your architecture shouldn’t treat every prompt like it’s one of those.
A more durable pattern looks like this:
- Classify workloads by risk and complexity, not just by “does it use AI.”
  - Tier 1: High-value, research-grade, cross-domain work → Cloud is fine, maybe necessary.
  - Tier 2: Sensitive internal workloads → Could be private cloud or tightly-controlled environments.
  - Tier 3: Everyday digital chores on internal data → Default to local, on-device models wherever possible.
- Architect for heterogeneity from day one. Start local on whatever silicon you already have—CPU and GPU in your current fleet—and then scale into the NPU wave as you refresh hardware.
Modern AI PCs ship with NPUs specifically optimized for low-latency, low-power inference on exactly the kind of small models that excel at summarization, rewriting, and translation. The point isn’t just speed; it’s cost control and predictability.
- Don’t strand your existing fleet. If your entire AI strategy only runs on the newest hardware, you’re signing up for a long, expensive refresh cycle before you ever see broad value. Any serious local-AI plan should:
  - Run efficiently on CPU/GPU for older endpoints
  - Automatically take advantage of NPUs when they’re present
  - Keep the experience consistent enough that users don’t care which chip is doing the work
- Shift the default for low-risk tasks to local. Rather than writing stricter and stricter cloud AI policies and hoping employees comply, change the default (a minimal sketch of this routing pattern follows below):
  - Local by default for non-research, low-risk tasks
  - Cloud by exception for truly high-value workloads that justify the extra cost and complexity
When you do this, cloud AI becomes a deliberate strategic resource, not the path of least resistance for every “clean up this email” prompt someone fires off between meetings.
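To make that concrete, here’s a minimal sketch of what a workload router could look like. The tier logic mirrors the classification above; the task categories and the `run_on_device` / `run_in_cloud` functions are hypothetical stand-ins for your local runtime and your cloud gateway, not references to any specific product:

```python
# Sketch: route prompts by workload tier -- local/on-device by default,
# cloud by exception for research-grade, explicitly approved work.
from dataclasses import dataclass
from enum import Enum, auto


class Tier(Enum):
    RESEARCH = auto()       # Tier 1: high-value, cross-domain reasoning
    SENSITIVE = auto()      # Tier 2: PII/PHI or regulated data
    DIGITAL_CHORE = auto()  # Tier 3: summaries, rewrites, translations


# Illustrative task categories; in practice this metadata should come from
# your AI gateway or the calling application, not from guessing.
CHORE_TASKS = {"summarize", "rewrite", "translate", "simplify", "clean_up_notes"}


@dataclass
class Request:
    task: str
    contains_regulated_data: bool = False
    approved_for_cloud: bool = False


def classify(req: Request) -> Tier:
    if req.contains_regulated_data:
        return Tier.SENSITIVE
    if req.task in CHORE_TASKS:
        return Tier.DIGITAL_CHORE
    return Tier.RESEARCH


def run_on_device(prompt: str) -> str:
    # Placeholder for your local runtime (small model on CPU/GPU/NPU).
    return f"[local model handled: {prompt[:40]}...]"


def run_in_cloud(prompt: str) -> str:
    # Placeholder for your cloud model call, behind whatever gateway you use.
    return f"[cloud model handled: {prompt[:40]}...]"


def route(req: Request, prompt: str) -> str:
    tier = classify(req)
    if tier is Tier.RESEARCH and req.approved_for_cloud:
        # The exception path: work that justifies frontier-model cost.
        return run_in_cloud(prompt)
    # Default path: chores stay on the device. Sensitive work is kept local here
    # for simplicity; your Tier 2 may instead route to a private, controlled environment.
    return run_on_device(prompt)


print(route(Request(task="summarize"), "Shorten this handoff note for the night team."))
```

The details will differ in every shop; the part worth copying is that the cheap, private option is the default and the expensive one requires an explicit decision.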
Where to start
If you’re reading this as a CIO or CTO and thinking “yes, but our AI roadmap is already in motion,” here’s a simple place to begin without ripping anything out:
- Instrument what you already have. Get a clearer view of AI usage by function: who’s calling what models, for which tasks, and at what cost. You don’t need perfect attribution; directional stats are enough to spot obvious overkill.
- Carve out the digital chores. Identify the top 3–5 non-research use cases in each major function (clinical doc, care coordination, ops; research → comms; internal docs → customer-facing summaries). Treat these as a distinct workload class and model what happens if 20% of those calls move local.
- Pilot a local-first pattern in one or two high-volume teams. Pick a unit where the chores are obvious: clinical documentation, operations, or internal comms. Run a small pilot that keeps those workloads on-device, using models sized to run on CPU/GPU today and NPU hardware as it arrives (a rough sketch of that fallback pattern follows this list).
- Feed the results back into governance—not just finance. Use what you learn to refine not only budgeting, but policies: where cloud is warranted, where local is preferred, how exceptions are handled. The goal is a sustainable mix of cloud and local that your teams understand and trust.
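One practical note on the pilot step: runtimes such as ONNX Runtime let you list execution providers in priority order and fall back automatically, so the same code path can use an NPU where one exists and CPU or GPU everywhere else. This is a rough sketch; which providers are actually available depends on your onnxruntime build and the vendor’s drivers, and `summarizer.onnx` is a placeholder for whatever small local model you pilot:

```python
# Sketch: prefer the NPU when present, fall back to GPU, then CPU.
# Provider availability depends on the installed onnxruntime package(s) and drivers.
import onnxruntime as ort

PREFERRED_PROVIDERS = [
    "QNNExecutionProvider",       # e.g., Qualcomm NPUs
    "OpenVINOExecutionProvider",  # e.g., Intel NPUs/iGPUs via OpenVINO
    "DmlExecutionProvider",       # DirectML GPUs on Windows
    "CUDAExecutionProvider",      # NVIDIA GPUs
    "CPUExecutionProvider",       # always available
]

available = set(ort.get_available_providers())
providers = [p for p in PREFERRED_PROVIDERS if p in available]

# "summarizer.onnx" is a placeholder model path for the pilot workload.
session = ort.InferenceSession("summarizer.onnx", providers=providers)
print("Running on:", session.get_providers()[0])
```

The value isn’t the specific runtime; it’s that one deployment can span your mixed fleet, so the pilot doesn’t have to wait for a hardware refresh.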
Closing thought
We’re past the point where AI is a science experiment. It’s a fixed cost. And like any fixed cost, where you run the work matters as much as the work you choose to do.
Cloud AI isn’t going away—and it shouldn’t. But if every non-research, low-stakes prompt is hitting your most expensive models in your most expensive environment, you’re leaving easy money on the table and making it harder to see where AI is actually moving the needle.
If you’re wrestling with these questions—how to balance cloud and local, where NPUs and your existing fleet fit, and what “good” looks like for AI spend in a 50,000-person org—I’m always happy to compare notes.

