What OpenAI actually announced

On or around 1 July 2026, OpenAI said it would run its new flagship model, GPT-5.6 Sol, on Cerebras wafer-scale hardware at up to 750 tokens per second, starting that month. Access is initially limited to select customers and will widen as capacity expands. This is not a research demo. It is a production commitment to a specific piece of silicon.

Behind it sits a binding Master Relationship Agreement, disclosed by OpenAI and Cerebras and worth over 20 billion USD. It covers 750 megawatts of wafer-scale inference capacity from 2026 through 2028, with provisions to expand to 2 gigawatts by 2030. GPT-5.6 ships in three sizes, priced per one million tokens: Sol at 5 USD input and 30 USD output (roughly GBP 3.95 and GBP 23.70); Terra at 2.50 USD input and 15 USD output; and Luna at 1 USD input and 6 USD output.
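To turn the per-million-token prices into a per-request figure, a minimal sketch follows. It assumes the Terra and Luna prices are quoted in USD like Sol's, and the request sizes in the usage line are illustrative, not taken from any announcement.

```python
# Prices in USD per one million tokens as (input, output), from the figures above.
# Assumption: Terra and Luna are quoted in USD, like Sol.
PRICES_USD_PER_MILLION = {
    "Sol": (5.00, 30.00),
    "Terra": (2.50, 15.00),
    "Luna": (1.00, 6.00),
}


def request_cost_usd(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request on the given tier."""
    in_price, out_price = PRICES_USD_PER_MILLION[tier]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Illustrative request: 2,000 tokens in, 500 tokens out, on Sol.
print(f"{request_cost_usd('Sol', 2000, 500):.3f} USD")
```

Output tokens dominate the bill at these ratios, so long generated answers cost far more than long prompts.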

The number that matters for owners is not the model name. It is 750 tokens per second, delivered by a named supplier, under a named contract, for a named term.

Speed, not just intelligence, is now the product

A frontier-class model on a traditional GPU cluster streams at roughly 40 to 120 tokens per second. Cerebras disclosed that its wafer-scale approach runs the same model weights up to about 15 times faster than GPU-based systems by placing compute, memory and bandwidth on a single wafer rather than splitting them across many chips.

That difference changes which products are buildable. At 40 to 120 tokens per second, a real-time voice agent stutters, a live code reviewer lags behind the developer, and interactive document analysis feels like waiting. At 750 tokens per second, those latency-bound workloads become viable. The upgrade is not a smarter answer. It is an answer fast enough to sit inside a live workflow.
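The arithmetic behind that claim is simple enough to sketch. The 600-token response length is an assumed example, and the calculation deliberately ignores time to first token and network latency, so real figures would be somewhat worse at every speed.

```python
def stream_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Seconds needed to stream a response at a steady generation rate."""
    return output_tokens / tokens_per_second


# Assumed example: a 600-token answer at the three rates discussed above.
for rate in (40, 120, 750):
    print(f"{rate:>3} tokens/s: {stream_seconds(600, rate):.1f} s")
```

At 40 tokens per second the answer takes 15 seconds, at 120 it takes 5, and at 750 it takes under one second, which is the difference between a pause in a conversation and no perceptible pause at all.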

The speed lives at a single address

Here is the concentration problem. That 750 tokens per second is not a property of the model in the abstract. It is a property of one vendor's wafer-scale silicon, running a model that is itself under US-government access restrictions and in limited preview to around 20 approved companies. Change any one of those three conditions (the silicon, the access rules or the preview allocation), and the speed you designed around disappears.

For a UK company, this stacks three dependencies that used to be separate. The model is American and export-controlled. The chip is a single American supplier's proprietary wafer. The throughput ceiling is set by a contract you are not a party to. Sovereign inference was once a question about whose chips you run on. It is now also a question about whose tokens per second you rent, and today the answer routes through one US supply chain.

Make tokens-per-second a priced dependency

Treat inference speed the way you already treat any single-source input: as a priced, contestable dependency, not a free upgrade. The first task is to measure. Know the tokens-per-second ceiling your latency-bound features actually need, and know the ceiling your current supplier gives you. If a feature only works above a certain speed, that speed is now part of your product specification.
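One way to take that measurement is sketched below. It assumes your supplier's client exposes the response as an iterable that yields one token (or chunk) at a time; `measure_tokens_per_second` and `meets_spec` are hypothetical helper names, not part of any vendor SDK. The figure includes time to first token, so it reflects what a user experiences rather than peak generation speed.

```python
import time
from typing import Callable, Iterable


def measure_tokens_per_second(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Consume a token stream and return tokens per second of wall-clock time."""
    start = clock()
    count = 0
    for _ in stream:  # each item is assumed to be one token
        count += 1
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return count / elapsed


def meets_spec(measured_tps: float, required_tps: float) -> bool:
    """True if the measured speed satisfies the feature's stated requirement."""
    return measured_tps >= required_tps
```

Run it against realistic prompts at realistic times of day, and record the required figure next to the feature it protects. If the stream yields multi-token chunks, count tokens per chunk instead of counting items.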

The second task is to keep a second path. Identify at least one alternative that can carry the same workload, even at lower speed, so that a contract term, an export rule or a capacity limit at one vendor does not silently switch off a live product. For European and UK owners specifically, this is where the sovereign-inference conversation earns its place: not as politics, but as continuity planning for a throughput ceiling you do not control.
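As a rough sketch of what a second path looks like in practice, the selection logic can be very small. The `Path` type, the path names and the speeds below are hypothetical; how you detect availability (health checks, contract status, export rules) is left out and would be specific to your setup.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Path:
    name: str                 # hypothetical label for a supplier or deployment
    tokens_per_second: float  # measured ceiling on this path
    available: bool           # result of your own availability check


def choose_path(paths: Iterable[Path], required_tps: float) -> Tuple[Path, bool]:
    """Pick the fastest available path; flag it as degraded if below spec."""
    live = [p for p in paths if p.available]
    if not live:
        raise RuntimeError("no inference path available")
    best = max(live, key=lambda p: p.tokens_per_second)
    return best, best.tokens_per_second < required_tps


# Illustrative: the fast primary path is down, so the slower one carries the load.
paths = [Path("primary", 750.0, False), Path("secondary", 100.0, True)]
chosen, degraded = choose_path(paths, required_tps=400.0)
print(chosen.name, "degraded" if degraded else "in spec")
```

The degraded flag is the point of the exercise: the product keeps running on the slower path, and the feature can switch to a mode that tolerates the lower speed instead of failing outright.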

The winners of the next phase will not simply hold the smartest model. They will know their tokens-per-second number, know who controls it, and have already priced the cost of losing it.