Streaming

When you use AI assistants such as ChatGPT or Claude, the response text appears incrementally as it is generated. Compared with waiting a long time for the full response, streaming each partial response as soon as it is ready offers a much better user experience.

To enable streaming responses in an application that uses Spring AI, all of the following conditions must be satisfied:

  • The AI service supports streaming.
  • The Spring AI client for this AI service supports streaming.
  • Your application backend supports streaming.
  • Your application frontend supports streaming.

Most AI services support streaming, so the first condition is rarely an issue. In Spring AI, the client of an AI service must implement the StreamingModel interface, whose stream method returns a Flux of responses. For chat models, StreamingChatModel returns a Flux<ChatResponse>.

On the backend, the typical way to implement streaming is server-sent events (SSE). Spring supports server-sent events through the ServerSentEvent class. All we need to do is convert the Flux<ChatResponse> to a Flux<ServerSentEvent>. The data of a server-sent event can be plain text or a JSON string.
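To make the data format more concrete, here is a minimal sketch of what a server-sent event looks like on the wire. It uses only the Java standard library; SseFrames and toFrame are hypothetical names for illustration and are not part of Spring or Spring AI. In a real application Spring serializes ServerSentEvent objects itself, and the exact bytes may differ slightly (for example, the space after the colon is optional).

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Spring or Spring AI. It only illustrates
// the text/event-stream format that server-sent events use on the wire.
final class SseFrames {

    // Formats one payload as a single server-sent event.
    // Every line of the payload gets its own "data:" field,
    // and an empty line marks the end of the event.
    static String toFrame(String data) {
        return Arrays.stream(data.split("\n", -1))
                .map(line -> "data: " + line + "\n")
                .collect(Collectors.joining()) + "\n";
    }

    public static void main(String[] args) {
        // Assumed example chunks of a streamed chat response.
        List<String> chunks = List.of("Hi,", " how", " can I", " help", " you?");
        for (String chunk : chunks) {
            System.out.print(toFrame(chunk));
        }
    }
}
```

A client strips at most one space after the colon, so a chunk such as " how" keeps its own leading space. Wrapping the text in a JSON string is another way to keep whitespace and line breaks unambiguous.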

On the frontend, we can use the browser's built-in EventSource API or a third-party library such as eventsource to consume server-sent events.

There are two streaming modes: full and incremental.

Full

In full streaming mode, each response in the stream contains the full text generated so far. For example, if the whole response is "Hi, how can I help you?", the streamed responses may look like this:

Hi,
Hi, how
Hi, how can I
Hi, how can I help
Hi, how can I help you?

In full mode, each new element supersedes the previous one, so we only need to keep the latest element in the stream.
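As a rough illustration, the sketch below models the stream as a plain list of snapshots and returns the most recent one. FullModeReader and latest are hypothetical names, and a List stands in for the reactive stream; with a real Flux the same idea applies by overwriting the displayed text every time an element arrives.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical helper for illustration only. A List stands in for the
// elements received so far from a full-mode stream.
final class FullModeReader {

    // Returns the most recent snapshot, which in full mode already
    // contains all text generated so far. Empty if nothing has arrived yet.
    static Optional<String> latest(List<String> snapshots) {
        if (snapshots.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(snapshots.get(snapshots.size() - 1));
    }

    public static void main(String[] args) {
        List<String> snapshots = List.of(
                "Hi,",
                "Hi, how",
                "Hi, how can I",
                "Hi, how can I help",
                "Hi, how can I help you?");
        System.out.println(latest(snapshots).orElse(""));
    }
}
```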

Incremental

In incremental streaming mode, each response in the stream contains only the newly generated part of the text. For example, if the whole response is "Hi, how can I help you?", the streamed responses may look like this:

Hi,
how
can I
help
you?

In incremental mode, we need to aggregate the elements in the stream, concatenating them in order, to get the full response. The aggregation can be done on the server side or on the client side.
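The aggregation can be sketched in a few lines of plain Java. IncrementalAggregator is a hypothetical class for illustration, and the sketch assumes that each chunk carries its own whitespace (for example " how" rather than "how"), so that simple concatenation reproduces the original text.

```java
import java.util.List;

// Hypothetical helper for illustration only. It accumulates the chunks
// of an incremental-mode stream in arrival order.
final class IncrementalAggregator {

    private final StringBuilder buffer = new StringBuilder();

    // Appends one chunk and returns the text accumulated so far,
    // which is what a UI would display after this chunk arrives.
    String append(String chunk) {
        buffer.append(chunk);
        return buffer.toString();
    }

    // Returns the full text aggregated so far.
    String text() {
        return buffer.toString();
    }

    public static void main(String[] args) {
        // Assumption: chunks include their own leading spaces.
        List<String> chunks = List.of("Hi,", " how", " can I", " help", " you?");
        IncrementalAggregator aggregator = new IncrementalAggregator();
        for (String chunk : chunks) {
            System.out.println(aggregator.append(chunk));
        }
    }
}
```

Each value returned by append matches what full mode would have sent at the same point, so aggregating on the server effectively turns an incremental stream into a full one.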