Web Page Q&A

Summary

This article describes a web page Q&A implementation using Spring AI. Given a web page, the sample application loads its content into a vector store, then uses an LLM to answer the user's queries based on that content.

The complete source is available on GitHub at JavaAIDev/web-page-qa.

The content of web pages is used to provide context for an LLM to answer queries.

Prerequisites

  • Java 21
  • A vector database. pgvector is used in the sample.
    • Use the Docker Compose file to start pgvector.
  • Ollama to run local models.
    • bge-large model for text embedding.
    • qwen2.5 model for chat completion.
    • Pull models with ollama pull, for example ollama pull bge-large.

Load Web Page Content

The first step is to load the content of a web page. This is done with the jsoup library. WebPageReader is an implementation of Spring AI's DocumentReader. The text of the page body is converted to a single Document, with the page URL stored as metadata.

Custom X509ExtendedTrustManager

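The custom X509ExtendedTrustManager implementation used in WebPageReader trusts all SSL certificates. This is required to load web pages from sites that use self-signed or otherwise untrusted SSL certificates. Because certificate validation is disabled, connections are exposed to man-in-the-middle attacks, so this approach should be limited to testing or to sites you already trust.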

WebPageReader to read content from a web page
package com.javaaidev.webpageqa.etl;

import java.net.Socket;
import java.security.SecureRandom;
import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;
import java.util.List;
import java.util.Map;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509ExtendedTrustManager;
import org.jsoup.Jsoup;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentReader;

public class WebPageReader implements DocumentReader {

  private static final Logger LOGGER = LoggerFactory.getLogger(WebPageReader.class);

  private final String url;

  public WebPageReader(String url) {
    this.url = url;
  }

  @Override
  public List<Document> get() {
    try {
      // Build an SSL context that accepts any certificate (validation is disabled)
      var trustManager = createTrustManager();
      var sslContext = SSLContext.getInstance("TLS");
      sslContext.init(null, new TrustManager[]{trustManager}, new SecureRandom());
      var doc = Jsoup.connect(url).sslSocketFactory(sslContext.getSocketFactory()).get();
      // Keep only the body text and record the source URL as metadata
      return List.of(new Document(doc.body().text(), Map.of(
          "url", url
      )));
    } catch (Exception e) {
      // On any failure, log the error and return no documents
      LOGGER.error("Failed to load web page", e);
      return List.of();
    }
  }

  // All check methods are intentionally empty, so every certificate chain is accepted
  private X509ExtendedTrustManager createTrustManager() {
    return new X509ExtendedTrustManager() {

      @Override
      public void checkClientTrusted(X509Certificate[] chain, String authType)
          throws CertificateException {

      }

      @Override
      public void checkServerTrusted(X509Certificate[] chain, String authType)
          throws CertificateException {

      }

      @Override
      public X509Certificate[] getAcceptedIssuers() {
        return new X509Certificate[0];
      }

      @Override
      public void checkClientTrusted(X509Certificate[] chain, String authType, Socket socket)
          throws CertificateException {

      }

      @Override
      public void checkServerTrusted(X509Certificate[] chain, String authType, Socket socket)
          throws CertificateException {

      }

      @Override
      public void checkClientTrusted(X509Certificate[] chain, String authType, SSLEngine engine)
          throws CertificateException {

      }

      @Override
      public void checkServerTrusted(X509Certificate[] chain, String authType, SSLEngine engine)
          throws CertificateException {

      }
    };
  }

}
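
For sites that present valid certificates, the trust-all manager is not needed. The following is a minimal sketch of an alternative, assuming the JDK's default trust store is acceptable: it builds an SSLContext from the default TrustManagerFactory, so certificate validation stays in place. The class name DefaultTrustSslContext is hypothetical and is not part of the sample project.

```java
import java.security.KeyStore;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;

// Hypothetical helper, not part of the sample project
public class DefaultTrustSslContext {

  // Creates a TLS context whose trust managers validate certificates
  // against the JDK's default trust store.
  public static SSLContext create() throws Exception {
    var factory = TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
    // Passing a null KeyStore selects the default trust store
    factory.init((KeyStore) null);
    var sslContext = SSLContext.getInstance("TLS");
    // Null key managers and null SecureRandom fall back to the defaults
    sslContext.init(null, factory.getTrustManagers(), null);
    return sslContext;
  }
}
```

The socket factory from this context could be passed to jsoup the same way the reader passes its own, through sslSocketFactory(sslContext.getSocketFactory()).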

Chunking

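The content of a web page may be quite long, so it needs to be split into smaller documents. This step is called chunking. There are different chunking strategies. The one used here is recursive text splitting, implemented with LangChain4j's DocumentSplitters.recursive(500, 100): each chunk holds at most 500 characters, and adjacent chunks overlap by up to 100 characters so that text spanning a chunk boundary is not lost.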

RecursiveTextSplitter to split text
package com.javaaidev.webpageqa.etl;

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.splitter.DocumentSplitters;
import dev.langchain4j.data.segment.TextSegment;
import java.util.List;
import org.springframework.ai.transformer.splitter.TextSplitter;

public class RecursiveTextSplitter extends TextSplitter {

  @Override
  protected List<String> splitText(String text) {
    var splitter = DocumentSplitters.recursive(500, 100);
    return splitter.split(new Document(text))
        .stream()
        .map(TextSegment::text)
        .toList();
  }
}
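
To illustrate what overlapping chunks look like, here is a simplified sketch that uses a fixed-size sliding window over characters. It is not the recursive strategy used by the sample; it only shows how a chunk size and an overlap size interact. The class name OverlapChunker and its parameters are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of overlap, not the sample's recursive splitter
public class OverlapChunker {

  // Splits text into chunks of at most maxSize characters. Each chunk
  // after the first starts with the last `overlap` characters of the previous one.
  public static List<String> split(String text, int maxSize, int overlap) {
    if (maxSize <= 0 || overlap < 0 || overlap >= maxSize) {
      throw new IllegalArgumentException("require 0 <= overlap < maxSize");
    }
    List<String> chunks = new ArrayList<>();
    int step = maxSize - overlap; // how far the window advances each time
    for (int start = 0; start < text.length(); start += step) {
      int end = Math.min(start + maxSize, text.length());
      chunks.add(text.substring(start, end));
      if (end == text.length()) {
        break; // the last window reached the end of the text
      }
    }
    return chunks;
  }
}
```

With a chunk size of 4 and an overlap of 1, the text "abcdefghij" is split into "abcd", "defg", and "ghij": the last character of each chunk reappears at the start of the next.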

Save Documents

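The chunks are then saved into a vector store. EtlTask, listed below, runs the whole process: it reads the web page, applies the document transformer, and writes each resulting document with the document writer.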

EtlTask
package com.javaaidev.webpageqa.etl;

import java.util.List;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.ai.document.DocumentTransformer;
import org.springframework.ai.document.DocumentWriter;

public class EtlTask implements Runnable {

  private final String webPageUrl;
  private final DocumentTransformer documentTransformer;
  private final DocumentWriter documentWriter;

  private static final Logger LOGGER = LoggerFactory.getLogger(EtlTask.class);

  public EtlTask(String webPageUrl, DocumentTransformer documentTransformer,
      DocumentWriter documentWriter) {
    this.webPageUrl = webPageUrl;
    this.documentTransformer = documentTransformer;
    this.documentWriter = documentWriter;
  }

  @Override
  public void run() {
    LOGGER.info("Load web page: {}", webPageUrl);
    var reader = new WebPageReader(webPageUrl);
    var docs = documentTransformer.apply(reader.get());
    LOGGER.info("{} docs to store", docs.size());
    var index = 0;
    for (var doc : docs) {
      LOGGER.info("Save doc #{}", index + 1);
      documentWriter.accept(List.of(doc));
      index++;
    }
    LOGGER.info("Imported documents");
  }
}
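
The shape of this pipeline can be sketched with plain JDK types. In Spring AI, DocumentReader is a Supplier, DocumentTransformer is a Function, and DocumentWriter is a Consumer, each over a list of documents. The sketch below substitutes strings for documents and an in-memory list for the vector store; EtlPipelineSketch and everything inside it are hypothetical stand-ins, not code from the sample.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical stand-in for the read -> transform -> write flow
public class EtlPipelineSketch {

  // Reads items, transforms them, writes them one at a time,
  // and returns how many items were written.
  public static int run(Supplier<List<String>> reader,
      Function<List<String>, List<String>> transformer,
      Consumer<List<String>> writer) {
    List<String> items = transformer.apply(reader.get());
    for (String item : items) {
      writer.accept(List.of(item)); // one item per call, mirroring EtlTask
    }
    return items.size();
  }

  public static void main(String[] args) {
    List<String> store = new ArrayList<>(); // stands in for the vector store
    int count = run(
        () -> List.of("first sentence. second sentence"),
        docs -> docs.stream().flatMap(d -> Arrays.stream(d.split("\\. "))).toList(),
        store::addAll);
    System.out.println(count + " chunks stored: " + store);
  }
}
```

Writing one document per call, as EtlTask does, makes progress easy to log. Passing the whole list in a single call is another option if per-document logging is not needed.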

Q&A

Q&A is implemented using QuestionAnswerAdvisor in Spring AI. We only need to add this advisor when creating ChatClient. A VectorStore is required when creating this advisor.

Create ChatClient
package com.javaaidev.webpageqa.chat;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.client.advisor.QuestionAnswerAdvisor;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ChatModuleConfiguration {

  @Bean
  public ChatClient chatClient(ChatClient.Builder builder,
      VectorStore vectorStore) {
    return builder.defaultAdvisors(new QuestionAnswerAdvisor(vectorStore)).build();
  }
}

In the REST controller, we only need to provide the original user input. QuestionAnswerAdvisor will search the vector store for similar documents and include them in the prompt sent to an LLM.

REST controller
package com.javaaidev.webpageqa.chat;

import com.javaaidev.chatagent.model.ChatAgentRequest;
import com.javaaidev.chatagent.model.ChatAgentResponse;
import com.javaaidev.chatagent.springai.ModelAdapter;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.messages.Message;
import org.springframework.http.codec.ServerSentEvent;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;
import reactor.core.publisher.Flux;

@RestController
public class ChatController {

  private final ChatClient chatClient;

  public ChatController(ChatClient chatClient) {
    this.chatClient = chatClient;
  }

  @PostMapping("/chat")
  public Flux<ServerSentEvent<ChatAgentResponse>> chat(@RequestBody ChatAgentRequest request) {
    return ModelAdapter.toStreamingResponse(
        chatClient.prompt()
            .messages(ModelAdapter.fromRequest(request).toArray(new Message[0]))
            .stream()
            .chatResponse());
  }
}
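
The retrieval step that QuestionAnswerAdvisor performs can be pictured with a small, self-contained sketch. This is a conceptual illustration only, not Spring AI's implementation: the vectors are hand-written instead of produced by an embedding model, the store is a plain map, and the prompt layout is an assumed format. The class RetrievalSketch and its methods are hypothetical.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of similarity search plus prompt building
public class RetrievalSketch {

  // Cosine similarity between two non-zero vectors of equal length.
  public static double cosine(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  // Returns the texts of the topK chunks most similar to the query vector.
  public static List<String> search(Map<String, double[]> store, double[] query, int topK) {
    return store.entrySet().stream()
        // Negate the score so that the most similar entries sort first
        .sorted(Comparator.comparingDouble(
            (Map.Entry<String, double[]> e) -> -cosine(e.getValue(), query)))
        .limit(topK)
        .map(Map.Entry::getKey)
        .toList();
  }

  // Places the retrieved chunks before the question (assumed prompt layout).
  public static String buildPrompt(String question, List<String> context) {
    return "Context:\n" + String.join("\n", context) + "\n\nQuestion: " + question;
  }
}
```

In the sample application, the embeddings come from the bge-large model and the search runs in pgvector. The sketch only shows why the controller can pass the raw user input: the advisor handles retrieval and prompt assembly.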

Test

Now we can test the REST API. Chat Agent UI is added to this project, so it can be used to test the application.

Start the server and use the UI (http://localhost:8080/webjars/chat-agent-ui/index.html) to test.

See the screenshot below.