AI Applications
This section presents several demonstrators that were built to showcase our findings. For instance, Figure 12 shows the final API for suggesting ESCO skills for course annotation using embedders, a bi-encoder and a cross-encoder.
This API follows a client/server architecture and has been deployed without major issues for over a year.
A bi-encoder combined with a generative response stage integrates a dual-encoder retrieval system with a generative model to efficiently retrieve relevant information and generate natural, context-aware responses. The process begins when a user submits a query. A bi-encoder handles retrieval by encoding the query and documents into embedding vectors using transformer-based models like BERT or RoBERTa [4]. The query embedding is compared to the precomputed document embeddings using a similarity metric, such as cosine similarity or dot product, to retrieve the most relevant documents.
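As a minimal sketch of this retrieval step, assuming a sentence-transformers bi-encoder checkpoint and a toy document list (neither is the exact model or data used in the demonstrator):

```python
# Bi-encoder retrieval sketch: encode query and documents, rank by cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder bi-encoder checkpoint

documents = [
    "ESCO skill: data analysis",
    "ESCO skill: project management",
    "ESCO skill: machine learning",
]
doc_embeddings = model.encode(documents, convert_to_tensor=True)  # precomputed once

query = "course on neural networks"
query_embedding = model.encode(query, convert_to_tensor=True)

# Cosine similarity between the query and every document embedding.
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
top_k = scores.argsort(descending=True)[:2]
for idx in top_k.tolist():
    print(documents[idx], float(scores[idx]))
```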
After retrieval, the system processes the documents to extract the most relevant information. This step may involve selecting specific sections (chunking), summarising lengthy content, or filtering out irrelevant details to ensure concise and focused input for the generative model.
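A simple, illustrative way to perform the chunking step is to split each document into overlapping word windows before it is embedded or passed to the generator; the window and overlap sizes below are arbitrary assumptions:

```python
# Split a document into overlapping word windows (sizes are illustrative).
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks
```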
Next, the retrieved content and the original query are passed to a generative model to produce a coherent response. The query and the retrieved context are combined into a structured input.
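A hedged sketch of how such a structured input could be assembled; the exact prompt format is an assumption for illustration:

```python
# Combine the user query and the retrieved passages into one prompt string.
def build_prompt(query: str, passages: list[str]) -> str:
    context = "\n".join(f"- {p}" for p in passages)
    return f"Question: {query}\nContext:\n{context}\nAnswer:"
```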
Finally, the system post-processes the generated response to clean up redundant or unnecessary text, improve clarity, and ensure factual accuracy if additional verification is applied.
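As a toy illustration of this clean-up step (the real pipeline may apply stronger checks, for example factual verification):

```python
# Trim whitespace and drop immediately repeated sentences from a generated answer.
def clean_response(text: str) -> str:
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    deduped = []
    for s in sentences:
        if not deduped or s != deduped[-1]:
            deduped.append(s)
    return ". ".join(deduped) + "."

print(clean_response("The course covers Python. The course covers Python.  It also covers SQL."))
```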
This approach combines the efficiency of bi-encoders for retrieval with the natural language generation capabilities of generative models. The result is a system that delivers accurate, dynamic, and contextually relevant responses while leveraging the scalability of precomputed document embeddings and ensuring the generative output remains grounded in retrieved information.
Retrieval-Augmented Generation (RAG) is a framework designed for tasks like document Question-and-Answering (Q&A). It combines a retrieval component to fetch relevant information and a generation component to produce coherent responses.
The process begins when a user provides a question or query. The system uses a retrieval model to fetch relevant documents or passages from a knowledge base. This knowledge base is pre-processed into a searchable format, often using embeddings. Retrieval typically relies on methods like Dense Passage Retrieval (DPR) or vector similarity searches, such as FAISS [5], but traditional approaches like BM25 can also be used.
To retrieve the most relevant documents, the input query is first converted into an embedding using a model like BERT. The query embedding is then compared to the document embeddings using measures like cosine similarity to rank relevance. The top-k most relevant documents are selected, where k is a parameter that can be tuned.
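A minimal sketch of this top-k similarity search using FAISS [5]; the encoder checkpoint, toy documents, and value of k are placeholders, not the demonstrator's actual configuration:

```python
# Build a FAISS index over normalised document embeddings and search top-k.
import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder
documents = ["passage one ...", "passage two ...", "passage three ..."]

doc_vecs = encoder.encode(documents, normalize_embeddings=True).astype("float32")
index = faiss.IndexFlatIP(doc_vecs.shape[1])  # inner product == cosine on normalised vectors
index.add(doc_vecs)

query_vec = encoder.encode(["user question ..."], normalize_embeddings=True).astype("float32")
k = 2  # tunable top-k
scores, ids = index.search(query_vec, k)
print([(documents[i], float(s)) for i, s in zip(ids[0], scores[0])])
```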
Sometimes, the retrieved documents are further processed or condensed to ensure that only the most relevant portions are used. This step may include techniques like chunking, summarisation, or filtering. The refined information is then passed to the generation module.
The generation model takes the query and the retrieved documents as input to produce the final answer. The input to the generator typically combines the query and the retrieved content, such as in the format: "Question: <query>. Context: <retrieved documents>." The generator, often a model like GPT or T5, uses this input to craft a coherent and contextually relevant response. The output may also undergo post-processing to clean or format the response for the user.
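The generation step could be sketched as follows with a small seq2seq model from Hugging Face transformers; the checkpoint and the prompt wording are illustrative assumptions, not the exact setup of our demonstrators:

```python
# Feed "Question: ... Context: ..." into a seq2seq generator and decode the answer.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")  # placeholder generator
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

query = "Which skills does this course teach?"
retrieved = "The course covers linear regression and neural networks."
prompt = f"Question: {query} Context: {retrieved}"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```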
Optionally, RAG systems can include a feedback loop to refine their performance. This might involve re-ranking the relevance of retrieved documents or fine-tuning the generative model based on user feedback.
RAG offers several advantages. It can scale to large, dynamic knowledge bases, which can be updated independently of the generative model. By grounding the generation in specific retrieved content, it reduces the likelihood of generating hallucinated or inaccurate answers, ensuring that responses are both accurate and dynamic.
This approach is widely applied in document Q&A systems, customer support, open-domain Q&A, and research assistance, making it an effective framework for integrating retrieval and generation in a single pipeline.
BM25, cross-encoders and RAG can be combined to create a highly effective and efficient Q&A system by leveraging their complementary strengths. Together, they can provide accurate retrieval, precise ranking, and coherent response generation.
The process starts with BM25, which acts as the initial retrieval mechanism. When a user submits a query, BM25 retrieves a broad set of potentially relevant documents from the knowledge base. This step is computationally efficient and quickly narrows down the vast pool of documents to a manageable subset based on keyword matching and term frequency scoring.
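A minimal sketch of this first retrieval stage, assuming the rank_bm25 library and a toy corpus:

```python
# Cheap keyword-based pre-selection of candidate documents with BM25.
from rank_bm25 import BM25Okapi

corpus = [
    "Course on machine learning and neural networks.",
    "Introduction to project management.",
    "Advanced statistics for data analysis.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

query = "neural network course"
candidates = bm25.get_top_n(query.lower().split(), corpus, n=2)  # broad candidate set
print(candidates)
```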
Next, the retrieved documents are re-ranked using a cross-encoder for more precise scoring. The query and each document are paired and jointly encoded by a transformer model, which analyses their relationship in detail. The cross-encoder assigns a relevance score to each query-document pair, ranking the documents based on how well they answer the query. This step ensures that the most relevant and contextually appropriate documents are prioritised.
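The re-ranking stage might be sketched as follows with a cross-encoder from sentence-transformers; the checkpoint name is an assumption, and any relevance-trained cross-encoder would do:

```python
# Jointly score each (query, document) pair and sort candidates by relevance.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # placeholder checkpoint
query = "neural network course"
candidates = [
    "Course on machine learning and neural networks.",
    "Advanced statistics for data analysis.",
]
scores = reranker.predict([(query, doc) for doc in candidates])
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
print(ranked)
```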
Finally, the top-ranked documents are passed into a RAG framework for response generation. The user’s query, along with the retrieved and re-ranked documents, is fed into a generative model like GPT, Llama, T5, etc. The model uses the provided context to produce a coherent, natural language response that is both accurate and grounded in the retrieved documents.
By combining these three methods, the system benefits from BM25’s speed for initial retrieval, the cross-encoder’s precision for re-ranking, and RAG’s ability to generate detailed and contextually appropriate answers. This integration ensures that the Q&A system is both efficient and capable of delivering high-quality responses.
Notice that this proposed setting is a variation of the previous architecture, in which a bi-encoder was used instead of BM25. This raises the question of which architecture should be used. In simple terms, there is currently no right answer to this question: the dataset and the preprocessing (e.g. chunking) will largely influence the results.
For our last demonstrator, we used the llama3.2 LLM running on-premises under Ollama [6]. In practice, Ollama works much like ChatGPT, but it is open source and can be run within a closed system, making it ideal when privacy is a priority and flexibility to use other models is required. Additionally, Ollama offers several state-of-the-art models, and a thriving community ensures that the models are updated often.
For this demonstrator, we used llama3.2 as a base model to create our own derived model, which has been instructed to act as an examiner grading students' answers to exam questions. For visualisation and testing purposes, a UI was created, which is shown in Figure 15.
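The core examiner behaviour can be sketched by sending a grading system prompt to a locally served llama3.2 through Ollama's chat API; the default local endpoint, the grading rubric, and the example answer below are illustrative assumptions, not the exact configuration of the demonstrator:

```python
# Ask a locally served llama3.2 (via Ollama) to grade a student's answer.
import requests

messages = [
    {"role": "system",
     "content": "You are an examiner. Grade the student's answer from 0 to 10 "
                "and give a one-sentence justification."},
    {"role": "user",
     "content": "Exam question: What is overfitting?\n"
                "Student answer: When a model memorises the training data and "
                "performs poorly on new data."},
]
response = requests.post(
    "http://localhost:11434/api/chat",
    json={"model": "llama3.2", "messages": messages, "stream": False},
    timeout=120,
)
print(response.json()["message"]["content"])
```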
Initial tests are encouraging: we have tested this system with Q&A material from exams provided by Hochschule Schmalkalden. We have observed that the shorter the student's responses, the more accurate the assigned grade is. We are confident that these preliminary results can be improved, so we plan to continue this work, focusing on training the models with well-curated data and on creating an API that can be integrated into other services.