Answer Generation Evaluation
Goal
To evaluate the generator component of the RAG system: how well the generated responses answer the question and how well they are supported by the provided context.

Data
A set of questions, retrieved chunks, and generated responses from the courses Kritisches Denken (serviceID 2699) and Agiles Mindset (serviceID 2700), with 20 content-related questions per course.

Method/Approach
The RAG pipeline was replicated: a vector search in OpenSearch retrieved the relevant chunks (the context), and the retrieved chunks together with the query were passed to the LLM, which generated an answer. The generated answers were then evaluated by the LLM for answer relevance and groundedness. A sketch of the replicated pipeline and of the scoring step is shown at the end of this section.

Results
For both Kritisches Denken and Agiles Mindset the average relevance score was 0.89, indicating that most answers were highly relevant to the queries. Groundedness scores varied: Kritisches Denken 0.75, Agiles Mindset 0.83.

Evaluation Metrics
Answer Relevance: evaluates whether the final response addresses the entirety of the user's question.
Groundedness: evaluates whether the generated response is based on the provided context, by assessing the information overlap between each sentence of the response and the relevant parts of the context.
The scores produced by the LLM are mapped to a range from 0 to 1 (0 = not relevant/not grounded, 1 = fully relevant/grounded) and averaged to obtain an overall score per course.

Conclusions
Most answers were highly relevant to the queries. Outliers with low scores revealed two main issues: retrieval failures, where the correct chunk was not retrieved, and content limitations, where the available context was insufficient to fully answer the question.
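The following is a minimal sketch of the replicated pipeline described under Method/Approach: an OpenSearch k-NN vector search retrieves the context chunks, which are then passed together with the query to the LLM. The index name, field names, model names, and the embed() helper are illustrative assumptions and not taken from the actual system.

```python
# Sketch of the replicated RAG pipeline (assumed names and models).
from opensearchpy import OpenSearch
from openai import OpenAI

search_client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
llm_client = OpenAI()


def embed(text: str) -> list[float]:
    """Placeholder: embed the query with the same model used to index the chunks."""
    response = llm_client.embeddings.create(model="text-embedding-3-small", input=text)
    return response.data[0].embedding


def retrieve_chunks(query: str, index: str = "course_chunks", k: int = 5) -> list[str]:
    """Vector search in OpenSearch for the k chunks closest to the query."""
    body = {
        "size": k,
        "query": {"knn": {"embedding": {"vector": embed(query), "k": k}}},
    }
    hits = search_client.search(index=index, body=body)["hits"]["hits"]
    return [hit["_source"]["text"] for hit in hits]


def generate_answer(query: str, chunks: list[str]) -> str:
    """Pass the retrieved chunks and the query to the LLM and return its answer."""
    context = "\n\n".join(chunks)
    completion = llm_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer the question using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return completion.choices[0].message.content


# Example usage with one content-related question:
question = "What characterises an agile mindset?"
answer = generate_answer(question, retrieve_chunks(question))
```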
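The scoring step can be sketched as an LLM-as-judge loop. The source only states that the LLM's scores are mapped to the 0 to 1 range and averaged; the 0 to 5 judging scale, the prompts, and the judge model below are assumptions made for illustration.

```python
# Sketch of the LLM-as-judge scoring for answer relevance and groundedness
# (assumed 0-5 rating scale, prompts, and judge model).
from statistics import mean
from openai import OpenAI

llm_client = OpenAI()

RELEVANCE_PROMPT = (
    "Rate from 0 to 5 how completely the ANSWER addresses the QUESTION. "
    "Reply with a single integer.\n\nQUESTION: {question}\n\nANSWER: {answer}"
)
GROUNDEDNESS_PROMPT = (
    "Rate from 0 to 5 how well every sentence of the ANSWER is supported by the "
    "CONTEXT. Reply with a single integer.\n\nCONTEXT: {context}\n\nANSWER: {answer}"
)


def judge(prompt: str) -> float:
    """Ask the judge LLM for a 0-5 rating and map it to the 0-1 range."""
    completion = llm_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    raw = int(completion.choices[0].message.content.strip())
    return raw / 5  # 0 = not relevant/not grounded, 1 = fully relevant/grounded


def evaluate(samples: list[dict]) -> dict:
    """samples: list of {'question': ..., 'context': ..., 'answer': ...} dicts."""
    relevance = [judge(RELEVANCE_PROMPT.format(**s)) for s in samples]
    groundedness = [judge(GROUNDEDNESS_PROMPT.format(**s)) for s in samples]
    # Per-question scores are averaged into the course-level scores reported above.
    return {"answer_relevance": mean(relevance), "groundedness": mean(groundedness)}
```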