Answer Generation Evaluation

Goal
To evaluate the generator component of the RAG system, assessing how well the generated responses answer the question and are supported by the provided context.

Data
A set of questions, retrieved chunks, and generated responses from Kritisches Denken (serviceID 2699) and Agiles Mindset (serviceID 2700), with 20 content-related questions per course.

Method/Approach
The RAG pipeline was replicated: a vector search in OpenSearch retrieved the relevant chunks (context), and the retrieved chunks and the query were passed to the LLM, which generated an answer. The generated answers were then evaluated by the LLM for relevance and groundedness (see the retrieval and generation sketch below).

Results
For both Kritisches Denken and Agiles Mindset, the average relevance score was 0.89, indicating that most answers were highly relevant to the queries. Groundedness scores varied: Kritisches Denken 0.75, Agiles Mindset 0.83.

Evaluation Metrics
Answer Relevance: evaluates whether the final response addresses the entirety of the user's question.
Groundedness: evaluates whether the generated response is based on the provided context, by assessing the information overlap between each sentence of the response and the relevant parts of the context.
Scores produced by the LLM are mapped to a range from 0 to 1 (0 = not relevant/not grounded, 1 = relevant/grounded) and averaged to obtain an overall score (see the scoring sketch below).

Conclusions
Most answers were highly relevant to the queries. Outliers with low scores revealed two main issues: retrieval failures, where the correct chunk was not retrieved, and content limitations, where the available context was insufficient to fully answer the question.
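
The retrieval and generation steps can be reproduced roughly as follows. This is a minimal sketch, assuming a local OpenSearch instance with a k-NN index, a SentenceTransformers embedder, and a placeholder LLM client; the index name `course_chunks`, the field names, the embedding model, and `call_llm` are illustrative assumptions, not the production configuration.

```python
# Minimal sketch of the replicated RAG pipeline: embed the question,
# run a k-NN vector search in OpenSearch, and prompt the LLM with the
# retrieved chunks. Index/field/model names are illustrative placeholders.
from opensearchpy import OpenSearch
from sentence_transformers import SentenceTransformer

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")


def call_llm(prompt: str) -> str:
    """Stand-in for the deployed LLM client (assumption, not the actual API)."""
    raise NotImplementedError("plug in the project's LLM client here")


def retrieve_chunks(question: str, index: str = "course_chunks", k: int = 5) -> list[str]:
    """Vector search in OpenSearch for the k most relevant chunks (context)."""
    vector = embedder.encode(question).tolist()
    body = {"size": k, "query": {"knn": {"embedding": {"vector": vector, "k": k}}}}
    hits = client.search(index=index, body=body)["hits"]["hits"]
    return [hit["_source"]["text"] for hit in hits]


def generate_answer(question: str, chunks: list[str]) -> str:
    """Pass the retrieved context and the query to the LLM to generate an answer."""
    prompt = (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n---\n".join(chunks) + f"\n\nQuestion: {question}"
    )
    return call_llm(prompt)
```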
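
For the scoring step, the sketch below shows one way the LLM-as-judge outputs could be mapped to the 0-1 range and averaged into overall relevance and groundedness scores. The judge prompts and the raw 0-2 scale are assumptions; only the mapping to 0-1 and the averaging follow the description above.

```python
# Sketch of LLM-as-judge scoring for Answer Relevance and Groundedness.
# The prompts and the raw 0-2 judge scale are assumptions; the mapping to
# [0, 1] and the averaging to an overall score follow the description above.
from statistics import mean
from typing import Callable

RELEVANCE_PROMPT = (
    "Rate from 0 to 2 how completely the ANSWER addresses the QUESTION.\n"
    "QUESTION: {question}\nANSWER: {answer}\nReply with a single number."
)
GROUNDEDNESS_PROMPT = (
    "Rate from 0 to 2 how well every sentence of the ANSWER is supported by the "
    "CONTEXT.\nCONTEXT: {context}\nANSWER: {answer}\nReply with a single number."
)


def judge(llm: Callable[[str], str], prompt: str) -> float:
    """Map the judge's raw 0-2 score to [0, 1] (0 = not relevant/grounded, 1 = fully)."""
    return float(llm(prompt)) / 2.0


def evaluate(llm: Callable[[str], str], samples: list[dict]) -> dict:
    """samples: list of {'question': ..., 'answer': ..., 'context': ...} dicts."""
    relevance = [judge(llm, RELEVANCE_PROMPT.format(**s)) for s in samples]
    groundedness = [judge(llm, GROUNDEDNESS_PROMPT.format(**s)) for s in samples]
    return {"answer_relevance": mean(relevance), "groundedness": mean(groundedness)}
```

Passing the LLM client as a callable keeps the evaluator independent of any specific provider; per-sample scores can also be inspected individually to find the low-scoring outliers discussed in the conclusions.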
