LLM-based Evaluator for Context Relevance
Goal
Assess how well an LLM can identify relevant context chunks given question-chunk pairs, so that it can be used as a second step in our retrieval system to filter out irrelevant candidates.

Data
Modules: Agiles Mindset and Kritisches Denken, with 20 test questions from each module.

Method/Approach
The LLM produces a relevance score for each question-chunk pair. The prompt is taken from TruLens Context Relevance. The LLM rates the context chunk from 1 to 10, and the score is normalized to a 0-1 scale.
LLM used: gpt-4-0125-preview

Results
Recall: 100% for both Agiles Mindset (AM) and Kritisches Denken (KD).
Precision: AM 71%, KD 33%.
Accuracy: AM 81%, KD 52%.

Evaluation Metrics
Accuracy: the percentage of predictions that are correct.
Recall: true positives / (true positives + false negatives), i.e. the share of actually relevant chunks that were predicted as relevant. Crucial when the cost of false negatives is high (a chunk predicted as 0 when it is in fact relevant).
Precision: true positives / (true positives + false positives), i.e. the share of chunks predicted as relevant ('1') that are actually relevant. Crucial when the cost of false positives is high (a chunk predicted as 1 when it is not relevant).

Conclusions
The LLM-based context relevance evaluator predicted every relevant chunk as relevant (100% recall). However, accuracy and precision differed considerably between the two data sets. The evaluator was less accurate for Kritisches Denken (52% accuracy) than for Agiles Mindset (81% accuracy), and precision was also lower (33% versus 71%).
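
The scoring and metric definitions above can be illustrated with a minimal Python sketch. Several parts are assumptions rather than details from this evaluation: the normalization is taken to be a simple division by 10, the 0.5 cut-off for turning a normalized score into a 0/1 prediction is hypothetical (the report does not state which threshold was used), and the function names and the toy scores are made up for illustration and are not the study data.

```python
from dataclasses import dataclass
from typing import Sequence


def normalize_score(raw_score: int) -> float:
    """Map a 1-10 LLM rating onto a 0-1 scale (assumed scheme: divide by 10)."""
    if not 1 <= raw_score <= 10:
        raise ValueError(f"expected a rating from 1 to 10, got {raw_score}")
    return raw_score / 10


def to_label(normalized: float, threshold: float = 0.5) -> int:
    """Turn a normalized score into a binary prediction (hypothetical threshold)."""
    return 1 if normalized >= threshold else 0


@dataclass
class Metrics:
    accuracy: float
    recall: float
    precision: float


def evaluate(predicted: Sequence[int], actual: Sequence[int]) -> Metrics:
    """Compute accuracy, recall and precision from 0/1 predictions and labels."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and equally long")
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)  # relevant, predicted relevant
    fp = sum(1 for p, a in pairs if p == 1 and a == 0)  # irrelevant, predicted relevant
    fn = sum(1 for p, a in pairs if p == 0 and a == 1)  # relevant, predicted irrelevant
    tn = sum(1 for p, a in pairs if p == 0 and a == 0)  # irrelevant, predicted irrelevant
    return Metrics(
        accuracy=(tp + tn) / len(pairs),
        recall=tp / (tp + fn) if tp + fn else 0.0,
        precision=tp / (tp + fp) if tp + fp else 0.0,
    )


# Toy example (invented ratings and labels, not the evaluation data).
raw_scores = [9, 8, 3, 7, 2, 6]
actual_labels = [1, 1, 0, 0, 0, 1]
predicted_labels = [to_label(normalize_score(s)) for s in raw_scores]
metrics = evaluate(predicted_labels, actual_labels)
print(metrics)  # recall is 1.0 and precision is 0.75 for this toy data
```

In the toy data no relevant chunk is missed, so recall is 100%, while one irrelevant chunk receives a high rating and lowers precision. This is the same pattern as in the results above: a filter with perfect recall can still let many irrelevant candidates through.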