Recommendations

The following are our main recommendations:

  • Understanding the problem. Given the many available models and their derived versions, it is essential to select the right model for the task at hand. Natural language processing models are generally grouped by the problem they solve, such as text classification, question answering, translation, text generation, clustering, and sentence similarity. Framing the problem in the right setting will help to select the best model.

  • Understanding benchmarks and their use. Models are assessed using benchmark datasets for specific tasks. For instance, the Semantic Textual Similarity (STS) benchmark is commonly used for semantic text similarity evaluations. These benchmarks can use metrics such as Pearson's correlation coefficient to quantify how well a model's scores align with human judgments. Benchmarks can also report other factors such as model size, memory usage, embedding dimensions, and maximum token capacity. All these factors are relevant to selecting the best model under given constraints.
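As a minimal sketch of the kind of agreement measure such benchmarks report, the snippet below computes Pearson's correlation coefficient between a model's similarity scores and human judgments. The score values are illustrative, not taken from any real benchmark.

```python
import numpy as np

# Hypothetical similarity scores from a model (0-1) and human
# judgments (0-5) for the same five sentence pairs.
model_scores = np.array([0.9, 0.1, 0.5, 0.7, 0.3])
human_scores = np.array([4.8, 0.5, 2.6, 3.9, 1.2])

# Pearson's r quantifies linear agreement between the two score sets;
# values near 1 indicate the model ranks pairs much like humans do.
r = np.corrcoef(model_scores, human_scores)[0, 1]
```

A value of `r` close to 1 here would indicate strong alignment with human judgment on these pairs.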

  • Understanding the technology. A good grasp of embeddings, LLMs, encoders, and information retrieval in general would help build custom pipelines that maximise system performance.
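To make the role of embeddings concrete, the sketch below compares toy vectors with cosine similarity, the measure most embedding-based retrieval pipelines rely on. The three-dimensional vectors are invented for illustration; real models produce hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": doc_a points in nearly the same direction as the
# query, doc_b does not.
query = [0.2, 0.8, 0.1]
doc_a = [0.25, 0.75, 0.05]
doc_b = [0.9, 0.05, 0.4]
```

A retrieval pipeline would rank `doc_a` above `doc_b` for this query because its cosine similarity is higher.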

  • Understanding model performance based on language. Most models have a performance bias towards English. If a model works well in English, this may not be true in another language, such as German.

  • Understanding how and when to apply translation. Several models perform very well on the translation task; this makes it possible to translate all text to English, perform the necessary AI task (matching and retrieval, question answering, etc.) in English, and then translate the results back to the target language. This will often perform better than working in a non-English language directly.
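The translate-then-process pipeline can be sketched as below. Both `translate()` (an identity placeholder here) and `retrieve()` (a naive word-overlap ranker) are hypothetical stand-ins for a real translation model and an embedding-based retriever.

```python
def translate(text, source_lang, target_lang):
    # Placeholder: a real implementation would call a translation model.
    return text

def retrieve(query, corpus, k=1):
    # Placeholder ranker: score passages by word overlap with the query.
    query_words = set(query.lower().split())
    def overlap(passage):
        return len(query_words & set(passage.lower().split()))
    return sorted(corpus, key=overlap, reverse=True)[:k]

def answer_in_language(query, corpus_en, lang):
    query_en = translate(query, lang, "en")           # 1. translate to English
    results_en = retrieve(query_en, corpus_en)        # 2. do the task in English
    return [translate(r, "en", lang) for r in results_en]  # 3. translate back

corpus = ["cats sleep most of the day",
          "the weather is sunny today"]
result = answer_in_language("how is the weather", corpus, "de")
```

Swapping the placeholders for a real translation model and retriever preserves the same three-step structure.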

  • Understanding text chunking. Models allow choosing a fixed-length chunk size for splitting text. The chosen size has a big impact on the model’s performance and its capacity to retrieve relevant passages. Splitting text into meaningful chunks with a semantic parser based on paragraphs or semantic units, rather than a fixed size, ensures that each chunk retains relevant context.

  • Understanding model parameters. Models have several parameters that affect the outcomes. For example, the LLM temperature parameter increases the randomness of a model’s response in the question-answering task. If replication of results is important, this parameter should be set to zero. Otherwise, the response will include variations that are difficult to replicate.
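The effect of temperature can be illustrated with a toy sampler over token scores, not any particular LLM's API: at temperature zero the choice collapses to a deterministic argmax, while higher values flatten the distribution and introduce variation.

```python
import math
import random

def sample_token(logits, temperature):
    """Sample a token index from raw scores. temperature == 0 reduces to
    a deterministic argmax; higher values make low-scoring tokens likelier."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return random.choices(range(len(logits)), weights=weights)[0]
```

Calling `sample_token` repeatedly with `temperature=0` always returns the same index, which is why zero temperature is the right setting when results must be reproducible.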

  • Understanding your data and how to balance it. Selecting a representative data sample to carry out tests and experiments is important. If the data selection is unbalanced, models can perform well in tests, but this behaviour will not necessarily transfer to the whole dataset. This is even more relevant if training a model is necessary.
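One common way to build a balanced test sample is stratified sampling: draw the same number of examples from each class so the majority class cannot dominate the evaluation. The sketch below assumes records are dictionaries with a label field.

```python
import random
from collections import defaultdict

def stratified_sample(records, label_key, n_per_class, seed=0):
    """Draw up to n_per_class examples from each class, so tests are not
    dominated by the majority class. A fixed seed keeps runs reproducible."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec[label_key]].append(rec)
    sample = []
    for items in by_label.values():
        sample.extend(rng.sample(items, min(n_per_class, len(items))))
    return sample
```

On a 90/10 split between two classes, this yields a test set with equal counts from each, rather than mirroring the imbalance.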

  • Data preprocessing. The cleaner the data, the better models will perform. Several models have achieved comparable results, so effort may be better spent curating data than tuning model parameters.
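A minimal cleaning pass might look like the following: strip leftover HTML tags, normalise Unicode character forms, and collapse runs of whitespace. Real pipelines usually add corpus-specific steps on top of this.

```python
import re
import unicodedata

def clean_text(text):
    """Basic text cleaning: remove HTML remnants, normalise Unicode,
    and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)        # drop HTML tags
    text = unicodedata.normalize("NFKC", text)  # unify character forms
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
    return text
```

Running such a pass before embedding or indexing removes noise that would otherwise degrade retrieval quality.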

  • Knowing how to use zero-shot and few-shot learning in RAG systems. Use zero-shot learning by providing clear task instructions in the prompt without any specific examples. In the few-shot case, a few examples of the desired task should be included directly in the prompt.
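The difference between the two prompting styles can be sketched with a small prompt builder; the instruction, examples, and `Input:`/`Output:` layout below are illustrative, not a prescribed format.

```python
def build_prompt(instruction, query, examples=None):
    """Zero-shot: instruction plus query only. Few-shot: the same prompt
    with a handful of worked examples placed before the query."""
    parts = [instruction]
    for ex_in, ex_out in (examples or []):
        parts.append(f"Input: {ex_in}\nOutput: {ex_out}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

# Zero-shot: no examples in the prompt.
zero = build_prompt("Classify the sentiment as positive or negative.",
                    "I loved this film.")

# Few-shot: two worked examples precede the query.
few = build_prompt("Classify the sentiment as positive or negative.",
                   "I loved this film.",
                   examples=[("Great service.", "positive"),
                             ("Terrible food.", "negative")])
```

In a RAG system, retrieved passages would typically be appended to the instruction before the examples and query.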

  • Finally, we recommend that the deployment planning of a system such as the one discussed in this document incorporate an adequate risk assessment to mitigate any bias that may be introduced.