Key concepts

This section reviews the key techniques needed to understand the work carried out, aiming to provide a general overview of the methods used rather than their technical details. We start with the BM25 algorithm, widely accepted as one of the strongest methods for keyword-based text search. We then give a general explanation of embeddings, of two architectures used to retrieve (bi-encoder) and re-rank or classify (cross-encoder) text, and of LLMs. Finally, we introduce the concept of Retrieval-Augmented Generation.

  1. BM25: Best Matching 25 [1] is a widely used ranking algorithm in information retrieval. It evaluates the relevance of a document (e.g. a paragraph, text snippet or chunk) to a query by considering how frequently the query terms appear in the document (term frequency, TF) and how rare those terms are across all documents (inverse document frequency, IDF), while normalising for document length to avoid bias toward longer documents. This makes it effective for keyword-based search in systems such as search engines and text databases. The algorithm is a standard baseline for academic and industrial applications, and it is implemented in search engines such as Elasticsearch¹, Apache Lucene² and Solr³. An illustrative scoring sketch is given after this list.

  2. Embeddings: In this context, embeddings are dense numerical vectors in a high-dimensional space produced by models built on the transformer architecture to represent text. They capture the meaning and context of words, phrases or sentences by processing text through layers of self-attention and feed-forward networks. Unlike static embeddings such as Word2Vec, transformer embeddings are contextual: the representation of a word adapts to the surrounding words. For instance, in "a light bulb" versus "travel light", the word "light" is represented differently because the model understands the context in which it occurs (a short sketch after this list illustrates this). This ability to model relationships bidirectionally, considering both preceding and following words, allows transformer embeddings to encode rich and nuanced language information. Embeddings are fundamental to many NLP applications due to their capacity to generalise across tasks. Pre-trained transformer models such as GPT, T5 or DistilBERT learn embeddings from vast quantities of text data, capturing linguistic patterns, syntax and semantics. These embeddings can then be fine-tuned or applied directly to tasks such as text classification, machine translation or summarisation. The attention mechanism of the transformer architecture ensures that relationships between all words in a sentence are considered, enabling the embeddings to represent complex dependencies.

  3. Bi-encoder: A bi-encoder architecture uses two separate neural networks to independently encode two inputs (e.g., a query and a document) into fixed-size vector representations, which are then compared (e.g., via cosine similarity) to measure their semantic relevance or similarity. An alternative configuration uses the same neural network to encode both inputs, which is generally referred to as a shared-weight bi-encoder. This is the configuration used in this project, and we refer to it simply by its general bi-encoder name. A retrieval sketch using this configuration follows this list.

  4. Cross-encoder: A cross-encoder architecture takes two inputs (e.g., a query and a document) and processes them together in a single neural network, allowing it to directly model their interaction and output a relevance score or classification. It often achieves higher accuracy than a bi-encoder, at the cost of slower inference because every query-document pair must be processed jointly. A re-ranking sketch follows this list.

  5. LLM: Large Language Models (LLMs) are advanced neural networks designed to process, understand and generate human-like text (and, in multimodal variants, images, video and other digital signals). They are built with billions of parameters and trained on massive datasets, using techniques such as pretraining (on general text) and fine-tuning (for specific tasks). LLMs leverage the transformer [2] architecture for efficient language understanding and generation. Key applications include chatbots, virtual assistants and content creation; language translation and summarisation; code generation and debugging; domain-specific tasks (e.g., legal, medical or scientific analysis); and education and personalised learning tools. The challenges of LLMs include bias in training data, ethical concerns about misuse, high computational costs, and their limited ability to truly understand context and meaning.

  6. RAG: Retrieval-Augmented Generation is an approach that combines information retrieval systems, which fetch relevant documents or facts from external knowledge bases, with generative language models, which process and synthesise the retrieved information alongside the user's query to generate coherent and context-aware responses. This makes it well suited to applications such as question answering, summarisation and conversational AI. A sketch of this retrieve-then-generate flow is given after this list.
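
The following is a minimal, illustrative sketch of the BM25 scoring described in item 1. The toy corpus, the whitespace tokenisation and the parameter values k1 = 1.5 and b = 0.75 are assumptions made for the example, not the configuration used in this work.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one tokenised document against a query with BM25."""
    N = len(corpus)                                  # total number of documents
    avgdl = sum(len(d) for d in corpus) / N          # average document length
    tf = Counter(doc_terms)                          # term frequencies in this document
    score = 0.0
    for term in query_terms:
        n_t = sum(1 for d in corpus if term in d)    # documents containing the term
        idf = math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)   # rare terms weigh more
        freq = tf[term]
        norm = freq + k1 * (1 - b + b * len(doc_terms) / avgdl)  # length normalisation
        score += idf * freq * (k1 + 1) / norm
    return score

corpus = [doc.split() for doc in [
    "the light bulb glows",
    "travel light and fast",
    "search engines rank documents",
]]
query = "light bulb".split()
ranked = sorted(corpus, key=lambda d: bm25_score(query, d, corpus), reverse=True)
print(ranked[0])   # the document containing both query terms ranks first
```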
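
To illustrate the contextual embeddings of item 2, the sketch below uses a pre-trained DistilBERT model from the Hugging Face transformers library (an assumed model choice, not necessarily the one used in this work) to show that the token "light" receives different vectors in "a light bulb" and "travel light".

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "distilbert-base-uncased"   # assumed model, chosen only for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def embedding_of(word, sentence):
    """Return the contextual embedding of `word` inside `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]       # one vector per token
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

v1 = embedding_of("light", "a light bulb")
v2 = embedding_of("light", "travel light")
similarity = torch.cosine_similarity(v1, v2, dim=0).item()
print(f"cosine similarity: {similarity:.2f}")   # below 1.0: the two 'light' vectors differ
```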
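
A minimal sketch of the bi-encoder retrieval described in item 3, assuming the sentence-transformers library and an off-the-shelf model (all-MiniLM-L6-v2); a single shared-weight encoder embeds the query and the documents, and cosine similarity ranks them.

```python
from sentence_transformers import SentenceTransformer, util

# Assumed off-the-shelf model; the same weights encode both query and documents,
# i.e. a shared-weight bi-encoder.
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "BM25 ranks documents by term frequency and inverse document frequency.",
    "Cross-encoders score a query and a document jointly.",
    "The capital of France is Paris.",
]
query = "How does keyword-based ranking work?"

doc_embeddings = model.encode(documents, convert_to_tensor=True)   # can be computed offline
query_embedding = model.encode(query, convert_to_tensor=True)      # computed at query time

scores = util.cos_sim(query_embedding, doc_embeddings)[0]          # cosine similarities
best = int(scores.argmax())
print(documents[best], float(scores[best]))
```

Because the document embeddings do not depend on the query, they can be pre-computed and indexed, which is what makes bi-encoders attractive for retrieval.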
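
For the cross-encoder of item 4, the sketch below re-scores query-document pairs; the model name is an assumed publicly available re-ranker, used only for illustration.

```python
from sentence_transformers import CrossEncoder

# Assumed publicly available re-ranking model.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How does keyword-based ranking work?"
candidates = [
    "BM25 ranks documents by term frequency and inverse document frequency.",
    "The capital of France is Paris.",
]

# Each (query, document) pair goes through the network together, so the model
# can attend across both texts before emitting a relevance score.
scores = model.predict([(query, doc) for doc in candidates])
for doc, score in sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.2f}  {doc}")
```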
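
Finally, a sketch of the retrieve-then-generate flow of item 6 under simplifying assumptions: the bi-encoder above stands in for the retriever, the three-sentence knowledge base is invented for the example, and `generate_answer` is a hypothetical placeholder for whichever LLM produces the final response.

```python
from sentence_transformers import SentenceTransformer, util

retriever = SentenceTransformer("all-MiniLM-L6-v2")   # assumed retriever model

knowledge_base = [
    "BM25 is a ranking function based on term frequency and inverse document frequency.",
    "A bi-encoder embeds queries and documents independently into the same vector space.",
    "A cross-encoder scores a query and a document jointly for higher accuracy.",
]
kb_embeddings = retriever.encode(knowledge_base, convert_to_tensor=True)

def generate_answer(prompt: str) -> str:
    """Hypothetical placeholder: a real system would call an LLM here."""
    return f"[LLM response to a prompt of {len(prompt)} characters]"

def rag(question: str, top_k: int = 2) -> str:
    # 1. Retrieve the most relevant passages from the knowledge base.
    q_emb = retriever.encode(question, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, kb_embeddings)[0]
    top = scores.topk(top_k).indices.tolist()
    context = "\n".join(knowledge_base[i] for i in top)
    # 2. Augment the prompt with the retrieved context and generate the answer.
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return generate_answer(prompt)

print(rag("What is the difference between a bi-encoder and a cross-encoder?"))
```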

1 https://www.elastic.co/elasticsearch

2 https://lucene.apache.org

3 https://solr.apache.org
