Augmenting Large Language Models via Vector Embeddings to Improve Domain-specific Responsiveness

Nathan M. Wolfrath; Nathaniel B. Verhagen; Bradley H. Crotty; Melek Somai; Anai N. Kothari

doi:10.3791/66796


Summary

In this protocol, foundation large language model response quality is improved via augmentation with peer-reviewed, domain-specific scientific articles through a vector embedding mechanism. Additionally, code is provided to aid in performance comparison across large language models.

Abstract

Large language models (LLMs) have emerged as a popular resource for generating information relevant to a user query. Such models are created through a resource-intensive training process utilizing an extensive, static corpus of textual data. This static nature results in limitations for adoption in domains with rapidly changing knowledge, proprietary information, and sensitive data. In this work, methods are outlined for augmenting general-purpose LLMs, known as foundation models, with domain-specific information using an embeddings-based approach for incorporating up-to-date, peer-reviewed scientific manuscripts. This is achieved through open-source tools such as Llama-Index and publicly available models such as Llama-2 to maximize transparency, user privacy and control, and replicability. While scientific manuscripts are used as an example use case, this approach can be extended to any text data source. Additionally, methods for evaluating model performance following this enhancement are discussed. These methods enable the rapid development of LLM systems for highly specialized domains regardless of the comprehensiveness of information in the training corpus.

Introduction

Large language models (LLMs) such as OpenAI's ChatGPT or Meta AI's Llama have rapidly become a popular resource for generating text relevant to a user prompt. Originally functioning to predict the next lexical items in a sequence, these models have evolved to understand context, encode clinical information, and demonstrate high performance on a variety of tasks¹,²,³,⁴. Though language models predate such capabilities and their current level of popularity by decades⁵, recent advances in deep learning and comput....

Protocol

In the use case demonstrated in this paper, the vector store was generated using published guidelines from the Chicago Consensus Working Group¹⁷. This expert group was established to develop guidelines for the management of peritoneal cancers. The subject area was chosen as it is within the investigators' area of clinical expertise. The set of papers was accessed from online journal repositories including Cancer and the Annals of Surgical Oncology. A compact (33.4M parameters) embedding model .......
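
The retrieval idea behind a vector store can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the protocol itself relies on Llama-Index and a compact embedding model, whereas the sketch below swaps in a toy bag-of-words vector so that it runs without any dependencies. The function names (embed, build_index, retrieve, build_prompt) and the placeholder passages are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector.

    ASSUMPTION: a real pipeline would call a learned embedding model here.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_index(chunks):
    """Pair every text chunk with its vector (the 'vector store')."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, query, top_k=2):
    """Return the top_k chunks most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


def build_prompt(query, context_chunks):
    """Prepend the retrieved chunks to the user question."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Placeholder passages (hypothetical, not taken from the guidelines).
chunks = [
    "Placeholder passage about journal submission formatting.",
    "Placeholder passage about peritoneal cancer management guidelines.",
    "Placeholder passage about software installation steps.",
]
index = build_index(chunks)
query = "How are peritoneal cancers managed?"
top_chunks = retrieve(index, query, top_k=1)
prompt = build_prompt(query, top_chunks)
print(prompt)
```

The augmented prompt, rather than the bare question, is what would be sent to the foundation model. With a learned embedding model in place of the word counts, semantically related passages can be retrieved even when they share no exact words with the query.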

Representative Results

A set of 22 publications from the Chicago Consensus Working Group management guidelines was used to augment the base Llama-2-7b model¹⁷. The documents were converted into a vector index using the tool Llama-Index to generate Llama-2-7b-CCWG-Embed. Popular OpenAI models such as GPT-3.5 and GPT-4 were also augmented in a similar fashion to produce GPT-XX-CCWG-Embed models. A total of 20 multiple-choice questions (MCQs) were developed to assess knowledge related to the management of a variety of perito.......
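
Scoring multiple-choice answers across several models can be sketched as follows. This is an illustrative outline only and not the evaluation code distributed with the protocol; the function names (score_mcq, compare_models), the model labels, and the answer data are hypothetical placeholders rather than results from the study.

```python
def score_mcq(answer_key, responses):
    """Return the fraction of questions answered correctly.

    Unanswered questions count as incorrect; letter case and
    surrounding whitespace are ignored.
    """
    if not answer_key:
        raise ValueError("answer_key must contain at least one question")
    correct = sum(
        1
        for question_id, expected in answer_key.items()
        if responses.get(question_id, "").strip().upper() == expected.strip().upper()
    )
    return correct / len(answer_key)


def compare_models(answer_key, responses_by_model):
    """Compute accuracy for each model on the same question set."""
    return {
        model_name: score_mcq(answer_key, responses)
        for model_name, responses in responses_by_model.items()
    }


# Hypothetical placeholder data (NOT results from the study).
answer_key = {"q1": "A", "q2": "C", "q3": "B", "q4": "D"}
responses_by_model = {
    "model_x": {"q1": "A", "q2": "B", "q3": "B", "q4": "A"},
    "model_y": {"q1": "a", "q2": "C", "q3": " B "},  # q4 left unanswered
}
accuracies = compare_models(answer_key, responses_by_model)
for model_name, accuracy in accuracies.items():
    print(f"{model_name}: {accuracy:.2f}")
```

Keeping the answer key separate from the per-model responses allows the same question set to be reused for a base model and its augmented counterpart.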

Discussion

The methods provided here aim to facilitate the research of domain-specific applications of LLMs without the need for de novo training or extensive fine-tuning. As LLMs are becoming an area of significant research interest, approaches for augmenting knowledge bases and improving the accuracy of responses will become increasingly important¹⁸,¹⁹,²⁰,²¹. As demonstrated in the provided res.......

Acknowledgements

This work was facilitated by several open-source libraries, most notably llama-index (https://www.llamaindex.ai/), ChromaDB (https://www.trychroma.com/), and LMQL (https://lmql.ai/).

....

Materials

Name	Company	Catalog Number	Comments
pip3 version 22.0.2
Python version 3.10.12

References

Singhal, K., et al. Large language models encode clinical knowledge. Nature. 620 (7972), 172-180 (2023).
Gilson, A., et al. How does ChatGP....

This article has been published
