How-To Tutorials - LLM

81 Articles

Evaluating Large Language Models

Vivekanandan Srinivasan
27 Oct 2023
8 min read
Introduction

A large language model (LLM) is an advanced artificial intelligence model, typically trained on vast amounts of text data, that can generate human-like language. These models can also perform a range of language-related tasks, including translation, text completion, question answering, and more.

In this era of rapid technological advancement, new large language models appear constantly. Despite this, there is no standardized measure for comparing or evaluating their quality.

In this article, we look at the existing evaluation and comparison frameworks for large language models and analyze the factors on which these models should be evaluated.

Evaluating Large Language Models

Need for a comprehensive evaluation framework

Identifying areas of improvement is relatively easy during the early stages of development. However, as the technology advances and new alternatives become available, determining which model is best becomes increasingly tricky. It is therefore essential to have a reliable evaluation framework for judging the quality of large language models accurately. Such a framework can be used in the following ways:

- A proper framework helps authorities and agencies assess the accuracy, safety, usability, and reliability of a model.
- The race among big technology companies to release large language models keeps accelerating; a comprehensive evaluation framework helps stakeholders release models more responsibly.
- A comprehensive evaluation framework helps users of large language models determine how and where to fine-tune a model for practical deployment.

Issues with the existing frameworks

Every large language model has its advantages, but the existing evaluation frameworks fall short in several respects:

- Safety: Some frameworks do not consider safety as an evaluation factor. Although the OpenAI moderation API addresses safety to some extent, it is insufficient.
- Self-sufficiency: The frameworks are scattered in terms of the factors on which they evaluate models; none of them is comprehensive enough to be self-sufficient.

Factors to be considered while evaluating large language models

Only after reviewing the existing evaluation frameworks can one determine the factors that must be considered while assessing the quality of large language models. Here are the key factors.

Model Size and Complexity

The primary factors to evaluate in LLMs are their size and complexity, often indicated by the number of parameters. Generally, larger models have a greater capacity to understand context and generate nuanced responses. However, very large models can require substantial computational resources, making them impractical for some applications. Evaluators must balance model size and computational efficiency based on the use case.
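As a quick illustration of how parameter counts can be inspected in practice, here is a minimal sketch using Hugging Face Transformers. GPT-2 is used only because it is small and public; it is not one of the models discussed here.

# Count the trainable parameters of a pre-trained model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
n_params = sum(p.numel() for p in model.parameters())
print(f"gpt2 has ~{n_params / 1e6:.0f}M parameters")  # roughly 124M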
Training Data Quality and Diversity

The quality and diversity of the training data significantly influence an LLM's performance. Models trained on diverse and representative datasets from varied sources tend to have a broader understanding of language nuances. Evaluators should scrutinize the sources and types of data used for training to ensure the model's reliability across different contexts and domains.

Bias and Fairness

Bias in LLMs is a critical concern, as it can lead to discriminatory or unfair content. Evaluators must assess bias both in the training data and in the generated output, and implement strategies to mitigate it. Ethical considerations demand continuous efforts to improve fairness, ensuring that models do not reinforce societal biases.

Ethical Considerations and Responsible Use

Evaluating LLMs extends beyond technical aspects to ethical considerations. Responsible deployment requires a thorough assessment of potential misuse scenarios. Evaluators must devise guidelines and ethical frameworks to prevent the generation of harmful or malicious content, emphasizing the responsible use of LLMs in applications such as content moderation and chatbots.

Fine-Tuning and Transfer Learning

LLMs are often fine-tuned on specific datasets to adapt them to particular tasks or domains. The fine-tuning process should be scrutinized to ensure the model maintains its integrity and performance while being customized. Additionally, assessing the effectiveness of transfer learning, where models trained on one task are applied to related tasks, is crucial for understanding adaptability and generalizability.

Explainability and Interpretability

Understanding how LLMs arrive at specific conclusions is essential, especially in applications such as legal document analysis and decision-making processes. Evaluators must assess a model's explainability and interpretability. Transparent models enable users to trust the generated output and comprehend the reasoning behind responses, fostering accountability and reliability.

Robustness and Adversarial Attacks

Evaluating the robustness of LLMs involves assessing their performance under various conditions, including noisy input, ambiguous queries, and adversarial attacks. Rigorous testing against adversarial inputs helps identify vulnerabilities and weaknesses in the model, guiding the implementation of robustness-enhancing techniques.

Continuous Monitoring and Improvement

The landscape of language understanding is ever-evolving, so continuous monitoring and improvement are vital. Regular updates, addressing emerging challenges, and incorporating user feedback contribute to a model's ongoing enhancement, ensuring its relevance and reliability over time.

Step-by-Step Guide: Comparing LLMs Using Perplexity

1. Load language model: Load the pre-trained LLM using a library such as Hugging Face Transformers.
2. Prepare dataset: Tokenize and preprocess your dataset for the language model.
3. Train/test split: Split the dataset into training and testing sets.
4. Train LLM: Fine-tune the LLM on the training dataset.
5. Calculate perplexity: Use the testing dataset to calculate perplexity.
Code example:

# Calculate perplexity
from math import exp

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_text = "Example input text for perplexity calculation."
input_ids = tokenizer.encode(input_text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the language-modeling loss
    output = model(input_ids, labels=input_ids)
    loss = output.loss

perplexity = exp(loss.item())
print("Perplexity:", perplexity)

Methods of evaluation

Quantitative Performance Metrics and Benchmarking

Evaluating LLMs requires rigorous quantitative assessment using industry-standard metrics. BLEU, METEOR, and ROUGE scores are pivotal in assessing text generation quality by comparing generated text with human references. For translation tasks, BLEU (Bilingual Evaluation Understudy) calculates the overlap of n-grams between machine-generated text and human reference translations. METEOR evaluates precision, recall, and synonymy, providing a more nuanced view of translation quality. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is geared toward summarization and emphasizes recall. These metrics offer quantitative benchmarks, enabling direct comparison between different LLMs. Additionally, perplexity, a measure of how well a language model predicts a sample text, provides insight into model efficiency: lower perplexity values indicate better prediction accuracy, reflecting the model's coherence and understanding of the input text. Often applied to large-scale datasets such as WMT (Workshop on Machine Translation) or COCO (Common Objects in Context), these quantitative metrics offer a robust foundation for comparing LLMs' performance.
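As a complement to the perplexity snippet above, here is a rough sketch of computing a sentence-level BLEU score with NLTK. This is one possible tooling choice, not the only way to obtain the metric, and real evaluations aggregate scores over full test sets rather than single sentences.

# Sentence-level BLEU between a reference translation and a model output
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cat is sitting on the mat".split()
candidate = "the cat sits on the mat".split()

score = sentence_bleu(
    [reference],                                     # BLEU accepts multiple references
    candidate,
    smoothing_function=SmoothingFunction().method1,  # avoid zero scores on short sentences
)
print(f"BLEU: {score:.3f}")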
Diversity Analysis and Bias Detection

Diversity and bias analysis are paramount in evaluating LLMs, ensuring equitable and inclusive performance across diverse demographics and contexts. One critical approach involves employing word embedding techniques, such as the Word Embedding Association Test (WEAT), to quantify biases. WEAT assesses associations between word embeddings and predefined categories, revealing biases present in LLMs. By evaluating gender, racial, or cultural biases, organizations can ensure fair and unbiased responses, in line with ethical considerations.

Furthermore, demographic diversity analysis measures the model's performance across different demographic groups. Assessing demographic parity ensures that LLMs provide consistent, unbiased results across various user segments. This comprehensive evaluation approach, rooted in fairness and inclusivity, is pivotal in selecting socially responsible LLMs.

Real-World User Studies and Interaction Analysis

Incorporating real-world user studies and interaction analysis is indispensable for evaluating LLMs in practical scenarios. User tests and surveys provide qualitative insights into user satisfaction, comprehension, and trust. These studies consider how well LLM-generated content aligns with users' expectations and domain-specific contexts.

Additionally, analyzing user interactions with LLM-generated content through techniques such as eye-tracking studies and click-through rates provides valuable behavioral data. Heatmap analysis, which captures user attention patterns, offers insight into the effectiveness of LLM-generated text elements. User feedback and interaction analysis inform iterative improvements, ensuring that LLMs are technically robust, user-centric, and aligned with real-world application requirements.

Conclusion

The development of large language models has revolutionized natural language processing, but a standardized and comprehensive evaluation framework for assessing their quality is still missing. Though the existing frameworks offer valuable insights, they lack standardization and comprehensiveness, and they largely ignore safety as an evaluation factor. Collaborating with relevant experts is therefore imperative to build a comprehensive and authentic evaluation framework for large language models.

Author Bio

Vivekanandan, a seasoned Data Specialist with over a decade of expertise in Data Science and Big Data, excels in intricate projects spanning diverse domains. Proficient in cloud analytics and data warehouses, he holds degrees in Industrial Engineering, Big Data Analytics from IIM Bangalore, and Data Science from Eastern University.

As a Certified SAFe Product Manager and Practitioner, Vivekanandan ranks in the top 1 percentile on Kaggle globally. Beyond corporate excellence, he shares his knowledge as a Data Science guest faculty and advisor for educational institutes.

Build Your LLM-Powered Personal Website

Louis Owen
11 Sep 2023
8 min read
Introduction

Since ChatGPT shocked the world with its capabilities, AI has been adopted in numerous fields: customer service assistants, marketing content creation, code assistants, travel itinerary planning, investment analysis, and more.

However, have you ever thought about utilizing AI, or more specifically a Large Language Model (LLM) like ChatGPT, as your own personal AI assistant on your website?

A personal website, often called a personal web portfolio, showcases everything we want to present to the world: a short biography, work experience, projects, achievements, paper publications, and anything else related to our professional work. We put this website live on the internet, and people come and browse its content by scrolling and surfing the pages.

What if we changed the user experience a bit, from scrolling and surfing to asking a query? What if we added a small search bar or widget where visitors can directly ask anything they want to know about us and get the answer immediately? Imagine a head-hunter or hiring manager opening your personal website. They already have specific criteria for the candidates they want to hire. With a search bar or similar widget on our website, they can directly ask what they want to know about us, improving our chances of being approached. It also signals that we are adopting the latest technology available in the market, which will surely earn extra points on their scoring board.

In this article, I'll guide you through building your own LLM-powered personal website. We'll start by discussing several freely available LLMs that we can utilize. Then we'll go through the step-by-step process of building our personal AI assistant by exploiting an LLM's capability as a Question and Answering (QnA) module. As a hint, we'll use one of the task-specific models provided by AI21 Labs as our LLM; they provide a 3-month free trial worth $90, or 18,000 free calls for the QnA model. Finally, we'll see how to put our personal AI assistant on our website.

Without wasting any more time, let's take a deep breath, make ourselves comfortable, and be ready to learn how to build your own LLM-powered personal website!

Freely Available LLMs

The main engine of our personal AI assistant is an LLM. The question is: which LLM should we use?

There are many LLMs available in the market right now, from open source to closed source. There are two main differences between them: open-source LLMs are free, but you need to host them yourself; closed-source LLMs are not free, but you don't need to host them yourself, you just call an API.

As for open-source LLMs, the go-to model for many use cases is LLaMA-2 by Meta AI. Since LLMs consume a large amount of GPU memory, in practice we usually perform 4-bit quantization to reduce memory usage. Thanks to the open-source community, you can now directly use the quantized versions of LLaMA-2 released by TheBloke on the Hugging Face Hub. To host the LLM, we can also utilize a very powerful inference server called Text Generation Inference (TGI).

The next question is whether there are freely available GPU machines that we can use to host the LLM. We can't use Google Colab, since we need a server that the personal website can send API requests to. Luckily, there are two free options: Google Cloud Platform and SaturnCloud. Both offer free trial accounts for renting GPU machines.
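If you do go the open-source route, here is a minimal sketch of what 4-bit loading can look like with the transformers and bitsandbytes libraries. The model ID, prompt, and generation settings are just illustrative assumptions, exact arguments may differ between library versions, and you can skip this entirely if you follow the closed-source route used in the rest of this article.

# A rough sketch: load a causal LM in 4-bit to fit it on a small GPU
# (assumes transformers, bitsandbytes, and accelerate are installed, a CUDA GPU is
# available, and you have access to the chosen checkpoint on the Hugging Face Hub)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-chat-hf"  # example choice, not a requirement

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit to cut GPU memory
    bnb_4bit_compute_dtype=torch.float16,  # run the matmuls in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                     # let accelerate place layers on the GPU
)

prompt = "Summarize my work experience in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))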
An open-source LLM like LLaMA-2 is free, but it comes with the additional hassle of hosting it ourselves. In this article, we'll use a closed-source LLM as our personal AI assistant instead. However, most closed-source LLMs that can be accessed via API are not free: GPT-3.5 and GPT-4 by OpenAI, Claude by Anthropic, Jurassic by AI21 Labs, and so on.

Luckily, AI21 Labs offers a free trial worth $90! Moreover, they also provide task-specific models that are charged per API call rather than per token, as with most other closed-source LLMs. This is very suitable for our use case, since we'll have long input prompts!

Let's dive deeper into the AI21 Labs LLM, specifically the QnA model, which we'll be using as our personal AI assistant.

AI21 Labs QnA LLM

AI21 Labs provides numerous task-specific models that offer out-of-the-box reading and writing capabilities. The LLM we'll be using is fine-tuned specifically for the QnA task, or what they call the "Contextual Answers" model. We just need to provide the context and a query, and it returns an answer based solely on the information available in the context. This model is priced at $0.005 per API request, which means that with our $90 free trial account, we can send 18,000 API calls! Isn't that amazing?

Without further ado, let's start building our personal AI assistant.

1. Create an AI21 Labs Free Trial Account

To use the QnA model, we just need to create a free trial account on the AI21 Labs website. You can follow the steps on the website; it's as easy as creating an account on most other sites.

2. Enter the Playground

Once you have the free trial account, go to the AI21 Studio page and select "Contextual Answers" under the "Task-Specific Models" section in the left bar. Then go to the Playground to test the model. Inside the Playground of the QnA model, there are two input fields and one output field. As input, we pass the context (the knowledge list) and the query. As output, we get the answer to the given query based on the provided context. What if the answer doesn't exist in the context? The model returns "Answer not in documents." as the fallback.

3. Create the Knowledge List

The next and main task is to create the knowledge list that forms the context input. Think of this knowledge list as the Knowledge Base (KB) for the model: the model can only answer a query based on the information available in this KB.

4. Test with Several Queries

Most likely, our first set of knowledge is not exhaustive. We therefore need several iterations of testing to keep expanding the list while maintaining the quality of the returned answers. We can start by listing the queries our web visitors are likely to ask, then add answers for each of them to the knowledge list; a scripted sketch of this kind of testing follows below. Pro tip: once our assistant is deployed on our website, we can also add a logger to store all queries and responses we receive. Using that log data, we can further expand our knowledge list, making our AI assistant "smarter".
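If you prefer scripting these tests instead of clicking through the Playground, the sketch below shows the general idea using Python's requests library. Note that the endpoint path and payload field names are assumptions modeled on AI21 Studio's REST conventions at the time of writing, and the knowledge text is a made-up example; check the current AI21 documentation for the exact request format.

# Hedged sketch: loop a list of test questions through the Contextual Answers endpoint
import requests

AI21_API_KEY = "<your-ai21-api-key>"
KNOWLEDGE = (
    "Louis is a data scientist with 5 years of experience. "
    "He has worked on NLP, conversational AI, and FinTech projects."
)

def ask(question: str) -> str:
    resp = requests.post(
        "https://api.ai21.com/studio/v1/answer",             # assumed endpoint path
        headers={"Authorization": f"Bearer {AI21_API_KEY}"},
        json={"context": KNOWLEDGE, "question": question},   # assumed field names
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("answer", "Answer not in documents.")

for q in ["What does Louis do?", "Where did Louis study?"]:
    print(q, "->", ask(q))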
5. Embed the AI Assistant on Our Website

Until now, we have only played with the LLM in the Playground. However, our goal is to put it inside our web portfolio. Thanks to AI21 Labs, we can do this easily by adding a snippet of JavaScript code to our website. Click the three-dots button at the top right of the context input and choose the "Code" option. A pop-up page will be shown, and you can copy and paste the JavaScript code directly into your personal website. That's it!

Conclusion

Congratulations on making it this far! Hopefully, many new LLM-powered portfolios will be developed after this article is published. Throughout this article, you have learned how to build your own LLM-powered personal website: the motivation, freely available LLMs with their pros and cons, AI21 Labs' task-specific models, creating your own knowledge list along with some tips, and finally how to embed your AI assistant in your personal website. See you in the next article!

Author Bio

Louis Owen is a data scientist/AI engineer from Indonesia who is always hungry for new knowledge. Throughout his career journey, he has worked in various fields of industry, including NGOs, e-commerce, conversational AI, OTA, Smart City, and FinTech. Outside of work, he loves to spend his time helping data science enthusiasts to become data scientists, either through his articles or through mentoring sessions. He also loves to spend his spare time doing his hobbies: watching movies and conducting side projects.

Currently, Louis is an NLP Research Engineer at Yellow.ai, the world's leading CX automation platform. Check out Louis' website to learn more about him! Lastly, if you have any queries or any topics to be discussed, please reach out to Louis via LinkedIn.

Using LangChain for Large Language Model — Powered Applications

Avratanu Biswas
15 Jun 2023
5 min read
This article is the first part of a series; please refer to Part 2 to learn how to build and deploy LLM-powered apps with the LangChain framework.

Introduction

LangChain is a powerful, open-source Python library specifically designed to enhance the usability, accessibility, and versatility of Large Language Models (LLMs) such as GPT-3 (Generative Pre-trained Transformer 3), BERT (Bidirectional Encoder Representations from Transformers), and BLOOM (BigScience Large Open-science Open-access Multilingual Language Model). It provides developers with a comprehensive set of tools to seamlessly combine multiple prompts, creating a harmonious orchestra for working with LLMs effortlessly. The project was initiated by Harrison Chase, with the first commit made in late October 2022. In just a few weeks, LangChain gained immense popularity within the open-source community.

Image 1: The popularity of the LangChain Python library

LangChain for LLMs

To fully grasp the fundamentals of LangChain and utilize it effectively, understanding the fundamentals of LLMs is essential. In simple terms, LLMs are sophisticated language models or AI systems that have been extensively trained on massive amounts of text data to comprehend and generate human-like language. Despite their powerful capabilities, LLMs are generic in nature, lacking domain-specific knowledge or expertise. For instance, when addressing queries in fields like medicine or law, an LLM can provide general insights but may struggle to offer in-depth or nuanced responses that require specialized expertise. Alongside such limitations, LLMs are susceptible to biases and inaccuracies present in the training data, which can yield contextually plausible yet incorrect outputs. This is where LangChain shines: it serves as an open-source library that leverages the power of LLMs and mitigates their drawbacks by providing abstractions and a diverse range of modules, akin to Lego blocks, thus facilitating intuitive integration with other tools and knowledge bases.

In brief, LangChain presents a useful approach for handling text data, wherein the initial step involves preprocessing the large corpus by segmenting it into smaller chunks or summaries. These chunks are then transformed into vector representations, enabling efficient comparison and retrieval of similar chunks when questions are posed. This approach of preprocessing, real-time data collection, and interaction with the LLM is not only applicable to this specific context but can also be effectively utilized in other scenarios, such as code and semantic search.

Image 2: Typical workflow of LangChain (image created by the author)

A typical workflow of LangChain involves several steps that enable efficient interaction between the user, the preprocessed text corpus, and the LLM. Notably, the strengths of LangChain lie in the abstraction layer it provides, streamlining the intricate process of composing and integrating these text components, thereby enhancing overall efficiency and effectiveness.

Key Attributes Offered by LangChain

The core concept behind LangChain is its ability to connect a "chain of thoughts" around LLMs, as evident from its name. However, LangChain is not limited to just a few LLMs; it provides a wide range of components that work together as building blocks for advanced use cases involving LLMs.
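Before looking at the individual components, here is a minimal sketch of the kind of composition LangChain enables, using the classic 2023-era API that appears elsewhere in this series: a prompt template chained to an LLM. The API key placeholder and the example topic are assumptions for illustration.

# A prompt template chained to an OpenAI LLM
from langchain import PromptTemplate
from langchain.chains import LLMChain
from langchain.llms import OpenAI

prompt = PromptTemplate(
    template="Explain {topic} to a beginner in two sentences.",
    input_variables=["topic"],
)
llm = OpenAI(temperature=0, openai_api_key="<your-openai-api-key>")

chain = LLMChain(llm=llm, prompt=prompt)
print(chain.run(topic="vector databases"))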
Now, let's delve into the various components that the LangChain library offers, making our work with LLMs easier and more efficient.

Image 3: LangChain features at a glance (image created by the author)

Prompts and Prompt Templates: Prompts refer to the inputs or queries we send to LLMs. As we have experienced with ChatGPT, the quality of the response depends heavily on the prompt. LangChain provides several functionalities to simplify the construction and handling of prompts. A prompt template consists of multiple parts, including instructions, content, and queries.

Models: While LangChain itself does not provide LLMs, it leverages various Language Models (such as GPT-3 and BLOOM, discussed earlier), Chat Models (like gpt-3.5-turbo), and Text Embedding Models (offered by CohereAI, HuggingFace, and OpenAI).

Chains: Chains are an end-to-end wrapper around multiple individual components, playing a major role in LangChain. The two most common types of chains are LLM chains and vector index chains.

Memory: By default, chains in LangChain are stateless, treating each incoming query or input independently without retaining context (i.e., lacking memory). To overcome this limitation, LangChain supports both short-term memory (using previous conversational messages or summarized messages) and long-term memory (managing the retrieval and updating of information between conversations).

Indexes: Index modules provide various document loaders to connect with different data sources, along with utility functions to seamlessly integrate with external vector databases like Pinecone, ChromaDB, and Weaviate, enabling smooth handling of large arrays of vector embeddings. The index types include Document Loaders, Text Splitters, Retrievers, and Vectorstores.

Agents: While the sequence of chains is often deterministic, in certain applications the sequence of calls may not be, with the next step depending on the user input and previous responses. Agents utilize LLMs to determine the appropriate actions and their order, and they perform these tasks using a suite of tools.

Limitations of LangChain usage

Abstraction challenge for debugging: The comprehensive abstraction provided by LangChain poses challenges for debugging, as it becomes difficult to comprehend the underlying processes.

Higher token consumption due to prompt coupling: Coupling a chain of prompts when executing multiple chains for a specific task often leads to higher token consumption, making it less cost-effective.

Increased latency and slower performance: The latency experienced when using LangChain in applications with agents or tools is higher, resulting in slower performance.

Overall, LangChain provides a broad spectrum of features and modules that greatly enhance our interaction with LLMs. In the subsequent sections, we will explore the practical usage of LangChain and demonstrate how to build simple demo web applications using its capabilities.

References

https://docs.langchain.com/docs/
https://github.com/hwchase17/langchain
https://medium.com/databutton/getting-started-with-langchain-a-powerful-tool-for-working-with-large-language-models-286419ba0842
https://medium.com/@avra42/how-to-build-a-personalized-pdf-chat-bot-with-conversational-memory-965280c160f8

Author

Avratanu Biswas, Ph.D. Student (Biophysics), Educator, and Content Creator (Data Science, ML & AI).

Twitter    YouTube    Medium    GitHub

Optimized LLM Serving with TGI

Louis Owen
25 Sep 2023
7 min read
Introduction

The release of the LLaMA-2 LLM to the public has been a tremendous contribution by Meta to the open-source community. It is so powerful that many developers are fine-tuning it for many different use cases. Closed-source LLMs such as GPT-3.5, GPT-4, or Claude are very convenient to utilize, since we just need to send API requests to power our application. On the other hand, utilizing open-source LLMs such as LLaMA-2 in production is not an easy task. Although there are several algorithms that support deploying an LLM on a CPU, most of the time we still need a GPU for better throughput.

At a high level, there are two main things to take care of when deploying an LLM in production: memory and throughput. LLMs are very big; the smallest version of LLaMA-2 has 7 billion parameters, which requires around 28 GB of GPU RAM just to load the model. Google Colab offers a free NVIDIA T4 GPU with 16 GB of memory, meaning that we can't even load the smallest version of LLaMA-2 on the GPU provided for free by Google Colab.

There are two popular ways to solve this memory issue: half-precision and 4/8-bit quantization.

Half-precision is a very simple optimization method. In PyTorch, we just need to add the following line, and we can load the 7B model using only around 13 GB of GPU RAM. Essentially, half-precision means representing the weights of the model with 16-bit floating-point numbers (fp16) instead of 32-bit floating-point numbers (fp32).

torch.set_default_tensor_type(torch.cuda.HalfTensor)

Another way to solve the memory issue is to perform 4/8-bit quantization. Two quantization algorithms are widely adopted by developers: GPT-Q and bitsandbytes. Both are supported by the `transformers` package. Quantization reduces the model size by representing the model weights in lower-precision data types, such as 8-bit integers (int8) or 4-bit integers (int4), instead of 32-bit floating point (fp32) or 16-bit floating point (fp16).

For comparison, after applying 4-bit quantization with the GPT-Q algorithm, we can load the 7B model using only around 4 GB of GPU RAM. This is indeed a huge drop, from 28 GB to 4 GB!
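A quick back-of-the-envelope calculation makes these numbers concrete. This is only a rough sketch that counts weight storage and ignores activations, the KV cache, and framework overhead, which is why the real figures quoted above are slightly higher.

# Approximate GPU memory needed just to hold LLaMA-2-7B's weights at different precisions
params = 7_000_000_000

for name, bytes_per_param in [("fp32", 4), ("fp16 (half precision)", 2), ("int4 (4-bit quantized)", 0.5)]:
    gb = params * bytes_per_param / 1e9
    print(f"{name:>24}: ~{gb:.1f} GB")
# prints roughly 28 GB, 14 GB, and 3.5 GB, in line with the figures above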
Once we're able to load the model, the simplest way to serve it is with the native Hugging Face workflow along with Flask or FastAPI. However, serving the model with this simple solution is not scalable and reliable enough for production, since it can't handle parallel requests decently. This is the throughput problem mentioned earlier. There are many ways to address it: continuous batching, tensor parallelism, prompt caching, paged attention, prefill parallel computing, KV caching, and more. Discussing each of these techniques would make for a very long article. Luckily, there is an inference serving library for LLMs that applies all of these techniques to ensure reliable serving in production: Text Generation Inference (TGI).

Throughout this article, we'll discuss in more detail what TGI is and how to utilize it to serve your LLM. We'll look at TGI's top features and walk through a step-by-step tutorial on how to spin up TGI as an inference server. Without wasting any more time, let's take a deep breath, make ourselves comfortable, and be ready to learn how to utilize TGI as an inference server!

What's TGI?

Text Generation Inference (TGI) is like the `transformers` package for inference servers. In fact, TGI is used in production at Hugging Face to power many applications. It provides a lot of useful features:

- A straightforward launcher for the most popular large language models.
- Tensor parallelism to enhance inference speed across multiple GPUs.
- Token streaming through Server-Sent Events (SSE).
- Continuous batching of incoming requests to boost overall throughput.
- Optimized transformer inference code using flash attention and paged attention on leading architectures.
- Quantization with bitsandbytes and GPT-Q.
- Fast and safe weight loading with the Safetensors format.
- Watermarking using A Watermark for Large Language Models.
- A logits warper for operations such as temperature scaling, top-p, top-k, and repetition penalty.
- Stop sequences and log probabilities.
- Production readiness with distributed tracing through OpenTelemetry and Prometheus metrics.
- Custom prompt generation, allowing users to guide the model's output with custom prompts.
- Support for fine-tuned models, enabling the use of task-specific models for higher accuracy and performance.
- RoPE scaling to extend the context length during inference.

TGI is surely a one-stop solution for serving LLMs. However, one important note about TGI concerns the project license. Starting from v1.0, Hugging Face has updated the TGI license to the HFOIL 1.0 license. For more details, please refer to this GitHub issue. If you're using TGI only for research purposes, there's nothing to worry about. If you're using TGI for commercial purposes, make sure to read the license carefully, since there are cases where it can and can't be used commercially.

How to Use TGI?

Using TGI is relatively easy and straightforward. We can use Docker to spin up the server. Note that you need to install the NVIDIA Container Toolkit to use the GPU.

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.0.3 --model-id $model

Once the server is up, we can easily send API requests to it using the following command.

curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is LLM?","parameters":{"max_new_tokens":120}}' \
    -H 'Content-Type: application/json'

For more details regarding the API, please refer to this page. Another way to send API requests to the TGI server is from Python. To do that, we first need to install the Python client library.

pip install text-generation

Once the library is installed, we can send API requests like the following.

from text_generation import Client

client = Client("http://127.0.0.1:8080")
print(client.generate("What is LLM?", max_new_tokens=120).generated_text)

text = ""
for response in client.generate_stream("What is Deep Learning?", max_new_tokens=20):
    if not response.token.special:
        text += response.token.text
print(text)

Conclusion

Congratulations on keeping up to this point!
Throughout this article, you have learned what needs to be considered when serving your own LLM. You have also learned about TGI: what it is, what its features are, and step-by-step examples of how to get a TGI server up. Best of luck on your LLM deployment journey, and see you in the next article!

Author Bio

Louis Owen is a data scientist/AI engineer from Indonesia who is always hungry for new knowledge. Throughout his career journey, he has worked in various fields of industry, including NGOs, e-commerce, conversational AI, OTA, Smart City, and FinTech. Outside of work, he loves to spend his time helping data science enthusiasts to become data scientists, either through his articles or through mentoring sessions. He also loves to spend his spare time doing his hobbies: watching movies and conducting side projects.

Currently, Louis is an NLP Research Engineer at Yellow.ai, the world's leading CX automation platform. Check out Louis' website to learn more about him! Lastly, if you have any queries or any topics to be discussed, please reach out to Louis via LinkedIn.

Building and deploying Web App using LangChain

Avratanu Biswas
26 Jun 2023
12 min read
So far, we've explored the LangChain modules and how to use them (refer to the earlier blog post on LangChain modules here). In this section, we'll focus on the LangChain Indexes and Agent modules and also walk through the process of creating and launching a web application that everyone can access. To make things easier, we'll be using Databutton, an all-in-one online workspace for building and deploying web apps, integrated with Streamlit, a Python web development framework known for its support for building interactive web applications.

What are LangChain Agents?

In simpler terms, LangChain Agents are tools that enable Large Language Models (LLMs) to perform various actions, such as accessing Google search, executing Python calculations, or making SQL queries, thereby empowering LLMs to make informed decisions and interact with users by using tools and observing their outputs. The official LangChain documentation describes Agents as:

"…there is an agent which has access to a suite of tools. Depending on the user input, the agent can then decide which, if any, of these tools to call…"

In building agents, there are several abstractions involved. The Agent abstraction contains the application logic, receiving user input and previous steps to return either an AgentAction (tool and input) or AgentFinish (completion information). Tools represent the actions agents can take, while Toolkits group tools for specific use cases (e.g., SQL querying). Lastly, the Agent Executor manages the iterative execution of the agent with the available tools. In this section, we will briefly explore these abstractions while using the Agent functionality to integrate tools, and primarily focus on building a real-world, easily deployable web application.

Indexes

This module provides utility functions for structuring documents using indexes and allowing LLMs to interact with them effectively. We will focus on one of the most commonly used retrieval systems, where indexes are used to find the most relevant documents based on a user's query. Additionally, LangChain supports various index and retrieval types, with a focus on vector databases for unstructured data. We will explore this component in detail, as it can be leveraged in a wide number of real-world applications.

Image 1: LangChain workflow, by the author

The image shows the workflow of a question and answer generation interface using a retrieval index, where we leverage all of the index types that LangChain provides. Indexes are primarily of four types: Document Loaders, Text Splitters, VectorStores, and Retrievers.
Briefly: (a) the documents fetched from a data source are split into chunks using text splitter modules, (b) embeddings are created, (c) they are stored in a vector store index (vector databases such as ChromaDB, Pinecone, Weaviate, etc.), and (d) the user's queries are answered via a retrieval QA chain.

We will use the WikipediaLoader to load Wikipedia documents related to our query "LangChain" and retrieve the metadata and a portion of the page content of the first document.

from langchain.document_loaders import WikipediaLoader

docs = WikipediaLoader(query='LangChain', load_max_docs=2).load()
docs[0].metadata
docs[0].page_content[:400]

CharacterTextSplitter is used to split the loaded documents into smaller chunks for further processing.

from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=4000, chunk_overlap=0)
texts = text_splitter.split_documents(docs)

The OpenAIEmbeddings module is then employed to generate embeddings for the text chunks.

from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

We will use the Chroma vector store, which is created from the generated text chunks and embeddings, allowing for efficient storage and retrieval of vectorized data. Next, the RetrievalQA module is instantiated with an OpenAI LLM and the created retriever, setting up a question-answering system.

from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

db = Chroma.from_documents(texts, embeddings)
retriever = db.as_retriever()

qa = RetrievalQA.from_chain_type(llm=OpenAI(openai_api_key=OPENAI_API_KEY), chain_type="stuff", retriever=retriever)

At this stage, we can easily seek answers from the stored, indexed data. For instance:

query = "What is LangChain?"
qa.run(query)

LangChain is a framework designed to simplify the creation of applications using large language models (LLMs).

query = "When was LangChain founded?"
qa.run(query)

LangChain was founded in October 2022.

query = "Who is the founder?"
qa.run(query)

The founder of LangChain is Harrison Chase.

The Q&A functionality implemented using the retrieval chain provides reasonable answers to most of our queries. The different types of indexes provided by LangChain can be leveraged for various real-world use cases involving data structuring and retrieval. Moving forward, we will focus on the final component, the "Agent". In this section, we will not only gain a hands-on understanding of its usage but also build and deploy a web app using an online workspace called Databutton.

Building a Web App using Databutton

Prerequisites

To begin using Databutton, all that is required is to sign up through their official website. Once logged in, we can either create a blank template app from scratch or choose from the pre-existing templates provided by Databutton.

Image by the author | Screenshot showing how to start working with a new blank app

Once the blank app is created, we get an online workspace with several features for building and deploying the app. We can immediately begin writing our code within the online editor. The only requirement at this stage is to include the necessary packages or dependencies that our app requires.

Image by the author | Screenshot showing the different components available within the Databutton app's online workspace
Databutton's workspace initialization includes some essential packages by default. However, for our specific app, we need to add two additional packages: openai and langchain. This can easily be done within the "configuration" workspace of Databutton.

Image by the author | Screenshot of the configuration options within Databutton's online workspace, where we can add the additional packages our app needs; the workspace is generated with a few pre-installed dependencies

Writing the code

Now that we have a basic understanding of Agents and their abstractions, let's put them to use, alongside some basic Streamlit syntax for the front end.

Importing the required modules: For building the web app, we will require the Streamlit library and several LangChain modules. Additionally, we will utilize a helper function that relies on the sys and io libraries for capturing and displaying function outputs. We will discuss the significance of this helper function towards the end.

# Modules to import
import streamlit as st
import sys
import io
import re
from typing import Callable, Any
from langchain.agents.tools import Tool
from langchain.agents import initialize_agent
from langchain.llms import OpenAI
from langchain.chains import LLMChain
from langchain import LLMMathChain
from langchain import PromptTemplate

Using the LangChain modules and building the main user interface: We set the title of the app using st.title() and let the user enter their OpenAI API key using the st.text_input() widget.

# Set the title of the app
st.title("LangChain `Agent` Module Usage Demo App")

# Get the OpenAI API key from the user
OPENAI_API_KEY = st.text_input(
    "Enter your OpenAI API Key to get started", type="password"
)

As discussed in the previous sections, we need to define a template for the prompt that incorporates a placeholder for the user's query.

# Define a template for the prompt
template = """You are a friendly and polite AI Chat Assistant.
You must try to provide accurate but concise answers.
If you don't know the answer, just say "I don't know."

Question: {query}

Answer: """

# Create a prompt template object with the template
prompt = PromptTemplate(template=template, input_variables=["query"])

Next, we implement a conditional flow: if the user has provided an OpenAI API key, we proceed with the app and ask the user to enter their query using the st.text_input() widget.

# Check if the user has entered an OpenAI API key
if OPENAI_API_KEY:
    # Get the user's query
    query = st.text_input("Ask me anything")

Once the user has inserted the correct API key, we proceed with the implementation of the LangChain modules. Some of these modules may be new to us, while others have already been covered in previous sections.

Next, we create instances of the OpenAI language model, the LLMMathChain for math-related queries, and the LLMChain for general-purpose queries.

# Check if the user has entered a query
if query:
    # Create an instance of the OpenAI language model
    llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)

    # Create an instance of the LLMMathChain for math-related queries
    llm_math_chain = LLMMathChain.from_llm(llm=llm, verbose=True)

    # Create an instance of the LLMChain for general-purpose queries
    llm_chain = LLMChain(llm=llm, prompt=prompt)

Following that, we create a list of tools that the agent will utilize.
Each tool comprises a name, a corresponding function to handle the query, and a brief description.

# Create a list of tools for the agent
tools = [
    Tool(
        name="Search",
        func=llm_chain.run,
        description="Useful for when you need to answer general purpose questions",
    ),
    Tool(
        name="Calculator",
        func=llm_math_chain.run,
        description="Useful for when you need to answer questions about math",
    ),
]

Next, we initialize a zero-shot agent with the tools and other parameters. This agent employs the ReAct framework to determine which tool to utilize based solely on the description associated with each tool, so it is essential to provide a description for every tool.

# Initialize the zero-shot agent with the tools and parameters
zero_shot_agent = initialize_agent(
    agent="zero-shot-react-description",
    tools=tools,
    llm=llm,
    verbose=True,
    max_iterations=3,
)

And now, finally, we could call the zero-shot agent directly with the user's query using the run(query) method.

# st.write(zero_shot_agent.run(query))

However, this would only display the final outcome within our Streamlit UI, without providing access to the underlying LangChain thought process (i.e., the verbose output) that we typically observe in a notebook environment. This information is crucial for understanding which tool our agent chooses based on the user query. To address this, a helper function called capture_and_display_output was created.

# Helper function to capture and display the LangChain verbose output (thought process)
def capture_and_display_output(func: Callable[..., Any], *args, **kwargs) -> Any:
    # Redirect stdout to a string buffer
    original_stdout = sys.stdout
    sys.stdout = output_catcher = io.StringIO()

    # Call the function and capture the response
    response = func(*args, **kwargs)

    # Restore the original stdout and get the captured output
    sys.stdout = original_stdout
    output_text = output_catcher.getvalue()

    # Clean the output text by removing escape sequences
    cleaned_text = re.sub(r"\x1b\[([0-9]{1,2}(;[0-9]{1,2})?)?[m|K]", "", output_text)

    # Split the cleaned text into lines and concatenate them with line breaks
    lines = cleaned_text.split("\n")
    concatenated_string = "\n".join([s if s else "\n" for s in lines])

    # Display the captured output in an expander
    with st.expander("Thoughts", expanded=True):
        st.write(concatenated_string)

    return response

This function allows users to monitor the actions undertaken by the agent. The response from the agent is then displayed within the UI.

# Call the zero-shot agent with the user's query and capture the output
response = capture_and_display_output(zero_shot_agent.run, query)

Image by the author | Screenshot of the app running locally, displaying the full verbose output (the agent's thought process)

Deploying and Testing the App

The app can now be easily deployed by clicking the "Deploy" button at the top left-hand side. The deployed app will provide us with a unique URL that can be shared with everyone!

Image by the author | Screenshot of the Databutton online workspace showing the Deploy options

Yay! We have successfully built and deployed a LangChain-based web app from scratch. Here's the link to the app! The app also has a view-code page, which can be accessed via this link.

To test the web app, we will use two different types of prompts: a general question that any LLM can answer, and a math-related question.
Our hypothesis is that the LangChain agent will intelligently determine which tool to execute and provide the most appropriate response. Let's proceed with the testing to validate this assumption.

Image by the author | Screenshots from the deployed web app: two different prompts were used to validate our assumptions. Based on the thought process displayed in the UI under the "Thoughts" expander, we can easily see which tool the agent has chosen. (Left) The LLMMath chain tool is used. (Right) The simple LLM chain tool is used.

Conclusion

To summarize, we have not only explored various aspects of working with LangChain and LLMs but have also successfully built and deployed a web app powered by LangChain. This demonstrates the versatility and capability of LangChain in enabling the development of powerful applications.

References

LangChain Agents official documentation: https://python.langchain.com/en/latest/modules/agents.html
Databutton: https://www.databutton.io/
Streamlit: https://streamlit.io/
Build a Personal Search Engine Web App using Open AI Text Embeddings: https://medium.com/@avra42/build-a-personal-search-engine-web-app-using-open-ai-text-embeddings-d6541f32892d
Part 1: Using LangChain for Large Language Model — powered Applications: https://www.packtpub.com/article-hub/using-langchain-for-large-language-model-powered-applications
Deployed web app: https://databutton.com/v/23ks6sem
Source code for the app: https://databutton.com/v/23ks6sem/View_Code

Author Bio

Avratanu Biswas, Ph.D. Student (Biophysics), Educator, and Content Creator (Data Science, ML & AI).

Twitter    YouTube    Medium    GitHub

Demystifying Azure OpenAI Service

Olivier Mertens, Breght Van Baelen
15 Sep 2023
16 min read
This article is an excerpt from the book Azure Data and AI Architect Handbook, by Olivier Mertens and Breght Van Baelen. Master core data architecture design concepts and Azure Data & AI services to gain a cloud data and AI architect's perspective on developing end-to-end solutions.

Introduction

OpenAI has risen immensely in popularity with the arrival of ChatGPT. The company, which started as a non-profit organization, has been the driving force behind the GPT and DALL-E model families, with intense research at a massive scale. The speed at which new models are released and become available on Azure has become impressive lately.

Microsoft has a close partnership with OpenAI, after heavy investments in the company by Microsoft. The models created by OpenAI use Azure infrastructure for development and deployment. Within this partnership, OpenAI carries the responsibility of research and innovation, coming up with new models and new versions of their existing models. Microsoft manages the enterprise-scale go-to-market. It provides infrastructure and technical guidance, along with reliable SLAs, to get large organizations started with integrating these models, fine-tuning them on their own data, and hosting private deployments of the models.

Like the face recognition model in Azure Cognitive Services, powerful LLMs such as the ones in Azure OpenAI Service could be used to cause harm at scale. Therefore, this service is also gated according to Microsoft's guidelines on responsible AI.

At the time of writing, Azure OpenAI Service offers access to the following models:

- The GPT model family:
  - GPT-3.5
  - GPT-3.5-Turbo (the model behind ChatGPT)
  - GPT-4
- Codex
- DALL-E 2

Let's dive deeper into these models.

The GPT model family

GPT models, which stands for generative pre-trained transformer models, made their first appearance in 2018 with GPT-1, trained on a dataset of roughly 7,000 books. This was a good advance in performance at the time, but the model was already vastly outdated a couple of years later. GPT-2 followed in 2019, trained on the WebText dataset (a collection of 8 million web pages). In 2020, GPT-3 was released, trained on the WebText dataset, two book corpora, and English Wikipedia.

In these years, there were no major breakthroughs in terms of more efficient algorithms, but rather in the scale of the architecture and datasets. This becomes easily visible when we look at the growing number of parameters used for every new generation of the model, as shown in the following figure.

Figure 9.3 – A visual comparison between the sizes of the different generations of GPT models, based on their trainable parameters

The question is often raised of how to interpret this concept of parameters. An easy analogy is the number of neurons in a brain. Although parameters in a neural network are not equivalent to its artificial neurons, the number of parameters and the number of neurons are heavily correlated: more parameters means more neurons. The more neurons there are in the brain, the more knowledge it can grasp.

Since the arrival of GPT-3, we have seen two major adaptations of the third-generation model. The first is GPT-3.5. This model has a similar architecture to the GPT-3 model but is trained on text and code, whereas the original GPT-3 only saw text data during training.
Therefore, GPT-3.5 is capable of generating and understanding code. GPT-3.5, in turn, became the basis for the next adaptation, the vastly popular ChatGPT model. This model has been fine-tuned for conversational usage, with additional reinforcement learning to instill a sense of ethical behavior.

GPT model sizes

The OpenAI models are available in different sizes, which are all named after remarkable scientists. The GPT-3.5 model, specifically, is available in four versions:

- Ada
- Babbage
- Curie
- Davinci

The Ada model is the smallest, most lightweight model, while Davinci is the most complex and most performant model. The larger the model, the more expensive it is to use, host, and fine-tune, as shown in Figure 9.4. As a side note, when you hear about the absurd number of parameters of new GPT models, this usually refers to the Davinci model.

Figure 9.4 – A trade-off exists between lightweight, cheap models and highly performant, complex models

With a trade-off between cost and performance available, an architect can start thinking about which model size best fits a solution. In reality, this often comes down to empirical testing: if the cheaper model can do the job at an acceptable performance, it is the more cost-effective solution. Note that performance here means predictive power, not the speed at which the model makes predictions; the larger models will be slower to output a prediction than the lightweight models.

Understanding the difference between GPT-3.5 and GPT-3.5-Turbo (ChatGPT)

GPT-3.5 and GPT-3.5-Turbo are both models used to generate natural language text, but they are used in different ways. GPT-3.5 is classified as a text completion model, whereas GPT-3.5-Turbo is referred to as conversational AI.

To better understand the contrast between the two models, we first need to introduce the concept of contextual learning. These models are trained to understand the structure of the input prompt in order to provide a meaningful answer. Contextual learning is often split into few-shot learning, one-shot learning, and zero-shot learning. A "shot", in this context, refers to an example given in the input prompt. With few-shot learning, we provide multiple examples in the input prompt; one-shot learning provides a single example; and zero-shot means that no examples are given. In the latter case, the model has to figure out a different way to understand what is being asked of it (such as interpreting the goal of a question).

Consider the following example:

Figure 9.5 – Few-shot learning takes up the most tokens and requires more effort, but often results in model outputs of higher quality

While it takes more prompt engineering effort to apply few-shot learning, it will usually yield better results. A text completion model, such as GPT-3.5, performs vastly better with few-shot learning than with one-shot or zero-shot learning. As the name suggests, the model figures out the structure of the input prompt (i.e., the examples) and completes the text accordingly.

Conversational AI, such as ChatGPT, is more performant at zero-shot learning. In the preceding example, both models are able to output the correct answer, but as questions become more and more complex, there will be a noticeable difference in predictive performance. Additionally, GPT-3.5-Turbo will remember information from previous input prompts, whereas GPT-3.5 prompts are handled independently.
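To make the idea concrete, here is a minimal sketch of a few-shot classification prompt sent to an Azure OpenAI chat deployment, using the pre-1.0 openai Python SDK (0.28-style). The endpoint, API version, deployment name, and example reviews are placeholders rather than values from the book, and the current SDK exposes a different interface.

# Few-shot prompting against an Azure OpenAI chat deployment (legacy openai SDK)
import openai

openai.api_type = "azure"
openai.api_base = "https://<your-resource>.openai.azure.com/"
openai.api_version = "2023-05-15"
openai.api_key = "<your-key>"

few_shot_messages = [
    {"role": "system", "content": "You classify movie reviews as positive or negative."},
    # The two examples below are the "shots" showing the expected format
    {"role": "user", "content": "Review: 'A wonderful, heartfelt film.'"},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": "Review: 'Two hours I will never get back.'"},
    {"role": "assistant", "content": "negative"},
    # The actual query
    {"role": "user", "content": "Review: 'The plot dragged, but the acting saved it.'"},
]

response = openai.ChatCompletion.create(
    engine="gpt-35-turbo",      # Azure uses the deployment name, not the model name
    messages=few_shot_messages,
    temperature=0,
)
print(response["choices"][0]["message"]["content"])

Dropping the two example exchanges from the messages list turns the same call into a zero-shot prompt, which is where conversational models tend to hold up better than pure text completion models.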
Additionally, GPT-3.5-Turbo will remember information from previous input prompts, whereas GPT-3.5 prompts are handled independently.Innovating with GPT-4With the arrival of GPT-4, the focus has shifted toward multimodality. Multimodality in AI refers to the ability of an AI system to process and interpret information from multiple modalities, such as text, speech, images, and videos. Essentially, it is the capability of AI models to understand and combine data from different sources and formats.GPT-4 is capable of additionally taking images as input and interpreting them. It has stronger reasoning and overall performance than its predecessors. There was a famous example where GPT-4 was able to deduce that balloons would fly upward when asked what would happen if someone cut the balloons' strings, as shown in the following photo.Figure 9.6 – The image in question that was used in the experiment. When asked what would happen if the strings were cut, GPT-4 replied that the balloons would start flying awaySome adaptations of GPT-4, such as the one used in Bing Chat, have the extra feature of citing sources in generated answers. This is a welcome addition, as hallucination was a significant flaw in earlier GPT models.HallucinationHallucination in the context of AI refers to generating wrong predictions with high confidence. It is obvious that this can cause a lot more harm than the model indicating it is not sure how to respond or knowing the answer.Next, we will look at the Codex model.CodexCodex is a model that is architecturally similar to GPT-3, but it fully focuses on code generation and understanding. Furthermore, an adaptation of Codex forms the underlying model for GitHub Copilot, a tool that provides suggestions and auto-completion for code based on context and natural language inputs, available for various integrated development environments (IDEs) such as Visual Studio Code. Instead of a ready-to-use solution, Codex is (like the other models in Azure OpenAI) available as a model endpoint and should be used for integration in custom apps.The Codex model is initially trained on a collection of 54 million code repositories, resulting in billions of lines of code, with the majority of training data written in Python.Codex can generate code in different programming languages based on an input prompt in natural language (text-to-code), explain the function of blocks of code (code-to-text), add comments to code, and debug existing code.Codex is available as a C (cushman) and D (Davinci) model. Lightweight Codex models (A series or B series) currently do not exist.Models such as Codex or GitHub Copilot are a great way to boost the productivity of software engineers, data analysts, data engineers, and data scientists.  They do not replace these roles, as their accuracy is not perfect; rather, they give engineers the opportunity to start editing from a fairly well-written block of code instead of coding from scratch.DALL-E 2The DALL-E model family is used to generate visuals. By providing a  description in natural language in the input prompt, it generates a series of matching images. While other models are often used at scale in large enterprises, DALL-E 2 tends to be more popular in smaller businesses. Organizations that lack an in-house graphic designer can make great use of DALL-E to generate visuals for banners, brochures, emails, web pages, and so on.DALL-E 2 only has a  single model size to choose from, although open-source alternatives exist if a lightweight version is preferred. 
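Since all of these models are exposed as endpoints rather than ready-made applications, integrating one into a custom app usually comes down to a single API call. The following is a minimal sketch using the openai Python package in the Azure-flavoured style that was current around the time of writing (the pre-1.0 SDK); the resource URL, API version, key, deployment name, and prompt are placeholders and assumptions, not values from the book.

```python
import openai

# Placeholder connection details for an Azure OpenAI resource (assumptions for illustration).
openai.api_type = "azure"
openai.api_base = "https://<your-resource>.openai.azure.com/"
openai.api_version = "2023-05-15"
openai.api_key = "<your-key>"

# "engine" is the name you gave the model deployment inside your Azure OpenAI resource.
response = openai.Completion.create(
    engine="<your-deployment-name>",
    prompt="# Python\n# Write a function that reverses a string.",
    max_tokens=150,
    temperature=0,
)

print(response["choices"][0]["text"])
```

The same pattern applies whether the deployment behind the endpoint is a GPT, Codex, or fine-tuned model; only the deployment name and the prompt change.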
Fine-tuning and private deploymentsAs a data architect, it is important to understand the cost structure of these models. The first option is to use the base model in a serverless manner. Similar to how we work with Azure Cognitive Services, users will get a key for the model’s endpoint and simply pay per prediction. For DALL-E 2, costs are incurred per 100 images, while the GPT and Codex models are priced per 1,000 tokens. For every request made to a GPT or Codex model, all tokens of the input prompt and the output are added up to determine the cost of the prediction.TokensIn natural language processing, a token refers to a sequence of characters that represents a distinct unit of meaning in a text. These units do not necessarily correspond to words, although for short words, this is mostly the case. Tokens are used as the basic building blocks to process and analyze text data. A good rule of thumb for the English language is that one token is, on average, four characters. Dividing your total character count by four will make a good estimate of the number of tokens.Azure OpenAI Service also grants extensive fine-tuning functionalities. Up to 1 GB of data can be uploaded per Azure OpenAI instance for fine-tuning. This may not sound like a lot, but note that we are not training a new model from scratch. The goal of fine-tuning is to retrain the last few layers of the model to increase performance on specific tasks or company-specific knowledge. For this process, 1 GB of data is more than sufficient.When adding a fine-tuned model to a solution, two additional costs will be incurred. On top of the token-based inference cost, we need to take into account the training and hosting costs. The hourly training cost can be quite high due to the amount of hardware needed, but compared to the inference and hosting costs during a model’s life cycle, it remains a small percentage. Next, since we are not using the base model anymore and, instead, our own “version” of the model, we will need to host the model ourselves, resulting in an hourly hosting cost.Now that we have covered both pre-trained model collections, Azure Cognitive Services, and Azure OpenAI Service, let’s move on to custom development using Azure Machine Learning.Grounding LLMsOne of the most popular use cases for LLMs involves providing our own data as context to the model (often referred to as grounding). The reason for its popularity is partly due to the fact that many business cases can be solved using a consistent technological architecture. We can reuse the same solution, but by providing different knowledge bases, we can serve different end users.For example, by placing an LLM on top of public data such as product manuals or product specifics, it is easy to develop a customer support chatbot. If we swap out this knowledge base of product information with something such as HR documents, we can reuse the same tech stack to create an internal HR virtual assistant.A common misconception regarding grounding is that a model needs to be trained on our own data. This is not the case. Instead, after a user asks a question, the relevant document (or paragraphs) is injected into the prompt behind the scenes and lives in the memory of the model for the duration of the chat session (when working with conversational AI) or for a single prompt. The context, as we call it, is then wiped clean and all information is forgotten. 
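To make the mechanics of that prompt injection more tangible, here is a small, library-free sketch of the idea: score the stored chunks against the question, keep the most relevant ones, and prepend them to the prompt for a single request. The scoring function, chunk structure, and prompt wording are illustrative assumptions rather than the book's implementation.

```python
# Illustrative sketch of grounding: retrieved context lives only inside the
# prompt for this one request; the model itself is never retrained.

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

def retrieve_top_chunks(question_embedding, indexed_chunks, top_k=2):
    # indexed_chunks: list of (chunk_text, chunk_embedding) pairs from the vector store
    ranked = sorted(
        indexed_chunks,
        key=lambda pair: cosine_similarity(question_embedding, pair[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:top_k]]

def build_grounded_prompt(question, context_chunks):
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Once the response is returned, the retrieved context is discarded along with the rest of the prompt.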
If we wanted to cache this information, it is possible to make use of a framework such as LangChain or Semantic Kernel, but that is out of the scope of this book.

The fact that a model does not get retrained on our own data plays a crucial role in terms of data privacy and cost optimization. As shown before in the section on fine-tuning, as soon as a base model is altered, an hourly operating cost is added to run a private deployment of the model. Also, information from the documents cannot be leaked to other users working with the same model.

Figure 9.7 visualizes the architectural concepts to ground an LLM.

Figure 9.7 – Architecture to ground an LLM

The first thing to do is turn the documents that should be accessible to the model into embeddings. Simply put, embeddings are mathematical representations of natural language text. By turning text into embeddings, it is possible to accurately calculate the similarity (from a semantics perspective) between two pieces of text.

To do this, we can leverage Azure Functions, a service that allows pieces of code to run as serverless functions. It often forms the glue between different components by handling interactions. In this case, an Azure function (on the bottom left of Figure 9.7) will grab the relevant documents from the knowledge base, break them up into chunks (to accommodate the maximum token limits of the model), and generate an embedding for each one. This embedding is then stored, alongside the natural language text, in a vector database. This function should be run for all historic data that will be accessible to the model, as well as triggered for every new, relevant document that is added to the knowledge base.

Once the vector database is in place, users can start asking questions. However, the user questions are not directly sent to the model endpoint. Instead, another Azure function (shown at the top of Figure 9.7) will turn the user question into an embedding and check its similarity with the embeddings of the documents or paragraphs in the vector database. Then, the top X most relevant text chunks are injected into the prompt as context, and the prompt is sent over to the LLM. Finally, the response is returned to the user.

Conclusion

Azure OpenAI Service, a collaboration between OpenAI and Microsoft, delivers potent AI models. The GPT model family, from GPT-1 to GPT-4, has evolved impressively, with GPT-3.5-Turbo (ChatGPT) excelling in conversational AI. GPT-4 introduces multimodal capabilities, such as interpreting images in addition to text.

Codex specializes in code generation, while DALL-E 2 creates visuals from text descriptions. These models empower developers and designers. Customization via fine-tuning offers cost-effective solutions for specific tasks. Leveraging Azure OpenAI Service for your projects enhances productivity.

Grounding language models with user data ensures data privacy and cost efficiency. This collaboration holds promise for innovative AI applications across various domains.

Author Bio

Olivier Mertens is a cloud solution architect for Azure data and AI at Microsoft, based in Dublin, Ireland. In this role, he assists organizations in designing their enterprise-scale data platforms and analytical workloads. Next to his role as an architect, Olivier leads the technical AI expertise for Microsoft EMEA in the corporate market. This includes leading knowledge sharing and internal upskilling, as well as solving highly complex or strategic customer AI cases.
Before his time at Microsoft, he worked as a data scientist at a Microsoft partner in Belgium.Olivier is a lecturer for generative AI and AI solution architectures, a keynote speaker for AI, and holds a master’s degree in information management, a postgraduate degree as an AI business architect, and a bachelor’s degree in business management.Breght Van Baelen is a Microsoft employee based in Dublin, Ireland, and works as a cloud solution architect for the data and AI pillar in Azure. He provides guidance to organizations building large-scale analytical platforms and data solutions. In addition, Breght was chosen as an advanced cloud expert for Power BI and is responsible for providing technical expertise in Europe, the Middle East, and Africa. Before his time at Microsoft, he worked as a data consultant at Microsoft Gold Partners in Belgium.Breght led a team of eight data and AI consultants as a data science lead. Breght holds a master’s degree in computer science from KU Leuven, specializing in AI. He also holds a bachelor’s degree in computer science from the University of Hasselt.

Deploying LLMs with Amazon SageMaker - Part 2

Joshua Arvin Lat
30 Nov 2023
19 min read
Dive deeper into the world of AI innovation and stay ahead of the AI curve! Subscribe to our AI_Distilled newsletter for the latest insights. Don't miss out – sign up today!IntroductionIn the first part of this post, we showed how easy it is to deploy large language models (LLMs) in the cloud using a managed machine learning service called Amazon SageMaker. In just a few steps, we were able to deploy a MistralLite model in a SageMaker Inference Endpoint. If you’ve worked on real ML-powered projects in the past, you probably know that deploying a model is just the first step! There are definitely a few more steps before we can consider that our application is ready for use.If you’re looking for the link to the first part, here it is: Deploying LLMs with Amazon SageMaker - Part 1In this post, we’ll build on top of what we already have in Part 1 and prepare a demo user interface for our chatbot application. That said, we will tackle the following sections in this post:● Section I: Preparing the SageMaker Notebook Instance (discussed in Part 1)● Section II: Deploying an LLM using the SageMaker Python SDK to a SageMaker Inference Endpoint (discussed in Part 1)● Section III: Enabling Data Capture with SageMaker Model Monitor●  Section IV: Invoking the SageMaker inference endpoint using the boto3 client●  Section V: Preparing a Demo UI for our chatbot application●  Section VI: Cleaning UpWithout further ado, let’s begin!Section III: Enabling Data Capture with SageMaker Model MonitorIn order to analyze our deployed LLM, it’s essential that we’re able to collect the requests and responses to a central storage location. Instead of building our own solution that collects the information we need, we can just utilize the built-in Model Monitor capability of SageMaker. Here, all we need to do is prepare the configuration details and run the update_data_capture_config() method of the inference endpoint object and we’ll have the data capture setup enabled right away! That being said, let’s proceed with the steps required to enable and test data capture for our SageMaker Inference endpoint:STEP # 01: Continuing where we left off in Part 1 of this post, let’s get the bucket name of the default bucket used by our session:s3_bucket_name = sagemaker_session.default_bucket() s3_bucket_nameSTEP # 02: In addition to this, let’s prepare and define a few prerequisites as well:prefix = "llm-deployment" base = f"s3://{s3_bucket_name}/{prefix}" s3_capture_upload_path = f"{base}/model-monitor"STEP # 03: Next, let’s define the data capture config:from sagemaker.model_monitor import DataCaptureConfig data_capture_config = DataCaptureConfig(    enable_capture = True,    sampling_percentage=100,    destination_s3_uri=s3_capture_upload_path,    kms_key_id=None,    capture_options=["REQUEST", "RESPONSE"],    csv_content_types=["text/csv"],    json_content_types=["application/json"] )Here, we specify that we’ll be collecting 100% of the requests and responses that pass through the deployed model.STEP # 04: Let’s enable data capture so that we’re able to save in Amazon S3 the request and response data:predictor.update_data_capture_config(    data_capture_config=data_capture_config )Note that this step may take about 8-10 minutes to complete. 
Feel free to grab a cup of coffee or tea while waiting!STEP # 05: Let’s check if we are able to capture the input request and output response by performing another sample request:result = predictor.predict(input_data)[0]["generated_text"] print(result)This should yield the following output:"The meaning of life is a philosophical question that has been debated by thinkers and philosophers for centuries. There is no single answer that can be definitively proven, as the meaning of life is subjective and can vary greatly from person to person.\n\nSome people believe that the meaning of life is to find happiness and fulfillment through personal growth, relationships, and experiences. Others believe that the meaning of life is to serve a greater purpose, such as through a religious or spiritual calling, or by making a positive impact on the world through their work or actions.\n\nUltimately, the meaning of life is a personal journey that each individual must discover for themselves. It may involve exploring different beliefs and perspectives, seeking out new experiences, and reflecting on what brings joy and purpose to one's life."Note that it may take a minute or two before the .jsonl file(s) containing the request and response data appear in our S3 bucket.STEP # 06: Let’s prepare a few more examples:prompt_examples = [    "What is the meaning of life?",    "What is the color of love?",    "How to deploy LLMs using SageMaker",    "When do we use Bedrock and when do we use SageMaker?" ] STEP # 07: Let’s also define the perform_request() function which wraps the relevant lines of code for performing a request to our deployed LLM model:def perform_request(prompt, predictor):    input_data = {        "inputs": f"<|prompter|>{prompt}</s><|assistant|>",        "parameters": {            "do_sample": False,            "max_new_tokens": 2000,            "return_full_text": False,        }    }      response = predictor.predict(input_data)    return response[0]["generated_text"] STEP # 08: Let’s quickly test the perform_request() function:perform_request(prompt_examples[0], predictor=predictor)STEP # 09: With everything ready, let’s use the perform_request() function to perform requests using the examples we’ve prepared in an earlier step:from time import sleep for example in prompt_examples:    print("Input:", example)      generated = perform_request(        prompt=example,        predictor=predictor    )    print("Output:", generated)    print("-"*20)    sleep(1)This should return the following:Input: What is the meaning of life? ... -------------------- Input: What is the color of love? Output: The color of love is often associated with red, which is a vibrant and passionate color that is often used to represent love and romance. Red is a warm and intense color that can evoke strong emotions, making it a popular choice for representing love. However, the color of love is not limited to red. Other colors that are often associated with love include pink, which is a softer and more feminine shade of red, and white, which is often used to represent purity and innocence. Ultimately, the color of love is subjective and can vary depending on personal preferences and cultural associations. Some people may associate love with other colors, such as green, which is often used to represent growth and renewal, or blue, which is often used to represent trust and loyalty. 
...Note that this is just a portion of the overall output and you should get a relatively long response for each input prompt.Section IV: Invoking the SageMaker inference endpoint using the boto3 clientWhile it’s convenient to use the SageMaker Python SDK to invoke our inference endpoint, it’s best that we also know how to use boto3 as well to invoke our deployed model. This will allow us to invoke the inference endpoint from an AWS Lambda function using boto3.Image 10 — Utilizing API Gateway and AWS Lambda to invoke the deployed LLMThis Lambda function would then be triggered by an event from an API Gateway resource similar to what we have in Image 10. Note that we’re not planning to complete the entire setup in this post but having a working example of how to use boto3 to invoke the SageMaker inference endpoint should easily allow you to build an entire working serverless application utilizing API Gateway and AWS Lambda.STEP # 01: Let’s quickly check the endpoint name of the SageMaker inference endpoint:predictor.endpoint_nameThis should return the endpoint name with a format similar to what we have below:'MistralLite-HKGKFRXURT'STEP # 02: Let’s prepare our boto3 client using the following lines of code:import boto3 import json boto3_client = boto3.client('runtime.sagemaker')STEP # 03: Now, let’s invoke the endpointbody = json.dumps(input_data).encode() response = boto3_client.invoke_endpoint(    EndpointName=predictor.endpoint_name,    ContentType='application/json',    Body=body )   result = json.loads(response['Body'].read().decode())STEP # 04: Let’s quickly inspect the result:resultThis should give us the following:[{'generated_text': "The meaning of life is a philosophical question that has been debated by thinkers and philosophers for centuries. There is no single answer that can be definitively proven, as the meaning of life is subjective and can vary greatly from person to person..."}] STEP # 05: Let’s try that again and print the output text:result[0]['generated_text']This should yield the following output:"The meaning of life is a philosophical question that has been debated by thinkers and philosophers for centuries..."STEP # 06: Now, let’s define perform_request_2 which uses the boto3 client to invoke our deployed LLM:def perform_request_2(prompt, boto3_client, predictor):    input_data = {        "inputs": f"<|prompter|>{prompt}</s><|assistant|>",        "parameters": {            "do_sample": False,            "max_new_tokens": 2000,            "return_full_text": False,        }    }      body = json.dumps(input_data).encode()    response = boto3_client.invoke_endpoint(        EndpointName=predictor.endpoint_name,        ContentType='application/json',        Body=body    )      result = json.loads(response['Body'].read().decode())    return result[0]["generated_text"]STEP # 07: Next, let’s run the following block of code to have our deployed LLM answer the same set of questions using the perform_request_2() function:for example in prompt_examples:    print("Input:", example)      generated = perform_request_2(        prompt=example,        boto3_client=boto3_client,        predictor=predictor    )    print("Output:", generated)    print("-"*20)    sleep(1)This will give us the following output:Input: What is the meaning of life? ... -------------------- Input: What is the color of love? Output: The color of love is often associated with red, which is a vibrant and passionate color that is often used to represent love and romance. 
Red is a warm and intense color that can evoke strong emotions, making it a popular choice for representing love. However, the color of love is not limited to red. Other colors that are often associated with love include pink, which is a softer and more feminine shade of red, and white, which is often used to represent purity and innocence. Ultimately, the color of love is subjective and can vary depending on personal preferences and cultural associations. Some people may associate love with other colors, such as green, which is often used to represent growth and renewal, or blue, which is often used to represent trust and loyalty. ... Given that it may take a few minutes before the .jsonl files appear in our S3 bucket, let’s wait for about 3-5 minutes before proceeding to the next section. Feel free to grab a cup of coffee or tea while waiting!STEP # 08: Let’s run the following block of code to list the captured data files stored in our S3 bucket:results = !aws s3 ls {s3_capture_upload_path} --recursive resultsSTEP # 09: In addition to this, let’s store the list inside the processed variable:processed = [] for result in results:    partial = result.split()[-1]    path = f"s3://{s3_bucket_name}/{partial}"    processed.append(path)   processedSTEP # 10: Let’s create a new directory named captured_data using the mkdir command:!mkdir -p captured_dataSTEP # 11: Now, let’s download the .jsonl files from the S3 bucket to the captured_data directory in our SageMaker Notebook Instance:for index, path in enumerate(processed):    print(index, path)    !aws s3 cp {path} captured_data/{index}.jsonlSTEP # 12: Let’s define the load_json_file() function which will help us load files with JSON content:import json def load_json_file(path):    output = []      with open(path) as f:        output = [json.loads(line) for line in f]          return outputSTEP # 13: Using the load_json_file() function we defined in an earlier step, let’s load the .jsonl files and store them inside the all variable for easier viewing:all = [] for i, _ in enumerate(processed):    print(f">: {i}")    new_records = load_json_file(f"captured_data/{i}.jsonl")    all = all + new_records     allRunning this will yield the following response:Image 11 — All captured data points inside the all variableFeel free to analyze the nested structure stored in all variables. In case you’re interested in how this captured data can be analyzed and processed further, you may check Chapter 8, Model Monitoring and Management Solutions of my 2nd book “Machine Learning Engineering on AWS”.Section V: Preparing a Demo UI for our chatbot applicationYears ago, we had to spend a few hours to a few days before we were able to prepare a user interface for a working demo. If you have not used Gradio before, you would be surprised that it only takes a few lines of code to set everything up. 
In the next set of steps, we’ll do just that and utilize the model we’ve deployed in the previous parts of our demo application:STEP # 01: Continuing where we left off in the previous part, let’s install a specific version of gradio using the following command:!pip install gradio==3.49.0STEP # 02: We’ll also be using a specific version of fastapi as well:!pip uninstall -y fastapi !pip install fastapi==0.103.1STEP # 03: Let’s prepare a few examples and store them in a list:prompt_examples = [    "What is the meaning of life?",    "What is the color of love?",    "How to deploy LLMs using SageMaker",    "When do we use Bedrock and when do we use SageMaker?",    "Try again",    "Provide 10 alternatives",    "Summarize the previous answer into at most 2 sentences" ]STEP # 04: In addition to this, let’s define the parameters using the following block of code:parameters = {    "do_sample": False,    "max_new_tokens": 2000, }STEP # 05: Next, define the process_and_response() function which we’ll use to invoke the inference endpoint:def process_and_respond(message, chat_history):    processed_chat_history = ""    if len(chat_history) > 0:        for chat in chat_history:            processed_chat_history += f"<|prompter|>{chat[0]}</s><|assistant|>{chat[1]}</s>"              prompt = f"{processed_chat_history}<|prompter|>{message}</s><|assistant|>"    response = predictor.predict({"inputs": prompt, "parameters": parameters})    parsed_response = response[0]["generated_text"][len(prompt):]    chat_history.append((message, parsed_response))    return "", chat_historySTEP # 06: Now, let’s set up and prepare the user interface we’ll use to interact with our chatbot:import gradio as gr with gr.Blocks(theme=gr.themes.Monochrome(spacing_size="sm")) as demo:    with gr.Row():        with gr.Column():                      message = gr.Textbox(label="Chat Message Box",                                 placeholder="Input message here",                                 show_label=True,                                 lines=12)            submit = gr.Button("Submit")                      examples = gr.Examples(examples=prompt_examples,                                   inputs=message)        with gr.Column():            chatbot = gr.Chatbot(height=900)      submit.click(process_and_respond,                 [message, chatbot],                 [message, chatbot],                 queue=False)Here, we can see the power of Gradio as we only needed a few lines of code to prepare a demo app.STEP # 07: Now, let’s launch our demo application using the launch() method:demo.launch(share=True, auth=("admin", "replacethis1234!"))This will yield the following logs:Running on local URL:  http://127.0.0.1:7860 Running on public URL: https://123456789012345.gradio.live STEP # 08: Open the public URL in a new browser tab. This will load a login page which will require us to input the username and password before we are able to access the chatbot.Image 12 — Login pageSpecify admin and replacethis1234! in the login form to proceed.STEP # 09: After signing in using the credentials, we’ll be able to access a chat interface similar to what we have in Image 13. Here, we can try out various types of prompts.Image 13 — The chatbot interfaceHere, we have a Chat Message Box where we can input and run our different prompts on the left side of the screen. We would then see the current conversation on the right side.STEP # 10: Click the first example “What is the meaning of life?”. 
This will auto-populate the text area similar to what we have in Image 14:Image 14 — Using one of the examples to populate the Chat Message BoxSTEP # 11:Click the Submit button afterwards. After a few seconds, we should get the following response in the chat box:Image 15 — Response of the deployed modelAmazing, right? Here, we just asked the AI what the meaning of life is.STEP # 12: Click the last example “Summarize the previous answer into at most 2 sentences”. This will auto-populate the text area with the said example. Click the Submit button afterward.Image 16 — Summarizing the previous answer into at most 2 sentencesFeel free to try other prompts. Note that we are not limited to the prompts available in the list of examples in the interface.Important Note: Like other similar AI/ML solutions, there's the risk of hallucinations or the generation of misleading information. That said, it's critical that we exercise caution and validate the outputs produced by any Generative AI-powered system to ensure the accuracy of the results.Section VI: Cleaning UpWe’re not done yet! Cleaning up the resources we’ve created and launched is a very important step as this will help us ensure that we don’t pay for the resources we’re not planning to use.STEP # 01: Once you’re done trying out various types of prompts, feel free to turn off and clean up the resources launched and created using the following lines of code:demo.close() predictor.delete_endpoint()STEP # 02: Make sure to turn off (or delete) the SageMaker Notebook instance as well. I’ll leave this to you as an exercise!Wasn’t that easy?! As you can see, deploying LLMs with Amazon SageMaker is straightforward and easy. Given that Amazon SageMaker handles most of the heavy lifting to manage the infrastructure, we’re able to focus more on the deployment of our machine learning model. We are just scratching the surface as there is a long list of capabilities and features available in SageMaker. If you want to take things to the next level, feel free to read 2 of my books focusing heavily on SageMaker: “Machine Learning with Amazon SageMaker Cookbook” and “Machine Learning Engineering on AWS”.Author BioJoshua Arvin Lat is the Chief Technology Officer (CTO) of NuWorks Interactive Labs, Inc. He previously served as the CTO of 3 Australian-owned companies and also served as the Director for Software Development and Engineering for multiple e-commerce startups in the past. Years ago, he and his team won 1st place in a global cybersecurity competition with their published research paper. He is also an AWS Machine Learning Hero and he has been sharing his knowledge in several international conferences to discuss practical strategies on machine learning, engineering, security, and management. He is also the author of the books "Machine Learning with Amazon SageMaker Cookbook", "Machine Learning Engineering on AWS", and "Building and Automating Penetration Testing Labs in the Cloud". Due to his proven track record in leading digital transformation within organizations, he has been recognized as one of the prestigious Orange Boomerang: Digital Leader of the Year 2023 award winners.

Fine-Tuning Large Language Models (LLMs)

Amita Kapoor
11 Sep 2023
12 min read
IntroductionIn the bustling metropolis of machine learning and natural language processing, Large Language Models (LLMs) such as GPT-4 are the skyscrapers that touch the clouds. From chatty chatbots to prolific prose generators, they stand tall, powering a myriad of applications. Yet, like any grand structure, they're not one-size-fits-all. Sometimes, they need a little nipping and tucking to shine their brightest. Dive in as we unravel the art and craft of fine-tuning these linguistic behemoths, sprinkled with code confetti for the hands-on aficionados out there.What's In a Fine-Tune?In a world where a top chef can make spaghetti or sushi but needs finesse for regional dishes like 'Masala Dosa' or 'Tarte Tatin', LLMs are similar: versatile but requiring specialization for specific tasks. A general LLM might misinterpret rare medical terms or downplay symptoms, but with medical text fine-tuning, it can distinguish nuanced health issues. In law, a misread word can change legal interpretations; by refining the LLM with legal documents, we achieve accurate clause interpretation. In finance, where terms like "bearish" and "bullish" are pivotal, specialized training ensures the model's accuracy in financial analysis and predictions.Whipping Up the Perfect AI RecipeJust as a master chef carefully chooses specific ingredients and techniques to curate a gourmet dish, in the vast culinary world of Large Language Models, we have a delectable array of fine-tuning techniques to concoct the ideal AI delicacy. Before we dive into the details, feast your eyes on the visual smorgasbord below, which provides an at-a-glance overview of these methods. With this flavour-rich foundation, we're all set to embark on our fine-tuning journey, focusing on the PEFT method and the Flan-T5 model on the Hugging Face platform. Aprons on, and let's get cooking!Fine Tuning Flan-T5Google AI's Flan-T5, an advanced version of the T5 model, excels in LLMs with its capability to handle text and code. It specialises in Text generation, Translation, Summarization, Question Answering, and Code Generation. Unlike GPT-3 and LLAMA, Flan-T5 is open-source, benefiting researchers worldwide. With configurations ranging from 60M to 11B parameters, it balances versatility and power, though larger models demand more computational resources.For this article, we will leverage the DialogSum dataset, a robust resource boasting 13,460 dialogues, supplemented with manually labelled summaries and topics (and an additional 100 holdout data entries for topic generation). This dataset will serve as the foundation for fine-tuning our open-source giant, Flan-T5, to specialise it for dialogue summarization tasks. Setting the Stage: Preparing the Tool ChestTo fine-tune effectively, ensure your digital setup is optimized. Here's a quick checklist:Hardware: Use platforms like Google Colab.RAM: Memory depends on model parameters. For example:   Memory (MTotal) = 4 x (Number of Parameters x 4 bytes)For a 247,577,856 parameter model (flan-t5-base), around 3.7GB is needed for parameters, gradients, and optimizer states. Ideally, have at least 8GB RAMGPU: A high-end GPU, such as NVIDIA Tesla P100 or T4, speeds up training and inference. Aim for 12GB or more GPU memory, accounting for overheads.Libraries: Like chefs need the right tools, AI fine-tuning demands specific libraries for algorithms, models, and evaluation tools.Remember, your setup is as crucial as the process itself. 
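As a quick sanity check, the rule of thumb above can be expressed in a couple of lines of Python. This is only a rough estimate under the stated assumption (weights, gradients, and optimizer states together costing about four times the raw parameter memory at 4 bytes per parameter); real usage also depends on batch size, precision, and activation memory.

```python
def estimate_training_memory_gb(num_parameters, bytes_per_param=4, multiplier=4):
    """Rough estimate: parameters, gradients, and optimizer states together
    take roughly `multiplier` times the raw parameter memory."""
    total_bytes = multiplier * num_parameters * bytes_per_param
    return total_bytes / (1024 ** 3)

# flan-t5-base has 247,577,856 trainable parameters
print(f"{estimate_training_memory_gb(247_577_856):.1f} GB")  # ~3.7 GB
```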
Let's conjure up the essential libraries, by running the following command:!pip install \    transformers \    datasets \    evaluate \    rouge_score \    loralib \    peftWith these tools in hand, we're now primed to move deeper into the world of fine-tuning. Let's dive right in! Next, it's essential to set up our environment with the necessary tools:from datasets import load_dataset from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer import torch import time import evaluate import pandas as pd import numpy as np To put our fine-tuning steps into motion, we first need a dataset to work with. Enter the DialogSum dataset, an extensive collection tailored for dialogue summarization:dataset_name = "knkarthick/dialogsum" dataset = load_dataset(dataset_name)Executing this code, we've swiftly loaded the DialogSum dataset. With our data playground ready, we can take a closer look at its structure and content to better understand the challenges and potentials of our fine-tuning process. DialogSum dataset is neatly structured into three segments:Train: With a generous 12,460 dialogues, this segment is the backbone of our model's learning process.Validation: A set of 500 dialogues, this slice of the dataset aids in fine-tuning the model, ensuring it doesn't merely memorise but truly understandsTest: This final 1,500 dialogue portion stands as the litmus test, determining how well our model has grasped the art of dialogue summarization.Each dialogue entry is accompanied by a unique 'id', a 'summary' of the conversation, and a 'topic' to give context.Before fine-tuning, let's gear up with our main tool: the Flan-T5 model, specifically it's base' variant from Google, which balances performance and efficiency. Using AutoModelForSeq2SeqLM, we effortlessly load the pre-trained Flan-T5, set to use torch.bfloat16 for optimal memory and precision. Alongside, we have the tokenizer, essential for translating text into a model-friendly format. Both are sourced from google/flan-t5-base, ensuring seamless compatibility. Now, let's get this code rolling:model_name='google/flan-t5-base' original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16) tokenizer = AutoTokenizer.from_pretrained(model_name)Understanding Flan-T5 requires a look at its structure, particularly its parameters. Knowing the number of trainable parameters shows the model's adaptability. The total parameters reflect its complexity. The following code will count these parameters and calculate the ratio of trainable ones, giving insight into the model's flexibility during fine-tuning.Let's now decipher these statistics for our Flan-T5 model:def get_model_parameters_info(model): total_parameters = sum(param.numel() for param in model.parameters()) trainable_parameters = sum(param.numel() for param in model.parameters() if param.requires_grad) trainable_percentage = 100 * trainable_parameters / total_parameters info = ( f"Trainable model parameters: {trainable_parameters}", f"Total model parameters: {total_parameters}", f"Percentage of trainable model parameters: {trainable_percentage:.2f}%" ) return '\n'.join(info) print(get_model_parameters_info(original_model))Trainable model parameters: 247577856 Total model parameters: 247577856 Percentage of trainable model parameters: 100.00%Harnessing PEFT for EfficiencyIn the fine-tuning journey, we seek methods that boost efficiency without sacrificing performance. 
This brings us to PEFT (Parameter-Efficient Fine-Tuning) and its secret weapon, LORA (Low-Rank Adaptation). LORA smartly adapts a model to new tasks with minimal parameter adjustments, offering a cost-effective solution in computational terms.In the code block that follows, we're initializing LORA's configuration. Key parameters to note include:r: The rank of the low-rank decomposition, which influences the number of adaptable parameters.lora_alpha: A scaling factor determining the initial magnitude of the LORA parameters.target_modules: The neural network components we wish to reparameterize. Here, we're targeting the "q" (query) and "v" (value) modules in the transformer's attention mechanism.lora_dropout:  A regularising dropout is applied to the LORA parameters to prevent overfitting.bias: Specifies the nature of the bias term in the reparameterization. Setting it to "none" means no bias will be added.task_type: Signifying the type of task for which we're employing LORA. In our case, it's sequence-to-sequence language modelling, perfectly aligned with our Flan-T5's capabilities.Using the get_peft_model function, we integrate the LORA configuration into our Flan-T5 model. Now, let's see how this affects the trainable parameters: peft_model = get_peft_model(original_model, lora_config) print(get_model_parameters_info(peft_model))Trainable model parameters: 3538944 Total model parameters: 251116800 Percentage of trainable model parameters: 1.41%Preparing for model training requires setting specific parameters. Directory choice, learning rate, logging frequency, and epochs are vital. A unique output directory segregates results from different training runs, enabling comparison. Our high learning rate signifies aggressive fine-tuning, while allocating 100 epochs ensures ample adaptation time for the model. With these settings, we're poised to initiate the trainer and embark on the training journey.# Set the output directory with a unique name using a timestamp output_dir = f'peft-dialogue-summary-training-{str(int(time.time()))}' # Define the training arguments for PEFT model training peft_training_args = TrainingArguments( output_dir=output_dir, auto_find_batch_size=True, # Automatically find an optimal batch size learning_rate=1e-3, # Use a higher learning rate for fine-tuning num_train_epochs=10, # Set the number of training epochs logging_steps=1000, # Log every 500 steps for more frequent logging max_steps=-1 # Let the number of steps be determined by epochs and dataset size ) # Initialise the trainer with PEFT model and training arguments peft_trainer = Trainer( model=peft_model, args=peft_training_args, train_dataset=formatted_datasets["train"], ) Let the learning begin!peft_trainer.train()To evaluate our models, we'll compare their summaries to a human baseline from our dataset using a `prompt`. With the original and PEFT-enhanced Flan-T5 models, we'll create summaries and contrast them with the human version, revealing AI accuracy and the best-performing model in our summary contest.def generate_summary(model, tokenizer, dialogue, prompt): """ Generate summary for a given dialogue and model. 
""" input_text = prompt + dialogue input_ids = tokenizer(input_text, return_tensors="pt").input_ids input_ids = input_ids.to(device) output_ids = model.generate(input_ids=input_ids, max_length=200, num_beams=1, early_stopping=True) return tokenizer.decode(output_ids[0], skip_special_tokens=True) index = 270 dialogue = dataset['test'][index]['dialogue'] human_baseline_summary = dataset['test'][index]['summary'] prompt = "Summarise the following conversation:\n\n" # Generate summaries original_summary = generate_summary(original_model, tokenizer, dialogue, prompt) peft_summary = generate_summary(peft_model, tokenizer, dialogue, prompt) # Print summaries print_output('BASELINE HUMAN SUMMARY:', human_baseline_summary) print_output('ORIGINAL MODEL:', original_summary) print_output('PEFT MODEL:', peft_summary)And the output:----------------------------------------------------------------------- BASELINE HUMAN SUMMARY:: #Person1# and #Person1#'s mother are preparing the fruits they are going to take to the picnic. ----------------------------------------------------------------------- ORIGINAL MODEL:: #Person1# asks #Person2# to take some fruit for the picnic. #Person2# suggests taking grapes or apples.. ----------------------------------------------------------------------- PEFT MODEL:: Mom and Dad are going to the picnic. Mom will take the grapes and the oranges and take the oranges.To assess our summarization models, we use the subset of the test dataset. We'll compare the summaries to human-created baselines. Using batch processing for efficiency, dialogues are processed in set group sizes. After processing, all summaries are compiled into a DataFrame for structured comparison and analysis. Below is the Python code for this experiment.dialogues = dataset['test'][0:20]['dialogue'] human_baseline_summaries = dataset['test'][0:20]['summary'] original_model_summaries = [] peft_model_summaries = [] for dialogue in dialogues:    prompt = "Summarize the following conversation:\n\n"      original_summary = generate_summary(original_model, tokenizer, dialogue, prompt)       peft_summary = generate_summary(peft_model, tokenizer, dialogue, prompt)      original_model_summaries.append(original_summary)    peft_model_summaries.append(peft_summary) df = pd.DataFrame({    'human_baseline_summaries': human_baseline_summaries,    'original_model_summaries': original_model_summaries,    'peft_model_summaries': peft_model_summaries }) dfTo evaluate our PEFT model's summaries, we use the ROUGE metric, a common summarization tool. ROUGE measures the overlap between predicted summaries and human references, showing how effectively our models capture key details. 
The Python code for this evaluation is:rouge = evaluate.load('rouge') original_model_results = rouge.compute( predictions=original_model_summaries, references=human_baseline_summaries[0:len(original_model_summaries)], use_aggregator=True, use_stemmer=True, ) peft_model_results = rouge.compute( predictions=peft_model_summaries, references=human_baseline_summaries[0:len(peft_model_summaries)], use_aggregator=True, use_stemmer=True, ) print('ORIGINAL MODEL:') print(original_model_results) print('PEFT MODEL:') print(peft_model_results) Here is the output:ORIGINAL MODEL: {'rouge1': 0.3870781853986991, 'rouge2': 0.13125454660387353, 'rougeL': 0.2891907205395029, 'rougeLsum': 0.29030342767482775} INSTRUCT MODEL: {'rouge1': 0.3719168722187023, 'rouge2': 0.11574429294744135, 'rougeL': 0.2739614480462256, 'rougeLsum': 0.2751489358330983} PEFT MODEL: {'rouge1': 0.3774164144865605, 'rouge2': 0.13204737323990984, 'rougeL': 0.3030487123408395, 'rougeLsum': 0.30499897454317104}Upon examining the results, it's evident that the original model shines with the highest ROUGE-1 score, adeptly capturing crucial standalone terms. On the other hand, the PEFT Model wears the crown for both ROUGE-L and ROUGE-Lsum metrics. This implies the PEFT Model excels in crafting summaries that string together longer, coherent sequences echoing those in the reference summaries.ConclusionWrapping it all up, in this post we delved deep into the nuances of fine-tuning Large Language Models, particularly spotlighting the prowess of FLAN T5. Through our hands-on venture into the dialogue summarization task, we discerned the intricate dance between capturing individual terms and weaving them into a coherent narrative. While the original model exhibited an impressive knack for highlighting key terms, the PEFT Model emerged as the maestro in crafting flowing, meaningful sequences.It's clear that in the grand arena of language models, knowing the notes is just the beginning; it's how you orchestrate them that creates the magic. Harnessing the techniques illuminated in this post, you too can fine-tune your chosen LLM, crafting your linguistic symphonies with finesse and flair. Here's to you becoming the maestro of your own linguistic ensemble!Author BioAmita Kapoor is an accomplished AI consultant and educator with over 25 years of experience. She has received international recognition for her work, including the DAAD fellowship and the Intel Developer Mesh AI Innovator Award. She is a highly respected scholar with over 100 research papers and several best-selling books on deep learning and AI. After teaching for 25 years at the University of Delhi, Amita retired early and turned her focus to democratizing AI education. She currently serves as a member of the Board of Directors for the non-profit Neuromatch Academy, fostering greater accessibility to knowledge and resources in the field. After her retirement, Amita founded NePeur, a company providing data analytics and AI consultancy services. In addition, she shares her expertise with a global audience by teaching online classes on data science and AI at the University of Oxford. 

Text Classification with Transformers

Saeed Dehqan
28 Aug 2023
9 min read
IntroductionThis blog aims to implement binary text classification using a transformer architecture. If you're new to transformers, the "Transformer Building Blocks" blog explains the architecture and its text generation implementation. Beyond text generation and translation, transformers serve classification, sentiment analysis, and speech recognition. The transformer model comprises two parts: an encoder and a decoder. The encoder extracts features, while the decoder processes them. Just as a painter with tree features can draw, describe, visualize, categorize, or write about a tree, transformers encode knowledge (encoder) and apply it (decoder). This dual-part process is pivotal for text classification with transformers, allowing them to excel in diverse tasks like sentiment analysis, illustrating their transformative role in NLP.Deep Dive into Text Classification with TransformersWe train the model on the IMDB dataset. The dataset is ready and there’s no preprocessing needed. The model is vocab-based instead of character-based so that the model can converge faster. I limited the dataset vocabs to the 20000 most frequent vocabs. I also reduced the sequence to 200 so we can train faster. I tried to simplify the model and use torch.nn.MultiheadAttention it instead of writing the Multihead-attention ourselves. It makes the model faster since the nn.MultiheadAttention uses scaled_dot_product_attention under the hood. But if you want to know how MultiheadAttention works you can study the transformer building blocks blog or see the code here.Okay, now, let us add the feature extractor part:class transformer_block(nn.Module): def __init__(self):     super(block, self).__init__()     self.attention = nn.MultiheadAttention(embeds_size, num_heads, batch_first=True)     self.ffn = nn.Sequential(         nn.Linear(embeds_size, 4 * embeds_size),         nn.LeakyReLU(),         nn.Linear(4 * embeds_size, embeds_size),     )     self.drop1 = nn.Dropout(drop_prob)     self.drop2 = nn.Dropout(drop_prob)     self.ln1 = nn.LayerNorm(embeds_size, eps=1e-6)     self.ln2 = nn.LayerNorm(embeds_size, eps=1e-6) def forward(self, hidden_state):     attn, _ = self.attention(hidden_state, hidden_state, hidden_state, need_weights=False)     attn = self.drop1(attn)     out = self.ln1(hidden_state + attn)     observed = self.ffn(out)     observed = self.drop2(observed)     return self.ln2(out + observed)●    hidden_state: A tensor with a shape (batch_size, block_size, embeds_size) goes to the transformer_block and a tensor with the same shape goes out of it.●    self.attention: The transformer block tries to combine the information of tokens so that each token is aware of its neighbors or other tokens in the context. We may call this part the communication part. That’s what the nn.MultiheadAttention does. nn.MultiheadAttention is a ready multihead attention layer that can be faster than implementing it from scratch, just like what we did in the “Transformer Building Blocks” blog. The parameters of nn.MultiheadAttention are as follows:     ○    embeds_size: token embedding size     ○    num_heads: multihead, as the name suggests, consists of multiple heads and each head works on different parts of token embeddings. Suppose, your input data has shape (B,T,C) = (10, 32, 16). The token embedding size for this data is 16. If we specify the num_heads parameter to 2(divisible by 16), the multi-head splits data into two parts with shape (10, 32, 8). 
The first head works on the first part and the second head works on the second part. This is because transforming data into different subspaces can help the model to see different aspects of the data. Please note that the num_heads should be divisible by the embedding size so that at the end we can concatenate the split parts.    ○    batch_first: True means the first dimension is batch.●    Dropout: After the attention layer, the communication between tokens is closed and computations on tokens are done individually. We run a dropout on tokens. Dropout is a method of regularization. Regularization helps the training process to be based on generalization, not memorization. Without regularization, the model tries to memorize the training set and has poor performance on the test set. The dropout method turns off features with a probability of drop_prob.●    self.ln1: Layer normalization normalizes embeddings so that they have zero mean and standard deviation one.●    Residual connection: hidden_state + attn: Observe that before normalization, we added the input to the output of multihead attention, named residual connection. It has two benefits:   ○    It helps the model to have the unchanged embedding information.   ○    It helps to prevent gradient vanishing, which is common in deep networks where we stack multiple transformer layers.●    self.ffn: After dropout, residual connection, and normalization, we forward data into a simple non-linear neural network to adjust the tokens one by one for better representation.●    self.ln2(out + observed): Finally, another dropout, residual connection, and layer normalization.The transformer block is ready. And here is the final piece:class transformer(nn.Module): def __init__(self):     super(transformer, self).__init__()     self.tok_embs = nn.Embedding(vocab_size, embeds_size)     self.pos_embs = nn.Embedding(block_size, embeds_size)     self.block = block()     self.ln1 = nn.LayerNorm(embeds_size)     self.ln2 = nn.LayerNorm(embeds_size)     self.classifier_head = nn.Sequential(         nn.Linear(embeds_size, embeds_size),         nn.LeakyReLU(),         nn.Dropout(drop_prob),         nn.Linear(embeds_size, embeds_size),         nn.LeakyReLU(),         nn.Linear(embeds_size, num_classes),         nn.Softmax(dim=1),     )     print("number of parameters: %.2fM" % (self.num_params()/1e6,)) def num_params(self):     n_params = sum(p.numel() for p in self.parameters())     return n_params def forward(self, seq):     B,T = seq.shape     embedded = self.tok_embs(seq)     embedded = embedded + self.pos_embs(torch.arange(T, device=device))     output = self.block(embedded)     output = output.mean(dim=1)     output = self.classifier_head(output)     return output●    self.tok_embs: nn.Embedding is like a lookup table that receives a sequence of indices, and returns their corresponding embeddings. These embeddings will receive gradients so that the model can update them to make better predictions.●    self.tok_embs: To comprehend a sentence, you not only need words, you also need to have the order of words. Here, we embed positions and add them to the token embeddings. In this way, the model has both words and their order.●    self.block: In this model, we only use one transformer block, but you can stack more blocks to get better results.●    self.classifier_head: This is where we put the extracted information into action to classify the sequence. We call it the transformer head. It receives a fixed-size vector and classifies the sequence. 
The softmax as the final activation function returns a probability distribution for each class.●    self.tok_embs(seq): Given a sequence of indices (batch_size, block_size), it returns (batch_size, block_size, embeds_size).●    self.pos_embs(torch.arange(T, device=device)): Given a sequence of positions, i.e. [0,1,2], it returns embeddings of each position. Then, we add them to the token embeddings.●    self.block(embedded): The embedding goes to the transformer block to extract features. Given the embedded shape (batch_size, block_size, embeds_size), the output has the same shape (batch_size, block_size, embeds_size).●    output.mean(dim=1): The purpose of using mean is to aggregate the information from the sequence into a compact representation before feeding it into self.classifier_head. It helps in reducing the spatial dimensionality and extracting the most important features from the sequence. Given the input shape (batch_size, block_size, embeds_size), the output shape is (batch_size, embeds_size). So, one fixed-size vector for each batch.●    self.classifier_head(output): And here we classify.The final code can be found here. The remaining code consists of downstream tasks such as the training loop, loading the dataset, setting the hyperparameters, and optimizer. I used RMSprop instead of Adam and AdamW. I also used BCEWithLogitsLoss instead of cross-entropy loss. BCE(Binary Cross Entropy) is for binary classification models and it combines sigmoid with cross entropy and it is numerically more stable. I also empirically got better accuracy. After 30 epochs, the final accuracy is ~84%.ConclusionThis exploration of text classification using transformers reveals their revolutionary potential. Beyond text generation, transformers excel in sentiment analysis. The encoder-decoder model, analogous to a painter interpreting tree feature, propels efficient text classification. A streamlined practical approach and the meticulously crafted transformer block enhance the architecture's robustness. Through optimization methods and loss functions, the model is honed, yielding an empirically validated 84% accuracy after 30 epochs. This journey highlights transformers' disruptive impact on reshaping AI-driven language comprehension, fundamentally altering the landscape of Natural Language Processing.Author BioSaeed Dehqan trains language models from scratch. Currently, his work is centered around Language Models for text generation, and he possesses a strong understanding of the underlying concepts of neural networks. He is proficient in using optimizers such as genetic algorithms to fine-tune network hyperparameters and has experience with neural architecture search (NAS) by using reinforcement learning (RL). He implements models starting from data gathering to monitoring, and deployment on mobile, web, cloud, etc. 
Transformer Building Blocks

Saeed Dehqan
29 Aug 2023
22 min read
IntroductionTransformers employ potent techniques to preprocess tokens before sequentially inputting them into a neural network, aiding in the selection of the next token. At the transformer's apex is a basic neural network, the transformer head. The text generator model processes input tokens and generates a probability distribution for subsequent tokens. Context length, termed context length or block size, is recognized as a hyperparameter denoting input token count. The model's primary aim is to predict the next token based on input tokens (referred to as context tokens or context windows). Our goal with n tokens is to predict the subsequent fitting token following previous ones. Thus, we rely on these n tokens to anticipate the next. As humans, we attempt to grasp the conversation's context - our location and a loose foresight of the path's culmination. Upon gathering pertinent insights, relevant words emerge, while irrelevant ones fade, enabling us to choose the next word with precision. We occasionally err but backtrack, a luxury transformers lack. If they incorrectly predict (an irrelevant token), they persist, though exceptions exist, like beam search. Unlike us, transformers can't forecast. Revisiting n prior tokens, our human assessment involves inspecting them individually, and discerning relationships from diverse angles. By prioritizing pivotal tokens and disregarding superfluous ones, we evaluate tokens within various contexts. We scrutinize all n prior tokens individually, ready to prognosticate. This embodies the essence of the multihead attention mechanism in transformers. Consider a context window with 5 tokens. Each wears a distinct mask, predicting its respective next token:"To discern the void amidst, we must first grasp the fullness within." To understand what token is lacking, we must first identify what we are and possess. We need communication between tokens since tokens don’t know each other yet and in order to predict their own next token, they first need to know each other well and pair together in such a way that tokens with similar characteristics stay near each other (technically having similar vectors). Each token has three vectors that represent:●    What tokens they are looking for (known as query)●    What they really have (known as key)●    What they are (known as value)Each token with its query starts looking for similar keys, finds each other, and starts to know one another by adding up their values:Similar tokens find each other and if a token is somehow dissimilar, here Token 4, other tokens don’t consider it much. But please note that every token (much or less) has its own effect on other tokens. Also, in self-attention, all tokens ask all other tokens with their query and keys to find familiar tokens, but not the future tokens, named masked self-attention. We prohibit tokens from communicating to future tokens. After exchanging information between tokens and mixing up their values, similar tokens become more similar:As you can see, the color of similar tokens becomes more similar(in action, their vectors become more similar). Since tokens in the group wear a mask, we cannot access the true tokens’ values. We just know and distinguish them from their mask(value). This is because every token has different characteristics in different contexts, and they don’t show their true essence.So far so good; we have finished the self-attention process and now, the group is ready to predict their next tokens. 
This is because individuals are aware of each other very well, and as a result, they can guess the next token better. Now, each token separately needs to go to a nonlinear network and then to the transformer head, to predict its own next token. We ask each one of the tokens separately to tell their opinion about the probability of what token comes next. Finally, we collect the probability distributions of all tokens in the context window. A probability distribution sums up to 100, or actually in action to 1. We give probability to every token the model has in its vocabulary. The simplest method to extract the next token from probability distributions is to select the one with the highest probability:As you can see, each token goes to the neural network and the network returns a probability distribution. The result is the following sentence: “It looks like a bug”.Voila! We managed to go through a simple Transformer model.Let’s recap everything we’ve said. A transformer receives n tokens as input, does some stuff (like self-attention, layer normalization, etc.) and feed-forward them into a neural network to get probability distributions of the next token. Each token goes to the neural network separately; if the number of tokens is 10, there are 10 probability distributions.At this point, you know intuitively how the main building blocks of a transformer work. But let us better understand them by implementing a transformer model.Clone the repository tiny-transformer:git clone https://github.com/saeeddhqan/tiny-transformerExecute simple_model.py in the repository If you simply want to run the model for training.Create a new file, and import the necessary modules:import math import torch import torch.nn as nn import torch.nn.functional as F Load the dataset and write the tokenizer: with open('shakespeare.txt') as fp: text = fp.read() chars = sorted(list(set(text))) vocab_size = len(chars) stoi = {c:i for i,c in enumerate(chars)} itos = {i:c for c,i in stoi.items()} encode = lambda s: [stoi[x] for x in s] decode = lambda e: ''.join([itos[x] for x in e])●    Open the dataset, and define a variable that is a list of all unique characters in the text.●    The set function splits the text character by character and then removes duplicates, just like sets in set theory. list(set(myvar)) is a way of removing duplicates in a list or string.●    vocab_size is the number of unique characters (here 65). ●    stoi is a dictionary where its keys are characters and values are their indices.●    itos is used to convert indices to characters. ●    encode function receives a string and returns indices of characters. ●    decode receives a list of indices and returns a string. 
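As a quick sanity check of the tokenizer, encoding a string and then decoding it back should reproduce the original text. A small sketch (the concrete indices depend on the character set of shakespeare.txt) could look like this:

# Round-trip check for the character-level tokenizer defined above.
sample = "To be, or not to be"
ids = encode(sample)          # one integer index per character
restored = decode(ids)        # back to the original string

print("vocab size:", vocab_size)              # 65 for the Shakespeare dataset
print("first few ids:", ids[:8])              # values depend on the corpus
print("round trip ok:", restored == sample)   # expected: True

Because the mapping is purely character level, decode(encode(s)) returns s unchanged for any string whose characters all appear in the corpus; a character outside the corpus would raise a KeyError inside encode.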
Split the dataset into test and train and write a function that returns data for training:device = 'cuda' if torch.cuda.is_available() else 'cpu' torch.manual_seed(1234) data = torch.tensor(encode(text), dtype=torch.long).to(device) train_split = int(0.9 * len(data)) train_data = data[:train_split] test_data = data[train_split:] def get_batch(split='train', block_size=16, batch_size=1) -> 'Create a random batch and returns batch along with targets': data = train_data if split == 'train' else test_data ix = torch.randint(len(data) - block_size, (batch_size,)) x = torch.stack([data[i:i + block_size] for i in ix]) y = torch.stack([data[i+1:i + block_size + 1] for i in ix]) return x, y●    Choose a suitable device.●     Set a seed to make the training reproducible.●    Convert the text into a large list of indices with the encode function.●    Since the character indices are integer, we use torch.long data type to make the data suitable for the model. ●    90% for training and 10% for testing.●    If the batch_size is 10, we select 10 chunks or sequences from the dataset and stack them up to process them simultaneously. ●    If the batch_size is 1, get_batch function selects 1 random chunk (n consequence characters) from the dataset and returns x and y, where x is 16 characters’ indices and y is the target characters for x.The shape, value, and decoded version of the selected chunk are as follows:shape x: torch.Size([1, 16]) shape y: torch.Size([1, 16]) value x: tensor([[41, 43, 6, 1, 60, 47, 50, 50, 39, 47, 52, 2, 1, 52, 43, 60]]) value y: tensor([[43, 6, 1, 60, 47, 50, 50, 39, 47, 52, 2, 1, 52, 43, 60, 43]]) decoded x: ce, villain! nev decoded y: e, villain! neveWe usually process multiple chunks or sequences at once with batching in order to speed up the training. For each character, we have an equivalent target, which is its next token. The target for ‘c’ is ‘e’, for ‘e’ is ‘,’, for ‘v’ is ‘i’, and so on. Let us talk a bit about the input shape and output shape of tensors in a transformer model. The model receives a list of token indices like the above(named a sequence, or chunk) and maps them into their corresponding vectors.●    The input shape is (batch_size, block_size).●    After mapping indices into vectors, the data shape becomes (batch_size, block_size, embed_size).●    Then, through the multihead attention and feed-forward layers, the data shape does not change.●    Finally, the data with shape (batch_size, block_size, embed_size) goes to the transformer head (a simple neural network) and the output shape becomes (batch_size, block_size, vocab_size). vocab_size is the number of unique characters that can come next (for the Shakespeare dataset, the number of unique characters is 65).Self-attentionThe communication between tokens happens in the head class; we define the scores variable to save the similarity between vectors. The higher the score is, the more two vectors have in common. We then utilize these scores to do a weighted sum of all the vectors: class head(nn.Module): def __init__(self, embeds_size=32, block_size=16, head_size=8):     super().__init__()     self.key = nn.Linear(embeds_size, head_size, bias=False)     self.query = nn.Linear(embeds_size, head_size, bias=False)     self.value = nn.Linear(embeds_size, head_size, bias=False)     self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))     self.dropout = nn.Dropout(0.1) def forward(self, x):     B,T,C = x.shape     # What am I looking for?     q = self.query(x)     # What do I have?     
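    # (q above and k, v below are linear projections of x: with the defaults
    #  embeds_size=32 and head_size=8, x is (B, T, 32) while q, k, v are each (B, T, 8))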
k = self.key(x)     # What is the representation value of me?     # Or: what's my personality in the group?     # Or: what mask do I have when I'm in a group?     v = self.value(x)     scores = q @ k.transpose(-2,-1) * (1 / math.sqrt(C)) # (B,T,head_size) @ (B,head_size,T) --> (B,T,T)     scores = scores.masked_fill(self.tril[:T, :T] == 0, float('-inf'))     scores = F.softmax(scores, dim=-1)     scores = self.dropout(scores)     out = scores @ v     return outUse three linear layers to transform the vector into key, query, and value, but with a smaller dimension (here same as head_size).Q, K, V: Q and K are for when we want to find similar tokens. We calculate the similarity between vectors with a dot product: q @ k.transpose(-2, -1). The shape of scores is (batch_size, block_size, block_size), which means we have the similarity scores between all the vectors in the block. V is used when we want to do the weighted sum. Scores: Pure dot product scores tend to have very high numbers that are not suitable for softmax since it makes the scores dense. Therefore, we rescale the results with a ratio of (1 / math.sqrt(C)). C is the embedding size. We call this a scaled dot product.Register_buffer: We used register_buffer to register a lower triangular tensor. In this way, when you save and load the model, this tensor also becomes part of the model.Masking: After calculating the scores, we need to replace future scores with -inf to shut them off so that the vectors do not have access to the future tokens. By doing so, these scores effectively become zero after applying the softmax function, resulting in a probability of zero for the future tokens. This process is referred to as masking. Here’s an example of masked scores with a block size 4:[[-0.1710, -inf, -inf, -inf], [ 0.2007, -0.0878, -inf, -inf], [-0.0405, 0.2913, 0.0445, -inf], [ 0.1328, -0.2244, 0.0796, 0.1719]]Softmax: It converts a vector into a probability distribution that sums up to 1. Here’s the scores after softmax:      [[1.0000, 0.0000, 0.0000, 0.0000],      [0.5716, 0.4284, 0.0000, 0.0000],      [0.2872, 0.4002, 0.3127, 0.0000],      [0.2712, 0.1897, 0.2571, 0.2820]]The scores of future tokens are zero; after doing a weighted sum, the future vectors become zero and the vectors receive none data from future vectors(n*0=0)Dropout: Dropout is a regularization technique. It drops some of the numbers in vectors randomly. Dropout helps the model to generalize, not memorize the dataset. We don’t want the model to memorize the Shakespeare model, right? We want it to create new texts like the dataset.Weighted sum: Weighted sum is used to combine different representations or embeddings based on their importance. The scores are calculated by measuring the relevance or similarity between each pair of vectors. The relevance scores are obtained by applying a scaled dot product between the query and key vectors, which are learned during the training process. The resulting weighted sum emphasizes the more important elements and reduces the influence of less relevant ones, allowing the model to focus on the most salient information. We dot product scores with values and the result is the outcome of self-attention.Output: since the embedding size and head size are 32 and 8 respectively, if the input shape is (batch_size, block_size, 32), the output has the shape of (batch_size, block_size, 8).Multihead self-attention“I have multiple personalities(v), tendencies and needs (q), and valuable things (k) in different spaces”. 
Vectors said.We transform the vectors into small dimensions, and then run self-attention on them; we did this in the previous class. In multihead self-attention, we call the head class four times, and then, concatenate the smaller vectors to have the same input shape. We call this multihead self-attention. For instance, if the shape of input data is (1, 16, 32), we transform it into four (1, 16, 8) tensors and run self-attention on these tensors. Why four times? 4 * 8 = initial shape. By using multihead self-attention and running self-attention multiple times, what we do is consider the different aspects of vectors in different spaces. That’s all!Here is the code:class multihead(nn.Module): def __init__(self, num_heads=4, head_size=8):     super().__init__()     self.multihead = nn.ModuleList([head(head_size) for _ in range(num_heads)])     self.output_linear = nn.Linear(embeds_size, embeds_size)     self.dropout = nn.Dropout(0.1) def forward(self, hidden_state):     hidden_state = torch.cat([head(hidden_state) for head in self.multihead], dim=-1)     hidden_state = self.output_linear(hidden_state)     hidden_state = self.dropout(hidden_state)     return hidden_state●    self.multihead: The variable creates four heads and we do this with nn.ModuleList.●    self.output_linear: Another transformer linear layer we apply at the end of the multihead self-attention process.●    self.dropout: Using dropout on the final results.●    hidden_state 1: Concatenating the output of heads so that we have the same shape as input. Heads transform data into different spaces with smaller dimensions, and then do the self-attention.●    hidden_state 2: After doing communication between tokens with self-attention, we use the self.output_linear projector to let the model adjust vectors further based on the gradients that flow through the layer.●    dropout: Run dropout on the output of the projection with a 10% probability of turning off values (make them zero) in the vectors.Transformer blockThere are two new techniques, including layer normalization and residual connection, that need to be explained:class transformer_block(nn.Module): def __init__(self, embeds_size=32, num_heads=8):     super().__init__()     self.head_count = embeds_size // num_heads     self.n_heads = multihead(num_heads, self.head_count)     self.ffn = nn.Sequential(         nn.Linear(embeds_size, 4 * embeds_size),         nn.ReLU(),         nn.Linear(4 * embeds_size, embeds_size),         nn.Dropout(drop_prob),     )     self.ln1 = nn.LayerNorm(embeds_size)     self.ln2 = nn.LayerNorm(embeds_size) def forward(self, hidden_state):     hidden_state = hidden_state + self.n_heads(self.ln1(hidden_state))     hidden_state = hidden_state + self.ffn(self.ln2(hidden_state))     return hidden_state self.head_count: Calculates the head size. The number of heads should be divisible by the embedding size so that we can concatenate the output of heads.self.n_heads: The multihead self-attention layer. self.ffn: This is the first time that we have non-linearity in our model. Non-linearity helps the model to capture complex relationships and patterns in the data. By introducing non-linearity through ReLU activation functions, or GLUE, the model can make a correlation for the data. As a result, it better models the intricacies of the input data. Non-linearity is like “you go to the next layer”, “You don’t go to the next layer”, or “Create y from x for the next layer”. The recommended hidden layer size is a number four times bigger than the embedding size. 
That’s why “4 * embeds_size”. You can also try SwiGLU as the activation function instead of ReLU.self.ln1 and self.ln2: Layer normalizers make the model more robust and they also help the model to converge faster. Layer normalization rescales the data in such a way that the mean is zero and the standard deviation is one. hidden_state 1: Normalize the vectors with self.ln1 and forward the vectors to the multihead attention. Next, we add the input to the output of multihead attention. It helps the model in two ways:○    First, the model has some information from the original vectors. ○    Second, when the model becomes deep, during backpropagation, the gradients will be weak for earlier layers and the model will converge too slowly. We recognize this effect as gradient vanishing. Adding the input helps to enrich the gradients and mitigate the gradient vanishing. We recognize it as a residual connection.hidden_state 2: Hidden_state 1 goes to a layer normalization and then to a nonlinear network. The output will be added to the hidden state with the aim of keeping gradients for all layers.The modelAll the necessary parts are ready, let us stack them up to make the full model:class transformer(nn.Module): def __init__(self):     super().__init__()     self.stack = nn.ModuleDict(dict(         tok_embs=nn.Embedding(vocab_size, embeds_size),         pos_embs=nn.Embedding(block_size, embeds_size),         dropout=nn.Dropout(drop_prob),         blocks=nn.Sequential(             transformer_block(),             transformer_block(),             transformer_block(),             transformer_block(),             transformer_block(),         ),         ln=nn.LayerNorm(embeds_size),         lm_head=nn.Linear(embeds_size, vocab_size),     ))●    self.stack: A list of all necessary layers.●    tok_embs: This is a learnable lookup table that receives a list of indices and returns their vectors.●    pos_embs: Just like tok_embs, it is also a learnable look-up table, but for positional embedding. It receives a list of positions and returns their vectors.●    dropout: Dropout layer.●    blocks: We create multiple transformer blocks sequentially.●    ln: A layer normalization.●    lm_heas: Transformer head receives a token and returns probabilities of the next token. To change the model to be a classifier, or a sentimental analysis model, we just need to change this layer and remove masking from the self-attention layer.The forward method of the transformer class:    def forward(self, seq, targets=None):     B, T = seq.shape     tok_emb = self.stack.tok_embs(seq) # (batch, block_size, embed_dim) (B,T,C)     pos_emb = self.stack.pos_embs(torch.arange(T, device=device))     x = tok_emb + pos_emb     x = self.stack.dropout(x)     x = self.stack.blocks(x)     x = self.stack.ln(x)     logits = self.stack.lm_head(x) # (B, block_size, vocab_size)     if targets is None:         loss = None     else:         B, T, C = logits.shape         logits = logits.view(B * T, C)         targets = targets.view(B * T)         loss = F.cross_entropy(logits, targets)     return logits, loss●  tok_emb: Convert token indices into vectors. Given the input (B, T), the output is (B, T, C), where C is the embeds_size.●  pos_emb: Given the number of tokens in the context window or block_size, it returns the positional embedding of each position.●  x 1: Add up token embeddings and position embeddings. 
A little bit lossy but it works just fine.●  x 2: Run dropout on embeddings.●  x 3: The embeddings go through all the transformer blocks, and multihead self-attention. The input is (B, T, C) and the output is (B, T, C).●  x 4: The outcome of transformer blocks goes to the layer normalization.●  logits: We usually recognize the unnormalized values extracted from the language model head as logits :)●  if-else block: Were the targets specified, we calculated cross-entropy loss. Otherwise, the loss will be None. Before calculating loss in the else block, we change the shape as the cross entropy function expects.●  Output: The method returns logits with shape (batch_size, block_size, vocab_size) and loss if any.For generating a text, add this to the transformer class:    def autocomplete(self, seq, _len=10):        for _ in range(_len):            seq_crop = seq[:, -block_size:] # crop it            logits, _ = self(seq_crop)            logits = logits[:, -1, :] # we care about the last token            probs = F.softmax(logits, dim=-1)            next_char = torch.multinomial(probs, num_samples=1)            seq = torch.cat((seq, next_char), dim=1)        return seq●  autocomplete: Given a tokenized sequence, and the number of tokens that need to be created, this method returns _len tokens.●  seq_crop: Select the last n tokens in the sequence to give it to the model. The sequence length might be larger than the block_size and it causes an error if we don’t crop it.●  logits 1: Forward the sequence into the model to receive the logits.●  logits 2: Select the last logit that will be used to select the next token.●  probs: Run the softmax on logits to get a probability distribution.●  next_char: Multinomial selects one sample from the probs. The higher the probability of a token, the higher the chance of being selected.●  seq: Add the selected character to the sequence.TrainingThe rest of the code is downstream tasks such as training loops, etc. The codes that are provided here are slightly different from the tiny-transformer repository. I trained the model with the following hyperparameters:block_size = 256 learning_rate = 9e-4 eval_interval = 300 # Every n step, we do an evaluation. iterations = 5000 # Like epochs batch_size = 64 embeds_size = 195 num_heads = 5 num_layers = 5 drop_prob = 0.15And here’s the generated text:If you need to improve the quality, increase embeds_size, num_layers, and heads.ConclusionThe article explores transformers' text generation role, detailing token preprocessing through self-attention and neural network heads. Transformers predict tokens using context length as a hyperparameter. Human context comprehension is paralleled, highlighting relevant word emergence and fading of irrelevant words for precise selection. Transformers lack human foresight and backtracking. Key components—self-attention, multihead self-attention, and transformer blocks—are explained, and supported by code snippets. Token and positional embeddings, layer normalization, and residual connections are detailed. The model's text generation is exemplified via the autocomplete method. Training parameters and text quality enhancement are addressed, showcasing transformers' potential.Author BioSaeed Dehqan trains language models from scratch. Currently, his work is centered around Language Models for text generation, and he possesses a strong understanding of the underlying concepts of neural networks. 
He is proficient in using optimizers such as genetic algorithms to fine-tune network hyperparameters and has experience with neural architecture search (NAS) by using reinforcement learning (RL). He implements models starting from data gathering to monitoring, and deployment on mobile, web, cloud, etc. 
LLM-powered Chatbots for Financial Queries

Alan Bernardo Palacio
12 Sep 2023
27 min read
IntroductionIn the ever-evolving realm of digital finance, the convergence of user-centric design and cutting-edge AI technologies is pushing the boundaries of innovation. But with the influx of data and queries, how can we better serve users and provide instantaneous, accurate insights? Within this context, Large Language Models (LLMs) have emerged as a revolutionary tool, providing businesses and developers with powerful capabilities. This hands-on article will walk you through the process of leveraging LLMs to create a chatbot that can query real-time financial data extracted from the NYSE to address users' queries in real time about the current market state. We will dive into the world of LLMs, explore their potential, and understand how they seamlessly integrate with databases using LangChain. Furthermore, we'll fetch real-time data using the finance package offering the chatbot the ability to answer questions using current data.In this comprehensive tutorial, you'll gain proficiency in diverse aspects of modern software development. You'll first delve into the realm of database interactions, mastering the setup and manipulation of a MySQL database to store essential financial ticker data. Unveil the intricate synergy between Large Language Models (LLMs) and SQL through the innovative LangChain tool, which empowers you to bridge natural language understanding and database operations seamlessly. Moving forward, you'll explore the dynamic fusion of Streamlit and LLMs as you demystify the mechanics behind crafting a user-friendly front. Witness the transformation of your interface using OpenAI's Davinci model, enhancing user engagement with its profound knowledge. As your journey progresses, you'll embrace the realm of containerization, ensuring your application's agility and scalability by harnessing the power of Docker. Grasp the nuances of constructing a potent Dockerfile and orchestrating dependencies, solidifying your grasp on holistic software development practices.By the end of this guide, readers will be equipped with the knowledge to design, implement, and deploy an intelligent, finance-focused chatbot. This isn't just about blending frontend and backend technologies; it's about crafting a digital assistant ready to revolutionize the way users interact with financial data. Let's dive in!The Power of Containerization with Docker ComposeNavigating the intricacies of modern software deployment is simplified through the strategic implementation of Docker Compose. This orchestration tool plays a pivotal role in harmonizing multiple components within a local environment, ensuring they collaborate seamlessly.Docker allows to deploy multiple components seamlessly in a local environment. In our journey, we will use docker-compose to harmonize various components, including MySQL for data storage, a Python script as a data fetcher for financial insights, and a Streamlit-based web application that bridges the gap between the user and the chatbot.Our deployment landscape consists of several interconnected components, each contributing to the finesse of our intelligent chatbot. The cornerstone of this orchestration is the docker-compose.yml file, a blueprint that encapsulates the deployment architecture, coordinating the services to deliver a holistic user experience.With Docker Compose, we can efficiently and consistently deploy multiple interconnected services. 
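Throughout the walkthrough, the pieces are assumed to live in a small project tree along the lines of the following. The file and folder names come from the compose file and Dockerfiles shown below; your own layout may differ slightly.

.
├── docker-compose.yml
├── db/
│   └── setup.sql
├── ticker_fetcher/
│   ├── Dockerfile
│   ├── fetcher.py
│   └── requirements.txt
└── app/
    ├── Dockerfile
    ├── streamlit_app.py
    ├── utils.py
    └── requirements.txt

With these files in place, running docker compose up --build from the project root builds the two custom images and starts all three services.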
Let's dive into the structure of our docker-compose.yml:version: '3' services: db:    image: mysql:8.0    environment:      - MYSQL_ROOT_PASSWORD=root_password      - MYSQL_DATABASE=tickers_db      - MYSQL_USER=my_user   # Replace with your desired username      - MYSQL_PASSWORD=my_pass  # Replace with your desired password    volumes:      - ./db/setup.sql:/docker-entrypoint-initdb.d/setup.sql    ports:      - "3306:3306"  # Maps port 3306 in the container to port 3306 on the host ticker-fetcher:    image: ticker/python    build:      context: ./ticker_fetcher    depends_on:      - db    environment:      - DB_USER=my_user   # Must match the MYSQL_USER from above      - DB_PASSWORD=my_pass   # Must match the MYSQL_PASSWORD from above      - DB_NAME=tickers_db app:    build:      context: ./app    ports:      - 8501:8501    environment:      - OPENAI_API_KEY=${OPENAI_API_KEY}    depends_on:      - ticker-fetcherContained within the composition are three distinctive services:db: A MySQL database container, configured with environmental variables for establishing a secure and efficient connection. This container is the bedrock upon which our financial data repository, named tickers_db, is built. A volume is attached to import a setup SQL script, enabling rapid setup.ticker-fetcher: This service houses the heart of our real-time data acquisition system. Crafted around a custom Python image, it plays the crucial role of fetching the latest stock information from Yahoo Finance. It relies on the db service to persistently store the fetched data, ensuring that our chatbot's insights are real-time data.app: The crown jewel of our user interface is the Streamlit application, which bridges the gap between users and the chatbot. This container grants users access to OpenAI's LLM model. It harmonizes with the ticker-fetcher service to ensure that the data presented to users is not only insightful but also dynamic.Docker Compose's brilliance lies in its capacity to encapsulate these services within isolated, reproducible containers. While Docker inherently fosters isolation, Docker Compose takes it a step further by ensuring that each service plays its designated role in perfect sync.The docker-compose.yml configuration file serves as the conductor's baton, ensuring each service plays its part with precision and finesse. As you journey deeper into the deployment architecture, you'll uncover the intricate mechanisms powering the ticker-fetcher container, ensuring a continuous flow of fresh financial data. Through the lens of Docker Compose, the union of user-centric design, cutting-edge AI, and streamlined deployment becomes not just a vision, but a tangible reality poised to transform the way we interact with financial data.Enabling Real-Time Financial Data AcquisitionAt the core of our innovative architecture lies the pivotal component dedicated to real-time financial data acquisition. This essential module operates as the engine that drives our chatbot's ability to deliver up-to-the-minute insights from the ever-fluctuating financial landscape.Crafted as a dedicated Docker container, this module is powered by a Python script that through the yfinance package, retrieves the latest stock information directly from Yahoo Finance. 
The result is a continuous stream of the freshest financial intelligence, ensuring that our chatbot remains armed with the most current and accurate market data.Our Python script, fetcher.py, looks as follows:import os import time import yfinance as yf import mysql.connector import pandas_market_calendars as mcal import pandas as pd import traceback DB_USER = os.environ.get('DB_USER') DB_PASSWORD = os.environ.get('DB_PASSWORD') DB_NAME = 'tickers_db' DB_HOST = 'db' DB_PORT = 3306 def connect_to_db():    return mysql.connector.connect(        host=os.getenv("DB_HOST", "db"),        port=os.getenv("DB_PORT", 3306),        user=os.getenv("DB_USER"),        password=os.getenv("DB_PASSWORD"),        database=os.getenv("DB_NAME"),    ) def wait_for_db():    while True:        try:            conn = connect_to_db()            conn.close()            return        except mysql.connector.Error:            print("Unable to connect to the database. Retrying in 5 seconds...")            time.sleep(5) def is_market_open():    # Get the NYSE calendar    nyse = mcal.get_calendar('NYSE')    # Get the current timestamp and make it timezone-naive    now = pd.Timestamp.now(tz='UTC').tz_localize(None)    print("Now its:",now)    # Get the market open and close times for today    market_schedule = nyse.schedule(start_date=now, end_date=now)    # If the market isn't open at all today (e.g., a weekend or holiday)    if market_schedule.empty:        print('market is empty')        return False    # Today's schedule    print("Today's schedule")    # Check if the current time is within the trading hours    market_open = market_schedule.iloc[0]['market_open'].tz_localize(None)    market_close = market_schedule.iloc[0]['market_close'].tz_localize(None)    print("market_open",market_open)    print("market_close",market_close)    market_open_now = market_open <= now <= market_close    print("Is market open now:",market_open_now)    return market_open_now def chunks(lst, n):    """Yield successive n-sized chunks from lst."""    for i in range(0, len(lst), n):        yield lst[i:i + n] if __name__ == "__main__":    wait_for_db()    print("-"*50)    tickers = ["AAPL", "GOOGL"]  # Add or modify the tickers you want      print("Perform backfill once")    # historical_backfill(tickers)    data = yf.download(tickers, period="5d", interval="1m", group_by="ticker", timeout=10) # added timeout    print("Data fetched from yfinance.")    print("Head")    print(data.head().to_string())    print("Tail")    print(data.head().to_string())    print("-"*50)    print("Inserting data")    ticker_data = []    for ticker in tickers:        for idx, row in data[ticker].iterrows():            ticker_data.append({                'ticker': ticker,                'open': row['Open'],                'high': row['High'],                'low': row['Low'],                'close': row['Close'],                'volume': row['Volume'],                'datetime': idx.strftime('%Y-%m-%d %H:%M:%S')            })    # Insert data in bulk    batch_size=200    conn = connect_to_db()    cursor = conn.cursor()    # Create a placeholder SQL query    query = """INSERT INTO ticker_history (ticker, open, high, low, close, volume, datetime)               VALUES (%s, %s, %s, %s, %s, %s, %s)"""    # Convert the data into a list of tuples    data_tuples = []    for record in ticker_data:        for key, value in record.items():            if pd.isna(value):                record[key] = None        data_tuples.append((record['ticker'], record['open'], record['high'], 
record['low'],                            record['close'], record['volume'], record['datetime']))    # Insert records in chunks/batches    for chunk in chunks(data_tuples, batch_size):        cursor.executemany(query, chunk)        print(f"Inserted batch of {len(chunk)} records")    conn.commit()    cursor.close()    conn.close()    print("-"*50)    # Wait until starting to insert live values    time.sleep(60)    while True:        if is_market_open():            print("Market is open. Fetching data.")            print("Fetching data from yfinance...")            data = yf.download(tickers, period="1d", interval="1m", group_by="ticker", timeout=10) # added timeout            print("Data fetched from yfinance.")            print(data.head().to_string())                      ticker_data = []            for ticker in tickers:                latest_data = data[ticker].iloc[-1]                ticker_data.append({                    'ticker': ticker,                    'open': latest_data['Open'],                    'high': latest_data['High'],                    'low': latest_data['Low'],                    'close': latest_data['Close'],                    'volume': latest_data['Volume'],                    'datetime': latest_data.name.strftime('%Y-%m-%d %H:%M:%S')                })                # Insert the data                conn = connect_to_db()                cursor = conn.cursor()                print("Inserting data")                total_tickers = len(ticker_data)                for record in ticker_data:                    for key, value in record.items():                        if pd.isna(value):                            record[key] = "NULL"                    query = f"""INSERT INTO ticker_history (ticker, open, high, low, close, volume, datetime)                                VALUES (                                    '{record['ticker']}',{record['open']},{record['high']},{record['low']},{record['close']},{record['volume']},'{record['datetime']}')"""                    print(query)                    cursor.execute(query)                print("Data inserted")                conn.commit()                cursor.close()                conn.close()            print("Inserted data, waiting for the next batch in one minute.")            print("-"*50)            time.sleep(60)        else:            print("Market is closed. Waiting...")            print("-"*50)            time.sleep(60)  # Wait for 60 seconds before checking againWithin its code, the script seamlessly navigates through a series of well-defined stages:Database Connectivity: The script initiates by establishing a secure connection to our MySQL database. With the aid of the connect_to_db() function, a connection is created while the wait_for_db() mechanism guarantees the script's execution occurs only once the database service is fully primed.Market Schedule Evaluation: Vital to the script's operation is the is_market_open() function, which determines the market's operational status. By leveraging the pandas_market_calendars package, this function ascertains whether the New York Stock Exchange (NYSE) is currently active.Data Retrieval and Integration: During its maiden voyage, fetcher.py fetches historical stock data from the past five days for a specified list of tickers—typically major entities such as AAPL and GOOGL. This data is meticulously processed and subsequently integrated into the tickers_db database. 
During subsequent cycles, while the market is live, the script periodically procures real-time data at one-minute intervals.Batched Data Injection: Handling substantial volumes of stock data necessitates an efficient approach. To address this, the script ingeniously partitions the data into manageable chunks and employs batched SQL INSERT statements to populate our database. This technique ensures optimal performance and streamlined data insertion.Now let’s discuss the Dockerfile that defines this container. The Dockerfile is the blueprint for building the ticker-fetcher container. It dictates how the Python environment will be set up inside the container.FROM python:3 WORKDIR /app COPY requirements.txt . RUN pip install --no-cache-dir -r requirements.txt COPY . . CMD ["python", "-u", "fetcher.py"] Base Image: We start with a basic Python 3 image.Working Directory: The working directory inside the container is set to /app.Dependencies Installation: After copying the requirements.txt file into our container, the RUN command installs all necessary Python packages.Starting Point: The script's entry point, fetcher.py, is set as the command to run when the container starts.A list of Python packages needed to run our fetcher script:mysql-connector-python yfinance pandas-market-calendarsmysql-connector-python: Enables the script to connect to our MySQL database.yfinance: Fetches stock data from Yahoo Finance.pandas-market-calendars: Determines the NYSE's operational schedule.As a symphony of technology, the ticker-fetcher container epitomizes precision and reliability, acting as the conduit that channels real-time financial data into our comprehensive architecture.Through this foundational component, the chatbot's prowess in delivering instantaneous, accurate insights comes to life, representing a significant stride toward revolutionizing the interaction between users and financial data.With a continuously updating financial database at our disposal, the next logical step is to harness the potential of Large Language Models. The subsequent section will explore how we integrate LLMs using LangChain, allowing our chatbot to transform raw stock data into insightful conversations.Leveraging Large Language Models with SQL using LangChainThe beauty of modern chatbot systems lies in the synergy between the vast knowledge reservoirs of Large Language Models (LLMs) and real-time, structured data from databases. LangChain is a bridge that efficiently connects these two worlds, enabling seamless interactions between LLMs and databases such as SQL.The marriage between LLMs and SQL databases opens a world of possibilities. With LangChain as the bridge, LLMs can effectively query databases, offering dynamic responses based on stored data. This section delves into the core of our setup, the utils.py file. Here, we knit together the MySQL database with our Streamlit application, defining the agent that stands at the forefront of database interactions.LangChain is a library designed to facilitate the union of LLMs with structured databases. It provides utilities and agents that can direct LLMs to perform database operations via natural language prompts. 
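Before wiring the agent up, it helps to keep in mind the table it will be querying: ticker_history, created by the db/setup.sql script mounted in the compose file. That script is not reproduced in the article, but a minimal equivalent, inferred from the columns fetcher.py inserts (so the repository's actual schema may differ), could be created from Python like this:

# Hypothetical stand-in for db/setup.sql: creates the ticker_history table
# with the columns fetcher.py writes. The real setup script may differ.
# Run it where the hostname "db" resolves, i.e. inside a container on the
# compose network.
import mysql.connector

ddl = """
CREATE TABLE IF NOT EXISTS ticker_history (
    id INT AUTO_INCREMENT PRIMARY KEY,
    ticker VARCHAR(16) NOT NULL,
    `open` DOUBLE,
    high DOUBLE,
    low DOUBLE,
    `close` DOUBLE,
    volume BIGINT,
    `datetime` DATETIME
)
"""

conn = mysql.connector.connect(
    host="db", port=3306, user="my_user", password="my_pass", database="tickers_db"
)
cursor = conn.cursor()
cursor.execute(ddl)
conn.commit()
cursor.close()
conn.close()

In practice the same statements simply live in db/setup.sql, which the MySQL image runs automatically on first start because the compose file mounts it into /docker-entrypoint-initdb.d/.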
Instead of a user having to craft SQL queries manually, they can simply ask a question in plain English, which the LLM interprets and translates into the appropriate SQL query.Below, we present the code if utils.py, that brings our LLM-database interaction to life:from langchain import PromptTemplate, FewShotPromptTemplate from langchain.prompts.example_selector import LengthBasedExampleSelector from langchain.llms import OpenAI from langchain.sql_database import SQLDatabase from langchain.agents import create_sql_agent from langchain.agents.agent_toolkits import SQLDatabaseToolkit from langchain.agents.agent_types import AgentType # Database credentials DB_USER = 'my_user' DB_PASSWORD = 'my_pass' DB_NAME = 'tickers_db' DB_HOST = 'db' DB_PORT = 3306 mysql_uri = f"mysql+mysqlconnector://{DB_USER}:{DB_PASSWORD}@{DB_HOST}:{DB_PORT}/{DB_NAME}" # Initialize the SQLDatabase and SQLDatabaseToolkit db = SQLDatabase.from_uri(mysql_uri) toolkit = SQLDatabaseToolkit(db=db, llm=OpenAI(temperature=0)) # Create SQL agent agent_executor = create_sql_agent(    llm=OpenAI(temperature=0),    toolkit=toolkit,    verbose=True,    agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION, ) # Modified the generate_response function to now use the SQL agent def query_db(prompt):    return agent_executor.run(prompt)Here's a breakdown of the key components:Database Credentials: These credentials are required to connect our application to the tickers_db database. The MySQL URI string represents this connection.SQLDatabase Initialization: Using the SQLDatabase.from_uri method, LangChain initializes a connection to our database. The SQLDatabaseToolkit provides a set of tools that help our LLM interact with the SQL database.Creating the SQL Agent: The SQL agent, in our case a ZERO_SHOT_REACT_DESCRIPTION type, is the main executor that takes a natural language prompt and translates it into SQL. It then fetches the data and returns it in a comprehensible manner. The agent uses the OpenAI model (an instance of LLM) and the aforementioned toolkit to accomplish this.The query_db function: This is the interface to our SQL agent. Upon receiving a prompt, it triggers the SQL agent to run and then returns the response.The architecture is now in place: on one end, we have a constant stream of financial data being fetched and stored in tickers_db. On the other, we have an LLM ready to interpret and answer user queries. The user might ask, "What was the closing price of AAPL yesterday?" and our system will seamlessly fetch this from the database and provide a well-crafted response, all thanks to LangChain.In the forthcoming section, we'll discuss how we present this powerful capability through an intuitive interface using Streamlit. This will enable end-users, irrespective of their technical proficiency, to harness the combined might of LLMs and structured databases, all with a simple chat interface.Building an Interactive Chatbot with Streamlit, OpenAI, and LangChainIn the age of user-centric design, chatbots represent an ideal solution for user engagement. But while typical chatbots can handle queries, imagine a chatbot powered by both Streamlit for its front end and a Large Language Model (LLM) for its backend intelligence. This powerful union allows us to provide dynamic, intelligent responses, leveraging our stored financial data to answer user queries.The grand finale of our setup is the Streamlit application. 
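Before the web interface enters the picture, the agent defined in utils.py can be exercised on its own. A standalone check of query_db might look like the following; the question is only an example, and the answer depends on whatever rows the fetcher has stored so far:

# Hypothetical smoke test for the SQL agent; run it inside the app container,
# with OPENAI_API_KEY set and the db service reachable.
from utils import query_db

question = "What was the most recent closing price of AAPL?"
answer = query_db(question)   # the agent writes, runs, and summarizes the SQL for us
print(answer)

Because the agent is created with verbose=True, the intermediate SQL it generates is printed as well, which is handy for checking that the natural-language question was translated into a sensible query.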
This intuitive web interface allows users to converse with our chatbot in natural language, making database queries feel like casual chats. Behind the scenes, it leverages the power of the SQL agent, tapping into the real-time financial data stored in our database, and presenting users with instant, accurate insights.Let's break down our chatbot's core functionality, designed using Streamlit:import streamlit as st from streamlit_chat import message from streamlit_extras.colored_header import colored_header from streamlit_extras.add_vertical_space import add_vertical_space from utils import * # Now the Streamlit app # Sidebar contents with st.sidebar:    st.title('Financial QnA Engine')    st.markdown('''    ## About    This app is an LLM-powered chatbot built using:    - Streamlit    - Open AI Davinci LLM Model    - LangChain    - Finance    ''')    add_vertical_space(5)    st.write('Running in Docker!') # Generate empty lists for generated and past. ## generated stores AI generated responses if 'generated' not in st.session_state:    st.session_state['generated'] = ["Hi, how can I help today?"] ## past stores User's questions if 'past' not in st.session_state:    st.session_state['past'] = ['Hi!'] # Layout of input/response containers input_container = st.container() colored_header(label='', description='', color_name='blue-30') response_container = st.container() # User input ## Function for taking user provided prompt as input def get_text():    input_text = st.text_input("You: ", "", key="input")    return input_text ## Applying the user input box with input_container:    user_input = get_text() # Response output ## Function for taking user prompt as input followed by producing AI generated responses def generate_response(prompt):    response = query_db(prompt)    return response ## Conditional display of AI generated responses as a function of user provided prompts with response_container:    if user_input:        response = generate_response(user_input)        st.session_state.past.append(user_input)        st.session_state.generated.append(response)          if st.session_state['generated']:        for i in range(len(st.session_state['generated'])):            message(st.session_state['past'][i], is_user=True, key=str(i) + '_user',avatar_style='identicon',seed=123)            message(st.session_state["generated"][i], key=str(i),avatar_style='icons',seed=123)The key features of the application are in general:Sidebar Contents: The sidebar provides information about the chatbot, highlighting the technologies used to power it. With the aid of streamlit-extras, we've added vertical spacing for visual appeal.User Interaction Management: Our chatbot uses Streamlit's session_state to remember previous user interactions. The 'past' list stores user queries, and 'generated' stores the LLM-generated responses.Layout: With Streamlit's container feature, the layout is neatly divided into areas for user input and the AI response.User Input: Users interact with our chatbot using a simple text input box, where they type in their query.AI Response: Using the generate_response function, the chatbot processes user input, fetching data from the database using LangChain and LLM to generate an appropriate response. These responses, along with past interactions, are then dynamically displayed in the chat interface using the message function from streamlit_chat.Now in order to ensure portability and ease of deployment, our application is containerized using Docker. 
Below is the Dockerfile that aids in this process:FROM python:3 WORKDIR /app COPY requirements.txt . RUN pip install --no-cache-dir -r requirements.txt COPY . . CMD ["streamlit", "run", "streamlit_app.py"]This Dockerfile:Starts with a base Python 3 image.Sets the working directory in the container to /app.Copies the requirements.txt into the container and installs the necessary dependencies.Finally, it copies the rest of the application and sets the default command to run the Streamlit application.Our application depends on several Python libraries which are specified in requirements.txt:streamlit streamlit-chat streamlit-extras mysql-connector-python openai==0.27.8 langchain==0.0.225These include:streamlit: For the main frontend application.streamlit-chat & streamlit-extras: For enhancing the chat interface and adding utility functions like colored headers and vertical spacing.mysql-connector-python: To interact with our MySQL database.openai: For accessing OpenAI's Davinci model.langchain: To bridge the gap between LLMs and our SQL database.Our chatbot application combines now a user-friendly frontend interface with the immense knowledge and adaptability of LLMs.We can verify that the values provided by the chatbot actually reflect the data in the database:In the next section, we'll dive deeper into deployment strategies, ensuring users worldwide can benefit from our financial Q&A engine.ConclusionAs we wrap up this comprehensive guide, we have used from raw financial data to a fully-fledged, AI-powered chatbot, we've traversed a myriad of technologies, frameworks, and paradigms to create something truly exceptional.We initiated our expedition by setting up a MySQL database brimming with financial data. But the real magic began when we introduced LangChain to the mix, establishing a bridge between human-like language understanding and the structured world of databases. This amalgamation ensured that our application could pull relevant financial insights on the fly, using natural language queries. Streamlit stood as the centerpiece of our user interaction. With its dynamic and intuitive design capabilities, we crafted an interface where users could communicate effortlessly. Marrying this with the vast knowledge of OpenAI's Davinci LLM, our chatbot could comprehend, reason, and respond, making financial inquiries a breeze. To ensure that our application was both robust and portable, we harnessed the power of Docker. This ensured that irrespective of the environment, our chatbot was always ready to assist, without the hassles of dependency management.We've showcased just the tip of the iceberg. The bigger picture is captivating, revealing countless potential applications. Deploying such tools in firms could help clients get real-time insights into their portfolios. Integrating such systems with personal finance apps can help individuals make informed decisions about investments, savings, or even daily expenses. The true potential unfolds when you, the reader, take these foundational blocks and experiment, innovate, and iterate. The amalgamation of LLMs, databases, and interactive interfaces like Streamlit opens a realm of possibilities limited only by imagination.As we conclude, remember that in the world of technology, learning is a continuous journey. What we've built today is a stepping stone. Challenge yourself, experiment with new data sets, refine the user experience, or even integrate more advanced features. The horizon is vast, and the opportunities, endless. 
Embrace this newfound knowledge and craft the next big thing in financial technology. After all, every revolution starts with a single step, and today, you've taken many. Happy coding!Author BioAlan Bernardo Palacio is a data scientist and an engineer with vast experience in different engineering fields. His focus has been the development and application of state-of-the-art data products and algorithms in several industries. He has worked for companies such as Ernst and Young, and Globant, and now holds a data engineer position at Ebiquity Media helping the company to create a scalable data pipeline. Alan graduated with a Mechanical Engineering degree from the National University of Tucuman in 2015, participated as the founder of startups, and later on earned a Master's degree from the faculty of Mathematics at the Autonomous University of Barcelona in 2017. Originally from Argentina, he now works and resides in the Netherlands.LinkedIn
Large Language Models (LLMs) and Knowledge Graphs

Mostafa Ibrahim
15 Nov 2023
7 min read
Dive deeper into the world of AI innovation and stay ahead of the AI curve! Subscribe to our AI_Distilled newsletter for the latest insights. Don't miss out – sign up today!IntroductionHarnessing the power of AI, this article explores how Large Language Models (LLMs) like OpenAI's GPT can analyze data from Knowledge Graphs to revolutionize data interpretation, particularly in healthcare. We'll illustrate a use case where an LLM assesses patient symptoms from a Knowledge Graph to suggest diagnoses, showcasing LLM’s potential to support medical diagnostics with precision.Brief Introduction Into Large Language Models (LLMs)Large Language Models (LLMs), such as OpenAI's GPT series, represent a significant advancement in the field of artificial intelligence. These models are trained on vast datasets of text, enabling them to understand and generate human-like language.LLMs are adept at understanding complex questions and providing appropriate responses, akin to human analysis. This capability stems from their extensive training on diverse datasets, allowing them to interpret context and generate relevant text-based answers.While LLMs possess advanced data processing capabilities, their effectiveness is often limited by the static nature of their training data. Knowledge Graphs step in to fill this gap, offering a dynamic and continuously updated source of information. This integration not only equips LLMs with the latest data, enhancing the accuracy and relevance of their output but also empowers them to solve more complex problems with a greater level of sophistication. As we harness this powerful combination, we pave the way for innovative solutions across various sectors that demand real-time intelligence, such as the ever-fluctuating stock market.Exploring Knowledge Graphs and How LLMs Can Benefit From ThemKnowledge Graphs represent a pivotal advancement in organizing and utilizing data, especially in enhancing the capabilities of Large Language Models (LLMs).Knowledge Graphs organize data in a graph format, where entities (like people, places, and things) are nodes, and the relationships between them are edges. This structure allows for a more nuanced representation of data and its interconnected nature. Take the above Knowledge Graph as an example.Doctor Node: This node represents the doctor. It is connected to the patient node with an edge labeled "Patient," indicating the doctor-patient relationship.Patient Node (Patient123): This is the central node representing a specific patient, known as "Patient123." It serves as a junction point connecting to various symptoms that the patient is experiencing.Symptom Nodes: There are three separate nodes representing individual symptoms that the patient has: "Fever," "Cough," and "Shortness of breath." Each of these symptoms is connected to the patient node by edges labeled "Symptom," indicating that these are the symptoms experienced by "Patient123.          To simplify, the Knowledge Graph shows that "Patient123" is a patient of the "Doctor" and is experiencing three symptoms: fever, cough, and shortness of breath. This type of graph is useful in medical contexts where it's essential to model the relationships between patients, their healthcare providers, and their medical conditions or symptoms. 
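To make the structure concrete, the same doctor-patient-symptom relationships can be written down as plain triples. A small sketch using rdflib (the library used in the walkthrough below) might look like this; the doctor identifier and the namespace are illustrative rather than part of the original example:

import rdflib

# Illustrative namespace and identifiers; only Patient123 and the three
# symptoms come from the example above, the doctor node is hypothetical.
EX = rdflib.Namespace("http://example.org/health/")
g = rdflib.Graph()

doctor = EX["DoctorSmith"]
patient = EX["Patient123"]

g.add((doctor, EX.hasPatient, patient))  # Doctor -> Patient edge
for symptom in ("Fever", "Cough", "Shortness of breath"):
    g.add((patient, EX.hasSymptom, rdflib.Literal(symptom)))  # Patient -> Symptom edges

print(len(g))  # 4 triples: one doctor-patient edge plus three symptom edges

Querying the graph then becomes a matter of pattern matching over these triples, which is exactly what the SPARQL query used later in the walkthrough does.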
It allows for easy querying of related data—for example, finding all symptoms associated with a particular patient or identifying all patients experiencing a certain symptom.Practical Integration of LLMs and Knowledge GraphsStep 1: Installing and Importing the Necessary LibrariesIn this step, we're going to bring in two essential libraries: rdflib for constructing our Knowledge Graph and openai for tapping into the capabilities of GPT, the Large Language Model.!pip install rdflib !pip install openai==0.28 import rdflib import openaiStep 2: Import your Personal OPENAI API KEYopenai.api_key = "Insert Your Personal OpenAI API Key Here"Step 3: Creating a Knowledge Graph# Create a new and empty Knowledge graph g = rdflib.Graph() # Define a Namespace for health-related data namespace = rdflib.Namespace("http://example.org/health/")Step 4: Adding data to Our GraphIn this part of the code, we will introduce a single entry to the Knowledge Graph pertaining to patient124. This entry will consist of three distinct nodes, each representing a different symptom exhibited by the patient.def add_patient_data(patient_id, symptoms):    patient_uri = rdflib.URIRef(patient_id)      for symptom in symptoms:        symptom_predicate = namespace.hasSymptom        g.add((patient_uri, symptom_predicate, rdflib.Literal(symptom))) # Example of adding patient data add_patient_data("Patient123", ["fever", "cough", "shortness of breath"])Step 5: Identifying the get_stock_price functionWe will utilize a simple query in order to extract the required data from the knowledge graph.def get_patient_symptoms(patient_id):    # Correctly reference the patient's URI in the SPARQL query    patient_uri = rdflib.URIRef(patient_id)    sparql_query = f"""        PREFIX ex: <http://example.org/health/>        SELECT ?symptom        WHERE {{            <{patient_uri}> ex:hasSymptom ?symptom.        }}    """    query_result = g.query(sparql_query)    symptoms = [str(row.symptom) for row in query_result]    return symptomsStep 6: Identifying the generate_llm_response functionThe generate_daignosis_response function takes as input the user’s name along with the list of symptoms extracted from the graph. Moving on, the LLM uses such data in order to give the patient the most appropriate diagnosis.def generate_diagnosis_response(patient_id, symptoms):    symptoms_list = ", ".join(symptoms)    prompt = f"A patient with the following symptoms - {symptoms_list} - has been observed. Based on these symptoms, what could be a potential diagnosis?"      # Placeholder for LLM response (use the actual OpenAI API)    llm_response = openai.Completion.create(        model="text-davinci-003",        prompt=prompt,        max_tokens=100    )    return llm_response.choices[0].text.strip() # Example usage patient_id = "Patient123" symptoms = get_patient_symptoms(patient_id) if symptoms:    diagnosis = generate_diagnosis_response(patient_id, symptoms)    print(diagnosis) else:    print(f"No symptoms found for {patient_id}.")Output: The potential diagnosis could be pneumonia. Pneumonia is a type of respiratory infection that causes symptoms including fever, cough, and shortness of breath. 
Other potential diagnoses should be considered as well and should be discussed with a medical professional.

As demonstrated, the LLM connected the three symptoms—fever, cough, and shortness of breath—to suggest that Patient123 may be diagnosed with pneumonia.

Conclusion
In summary, the collaboration of Large Language Models and Knowledge Graphs represents a substantial advancement in the realm of data analysis. This article has provided a straightforward illustration of their potential when working in tandem, with LLMs efficiently extracting and interpreting data from Knowledge Graphs. As we further develop and refine these technologies, they hold the promise of significantly improving analytical capabilities and informing more sophisticated decision-making in an increasingly data-driven world.

Author Bio
Mostafa Ibrahim is a dedicated software engineer based in London, where he works in the dynamic field of Fintech. His professional journey is driven by a passion for cutting-edge technologies, particularly in the realms of machine learning and bioinformatics. When he's not immersed in coding or data analysis, Mostafa loves to travel.

Medium
Fine-Tuning LLaMA 2

Prakhar Mishra
06 Nov 2023
9 min read
Dive deeper into the world of AI innovation and stay ahead of the AI curve! Subscribe to our AI_Distilled newsletter for the latest insights. Don't miss out – sign up today!

Introduction
Large Language Models have recently become the talk of the town. I am very sure you must have heard of ChatGPT. Yes, that's an LLM, and that's what I am talking about. Every few weeks, we have been witnessing newer, better, but not necessarily larger LLMs coming out, either as open-source or closed-source. This is probably the best time to learn about them and make these powerful models work for your specific use case.

In today's blog, we will look into one of the recent open-source models called Llama 2 and try to fine-tune it on the standard NLP task of recognizing entities from text. We will first look into what large language models are, what open-source and closed-source models are, and some examples of them. We will then move on to learning about Llama 2 and why it is so special. We then describe our NLP task and dataset. Finally, we get into coding.

About Large Language Models (LLMs)
Language models are artificial intelligence systems that have been trained to understand and generate human language. Large Language Models (LLMs) like GPT-3, ChatGPT, GPT-4, Bard, and similar models can perform diverse sets of tasks out of the box. Often, the quality of output from these large language models is highly dependent on the finesse of the prompt given by the user.

These language models are trained on vast amounts of text data from the Internet. Most language models are trained in an auto-regressive way, i.e., they try to maximize the probability of the next word based on the words they have produced or seen in the past. This data includes a wide range of written text, from books and articles to websites and social media posts. Language models have a wide range of applications, including chatbots, virtual assistants, content generation, and more. They can be used in industries like customer service, healthcare, finance, and marketing.

Since these models are trained on enormous amounts of data, they are already good at zero-shot inference and can be steered to perform better with few-shot examples. Zero-shot is a setup in which a model can recognize things it hasn't explicitly seen before in training. In a few-shot setting, the goal is to make predictions for new classes based on the few examples of labeled data provided to the model at inference time.

Despite their amazing text-generation capabilities, these humongous models come with a few limitations that must be kept in mind when building an LLM-based production pipeline. Some of these limitations are hallucinations, biases, and more.

Closed and Open-source Language Models
Closed-source large language models are those employed by some companies and not readily accessible to the public. Training data for these models is typically kept private. While they can be highly sophisticated, this limits transparency, potentially leading to concerns about bias and data privacy.

In contrast, open-source models, such as GPT-NeoX or Llama 2 itself, are designed to be freely available to researchers and developers. These models are trained on extensive, publicly available datasets, allowing for a degree of transparency and collaboration.

The decision between closed- and open-source language models is influenced by several variables, such as the project's goals, the need for openness, and others.

About Llama 2
Meta's open-source LLM is called Llama 2.
It was trained with 2 trillion "tokens" from publicly available sources like Wikipedia, Common Crawl, and books from the Gutenberg project. Three model sizes are available: 7 billion, 13 billion, and 70 billion parameters. There are two types of completion models available: chat-tuned and general. The chat-tuned models that have been fine-tuned for chatbot-like dialogue are denoted by the suffix '-chat'. We will use Meta's general 7B Llama 2 Hugging Face model as the base model that we fine-tune. Feel free to use any other version of llama2-7b. Also, if you are interested, there are several threads that you can go through to understand how good the Llama family is relative to the GPT family - source, source, source.

About Named Entity Recognition
As a component of information extraction, named-entity recognition locates and categorizes specific entities inside unstructured text by allocating them to pre-defined groups, such as individuals, organizations, locations, measures, and more. NER offers a quick way to understand the core idea or content of a lengthy text.

There are many ways of extracting entities from a given text; in this blog, we will specifically delve into fine-tuning Llama2-7b using PEFT techniques in a Colab notebook. We will transform the SMSSpamCollection classification dataset for NER. Pretty interesting 😀

We search through all 10-letter words and tag them as 10_WORDS_LONG. And this is the entity that we want our Llama to extract. But why this bizarre formulation? I did it intentionally to show that this is something the pre-trained model would not have seen during the pre-training stage. So it becomes essential to fine-tune it and make it work for our use case 👍. But surely we can add logic to our formulation - think of these words as probable outliers/noisy words. The longer the word, the higher the possibility of it being noise/OOV. However, you will have to come up with the exact letter count after looking at the word-length distribution. Please note that the code is generic enough for fine-tuning any number of entities; it's just a change in the data preparation step to slice out the relevant entities.
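Before diving into the full pipeline, here is a tiny sketch of the labeling rule itself; the sample sentence is made up purely for illustration and is not taken from the SMSSpamCollection dataset.

# Illustrative only: the sample sentence below is invented, not from SMSSpamCollection.
sample = "please complete the assessment and update the dictionary entries today"

# Tag every 10-letter word as 10_WORDS_LONG, mirroring the rule described above.
labels = {word: "10_WORDS_LONG" for word in sample.split() if len(word) == 10}

print(labels)
# {'assessment': '10_WORDS_LONG', 'dictionary': '10_WORDS_LONG'}

The full data preparation step applies exactly this dictionary comprehension to every message in the dataset.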
Code for Fine-tuning Llama2-7b

# Importing Libraries
from transformers import AutoTokenizer, AutoModelForCausalLM, LlamaTokenizer, LlamaForCausalLM
import torch
from datasets import Dataset
import transformers
import pandas as pd
from peft import get_peft_model, LoraConfig, TaskType, prepare_model_for_int8_training, get_peft_model_state_dict, PeftModel
from sklearn.utils import shuffle

Data Preparation Phase

df = pd.read_csv('SMSSpamCollection', sep='\t', header=None)
all_text = df[1].str.lower().tolist()
input, output = [], []
for text in all_text:
    input.append(text)
    output.append({word: '10_WORDS_LONG' for word in text.split() if len(word)==10})
df = pd.DataFrame([input, output]).T
df.rename({0:'input_text', 1: 'output_text'}, axis=1, inplace=True)
print(df.head(5))
total_ds = shuffle(df, random_state=42)
total_train_ds = total_ds.head(4000)
total_test_ds = total_ds.tail(1500)
total_train_ds_hf = Dataset.from_pandas(total_train_ds)
total_test_ds_hf = Dataset.from_pandas(total_test_ds)
# Note: generate_and_tokenize_prompt is defined in the fine-tuning phase below;
# make sure it is defined before running these two map calls.
tokenized_tr_ds = total_train_ds_hf.map(generate_and_tokenize_prompt)
tokenized_te_ds = total_test_ds_hf.map(generate_and_tokenize_prompt)

Fine-tuning Phase

# Loading Model
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def create_peft_config(m):
    peft_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        inference_mode=False,
        r=8,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=['q_proj', 'v_proj'],
    )
    m = prepare_model_for_int8_training(m)
    m.enable_input_require_grads()
    m = get_peft_model(m, peft_config)
    m.print_trainable_parameters()
    return m, peft_config

model, lora_config = create_peft_config(model)

def generate_prompt(data_point):
    return f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
Extract entity from the given input:

### Input:
{data_point["input_text"]}

### Response:
{data_point["output_text"]}"""

tokenizer.pad_token_id = 0

def tokenize(prompt, add_eos_token=True):
    result = tokenizer(
        prompt,
        truncation=True,
        max_length=128,
        padding=False,
        return_tensors=None,
    )
    if (
        result["input_ids"][-1] != tokenizer.eos_token_id
        and len(result["input_ids"]) < 128
        and add_eos_token
    ):
        result["input_ids"].append(tokenizer.eos_token_id)
        result["attention_mask"].append(1)
    result["labels"] = result["input_ids"].copy()
    return result

def generate_and_tokenize_prompt(data_point):
    full_prompt = generate_prompt(data_point)
    tokenized_full_prompt = tokenize(full_prompt)
    return tokenized_full_prompt

training_arguments = transformers.TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=4e-05,
    logging_steps=100,
    optim="adamw_torch",
    evaluation_strategy="steps",
    save_strategy="steps",
    eval_steps=100,
    save_steps=100,
    output_dir="saved_models/"
)

data_collator = transformers.DataCollatorForSeq2Seq(tokenizer)
trainer = transformers.Trainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=tokenized_tr_ds,
    eval_dataset=tokenized_te_ds,
    args=training_arguments,
    data_collator=data_collator
)
with torch.autocast("cuda"):
    trainer.train()

Inference

loaded_tokenizer = LlamaTokenizer.from_pretrained(model_name)
loaded_model = LlamaForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto"
)
peft_model = PeftModel.from_pretrained(loaded_model, "saved_model_path", torch_dtype=torch.float16)
peft_model.config.pad_token_id = loaded_tokenizer.pad_token_id = 0
peft_model.eval()

def extract_entity(text):
    inp = loaded_tokenizer(text, return_tensors="pt").to("cuda")
    with torch.no_grad():
        p_ent = loaded_tokenizer.decode(
            peft_model.generate(**inp, max_new_tokens=128)[0],
            skip_special_tokens=True
        )
        int_idx = p_ent.find("Response:")
        p_ent = p_ent[int_idx + len("Response:"):]
    return p_ent.strip()

# 'text' should hold the full prompt (in the same format used during training) for the example you want to run.
extracted_entity = extract_entity(text)
print(extracted_entity)

Conclusion
In this blog post, we covered the process of optimizing the Llama2-7b model for a Named Entity Recognition task. For that matter, it could be any task that you are interested in. The core concept to take away from this blog is PEFT-based training of large language models. Additionally, as pre-trained LLMs might not always perform well on your task, it is often best to fine-tune these models.

Author Bio
Prakhar Mishra has a Master's in Data Science with over 4 years of experience in industry across various sectors like Retail, Healthcare, Consumer Analytics, etc. His research interests include Natural Language Understanding and generation, and he has published multiple research papers in reputed international publications in the relevant domain. Feel free to reach out to him on LinkedIn.
Making the best out of Hugging Face Hub using LangChain

Ms. Valentina Alto
17 Jun 2023
6 min read
Since the launch of ChatGPT in November 2022, everyone is talking about GPT models and OpenAI. There is no doubt that the Generative Pre-trained Transformers (GPT) architecture developed by OpenAI has demonstrated incredible results, also given the scale of training (a corpus of almost 500 billion tokens) and the complexity of the model (175 billion parameters for GPT-3).

Nevertheless, an incredible number of open-source Large Language Models (LLMs) have spread widely in recent months. Below are some examples:

Dolly: A 12-billion-parameter LLM developed by Databricks and trained on their ML platform. Source code: https://github.com/databrickslabs/dolly

StableLM: A series of LLMs developed by StabilityAI, the company behind the popular image generation model Stable Diffusion. The series encompasses a variety of LLMs, some of which are fine-tuned for specific use cases. Source code: https://github.com/Stability-AI/StableLM

Falcon LLM: A 40-billion-parameter LLM developed by the Technology Innovation Institute and trained on a particularly high-quality dataset called RefinedWeb. Plus, as of now (June 2023), it ranks first globally in the latest Hugging Face independent verification of open-source AI models. Source code: https://huggingface.co/tiiuae

GPT-NeoX and GPT-J: Open-source reproductions of OpenAI's GPT series developed by EleutherAI, with 20 and 6 billion parameters respectively. Source code: https://huggingface.co/EleutherAI/gpt-neox-20b and https://huggingface.co/EleutherAI/gpt-j-6b

OpenLLaMA: Like the previous models, this is an open-source reproduction of Meta AI's LLaMA, available in multiple sizes, including 3 and 7 billion parameters. Source code: https://github.com/openlm-research/open_llama

If you are interested in digging deeper into those models and their performance, you can reference the Hugging Face leaderboard here.

Image1: Hugging Face Leaderboard

Now, LLMs are great, yet to unlock their real power we need them to be positioned within application logic. In other words, we want our LLMs to infuse intelligence into our applications.

For this purpose, we will be using LangChain, a powerful, lightweight SDK which makes it easier to integrate and orchestrate LLMs within applications. LangChain is one of the most popular LLM orchestrators, yet if you want to explore further packages I encourage you to read about Semantic Kernel and Jarvis.

One of the nice things about LangChain is its integration with external tools: those might be OpenAI (and other LLM vendors), data sources, search APIs, and so on. In this article, we are going to explore how LangChain makes it easier to leverage open-source LLMs through its integration with the Hugging Face Hub.

Welcome to the realm of open source LLMs
The Hugging Face Hub serves as a comprehensive platform comprising more than 120k models, 20k datasets, and 50k demo apps (Spaces), all of which are openly accessible and shared as open-source projects. It provides an online environment where developers can effortlessly collaborate and collectively develop machine learning solutions. Thanks to LangChain, it is way easier to start interacting with open-source LLMs. Plus, you can also surround those models with all the components provided by LangChain in terms of prompt design, memory retention, chain management, and so on.

Let's see an implementation with Python.
To reproduce the code, make sure to have:

Python 3.7.1 or higher: you can check your Python version by running python --version in your terminal.
LangChain installed: you can install it via pip install langchain.
The huggingface_hub Python package installed: you can install it via pip install huggingface_hub.
A Hugging Face Hub API key: to get the API key, you can register on the portal here and then generate your secret key.

For this example, I'm going to use the lightest version of Dolly, developed by Databricks and available in three sizes: 3, 7, and 12 billion parameters.

from langchain import HuggingFaceHub
from getpass import getpass
import os

HUGGINGFACEHUB_API_TOKEN = "your-api-key"
os.environ["HUGGINGFACEHUB_API_TOKEN"] = HUGGINGFACEHUB_API_TOKEN

repo_id = "databricks/dolly-v2-3b"
llm = HuggingFaceHub(repo_id=repo_id, model_kwargs={"temperature":0, "max_length":64})

As you can see from the above code, the only information we need is our Hugging Face Hub API key and the model's repo ID; then, LangChain will take care of initializing our model thanks to its direct integration with the Hugging Face Hub.

Now that we have initialized our model, it is time to define the structure of the prompt:

from langchain import PromptTemplate, LLMChain

template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate(template=template, input_variables=["question"])
llm_chain = LLMChain(prompt=prompt, llm=llm)

Finally, we can feed our model a first question:

question = "In the first movie of Harry Potter, what is the name of the three-headed dog?"
print(llm_chain.run(question))

Output: The name of the three-headed dog in Harry Potter and the Philosopher Stone is Fuffy.

Even though I tested the light version of Dolly with "only" 3 billion parameters, it came back with pretty accurate results. Of course, for more complex tasks or real-world projects, heavier models might be taken into consideration, like the ones emerging as top performers on the Hugging Face leaderboard mentioned at the beginning of this article.

Conclusion
The realm of open-source LLMs is growing exponentially, and this creates a vibrant environment of experimentation and tuning from which anyone can benefit. Plus, some interesting trends are emerging, like the reduction of the number of model parameters in favour of an increase in the quality of the training dataset. In fact, we saw that the current top performer among open-source models is Falcon LLM, with "only" 40 billion parameters, which gained its strength from a high-quality training dataset. Finally, with the development of orchestration frameworks like LangChain and similar, it's getting easier and easier to leverage open-source LLMs and integrate them into our applications.

References
https://huggingface.co/docs/hub/index
Open LLM Leaderboard — a Hugging Face Space by HuggingFaceH4
Hugging Face Hub — 🦜🔗 LangChain 0.0.189
Overview (huggingface.co)
stabilityai (Stability AI) (huggingface.co)
Stability-AI/StableLM: StableLM: Stability AI Language Models (github.com)

Author Bio
Valentina Alto graduated in 2021 in data science. Since 2020, she has been working at Microsoft as an Azure solution specialist, and since 2022, she has been focusing on data and AI workloads within the manufacturing and pharmaceutical industries.
She has been working closely with system integrators on customer projects to deploy cloud architecture with a focus on modern data platforms, data mesh frameworks, IoT and real-time analytics, Azure Machine Learning, Azure Cognitive Services (including Azure OpenAI Service), and Power BI for dashboarding. Since commencing her academic journey, she has been writing tech articles on statistics, machine learning, deep learning, and AI in various publications and has authored a book on the fundamentals of machine learning with Python.Author of the book: Modern Generative AI with ChatGPT and OpenAI ModelsLink - Medium  LinkedIn  
Detecting and Mitigating Hallucinations in LLMs

Ryan Goodman
25 Oct 2023
10 min read
Dive deeper into the world of AI innovation and stay ahead of the AI curve! Subscribe to our AI_Distilled newsletter for the latest insights. Don't miss out – sign up today!

Introduction
In large language models, the term "hallucination" describes a behavior where an AI model produces results that are not entirely accurate or might sound nonsensical. It's important to understand that large language models are not search engines or databases. They do not search for information from external sources or perform complex computations. Instead, large language models (LLMs) belong to a category of generative artificial intelligence.

Recap How Generative AI Works
Generative AI is a technology trained on large volumes of data and, as a result, can "generate" text, images, and even audio. This makes it fundamentally different from search engines and other software tools you might be familiar with. This foundational difference presents challenges, most notably that generative AI can't cite sources for its responses. Large language models are also not designed to solve computational problems like math. However, generative AI can quickly generate code that might solve complex mathematical challenges. A large language model responds to inputs, most notably the text instruction called a "prompt." As the large language model generates text, it uses its training data as a foundation to extrapolate information.

Understanding Hallucinations
The simplest way to understand a hallucination is the old game of telephone. In the same way that a message gets distorted in the game of telephone, information can get "distorted" or "hallucinated" as a language model tries to generate outputs based on patterns it observed in its training data. The model might "misremember" or "misinterpret" certain information, leading to inaccuracies.

Let's use another example, a simple Markov model, to understand the concept of generating unique combinations of words in the context of food recipes. Imagine you want to create new recipes by observing existing ones. If you were to build a Markov model for food ingredients, you would:

1. Compile a comprehensive dataset of recipes and extract individual ingredients.
2. Create pairs of neighboring ingredients, like "tomato-basil" and "chicken-rice," and record how often each pair occurs.

For example, if you start with the ingredient "chicken," you might notice it's frequently paired with "broccoli" and "garlic" but less so with "pineapple." If you then choose "broccoli" as the next ingredient, it might be equally likely to be paired with "cheese" or "lemon." By following these ingredient pairings, at some point, the model might suggest creative combinations like "chicken-pineapple-lemon," offering new culinary ideas based on observed patterns. This approach allows the Markov model to generate novel recipe ideas based on the statistical likelihood of ingredient pairings.

Hallucinations as a Feature
When researching or computing factual information, a hallucination is a bad thing. However, the same behavior that gets a bad rap in factual research is what lets large language models demonstrate another human trait: creativity. As a developer, if you want to make your language model creative, OpenAI, for example, exposes a "temperature" input, a hyperparameter that makes the model's outputs more random. A high temperature of 1 or above will result in more hallucinations and randomness, while a lower temperature of 0.2 will make the model's outputs more deterministic, sticking more closely to the patterns it was trained on.
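As a rough sketch of how that temperature knob is set in code (this uses the pre-1.0 openai Python package with a placeholder API key; the model name and prompt are illustrative assumptions, not taken from the article), you might compare two settings like this:

import openai

openai.api_key = "your-api-key"  # placeholder

prompt = "Invent a brand-new dessert and describe it in one sentence."

# Low temperature: output stays close to patterns seen in training (more deterministic).
conservative = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.2,
)

# High temperature: output is more random and "creative", and more prone to hallucination.
creative = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=1.2,
)

print(conservative.choices[0].message["content"])
print(creative.choices[0].message["content"])

Note that a low temperature does not eliminate hallucinations; it only trades creativity for predictability.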
As an experiment, try prompting any large language model chatbot, including ChatGPT, to provide the plot of a romantic story without copying existing accounts on the internet, with a new storyline and new characters. The LLM will offer a fictitious story with characters, a plot, multiple acts, an arc, and an ending.

In specific scenarios, end users or developers might intentionally coax their large language models into a state of "hallucination." When seeking out-of-the-box ideas that go beyond the model's training, you can get useful abstract ideas. In this scenario, the model's ability to "hallucinate" isn't a bug but rather a feature. To continue the experiment, you can return to ChatGPT and ask it to pretend you have changed the temperature hyperparameter to 1.1 and re-write the story. Your results will be very "creative."

In creative pursuits, like crafting tales or penning poems, these so-called "hallucinations" aren't just tolerated; they're celebrated. They can add layers of depth, surprise, and innovation to the generated content.

Types of hallucination
One can categorize hallucinations into different forms.

Intrinsic hallucination directly contradicts the source material, introducing logical inconsistencies and factual inaccuracies.

Extrinsic hallucination does not contradict the source material, but it cannot be verified against any source either. It adds elements that are unconfirmable and speculative.

Detecting hallucinations
Detecting hallucinations in large language models is a tricky task. LLMs deliver information with the same tone and certainty even when the answer is unknown. This puts the responsibility on users and developers to be careful about how information from LLMs is used. The following techniques can be utilized to uncover or measure hallucinations in large language models.

Identify the grounding data
Grounding data is the standard against which the Large Language Model (LLM) output is measured. The selection of grounding data depends on the specific application. For example, real job resumes could be grounding data for generating resume-related content. Conversely, search engine results could be employed for web-based inquiries. Especially in language translation, the choice of grounding data is pivotal for accurate translation. For example, official legal documents could serve as grounding data for legal translations, ensuring precision in the translated content.

Create a measurement test set
A measurement test data set comprises input/output pairs incorporating human interactions and the Large Language Model (LLM). These datasets often include various input conditions and their corresponding program outputs. These sets may involve simulated interactions between users and software systems, depending on the scenario. Ideally, there should be a minimum of two kinds of test sets:

1. A standard or randomly generated test set that is conventional but caters to diverse scenarios.
2. An adversarial test set built around edge cases, high-risk situations, and deliberately misleading or tricky inputs, including security threats.

Extract any claims
Following the preparation of test data sets, the subsequent stage involves extracting assertions from the Large Language Model (LLM). This extraction can occur manually, through rule-based methodologies, or even by employing machine learning models. Similarly, in data analysis, the next step is to extract specific patterns from the data after gathering the datasets.
This extraction can be done manually or through predefined rules, basic descriptive analytics, or, for large-scale projects, machine learning algorithms. Each method has its merits and drawbacks, which we will thoroughly investigate.Use validations against any grounding dataValidation guarantees that the content generated by the Large Language Model (LLM) corresponds to the grounding data. Frequently, this stage replicates the techniques employed for data extraction.To support the above, here is the code snippet of the same.# Define grounding data (acceptable sentences) grounding_data = [    "The sky is blue.",    "Python is a popular programming language.",    "ChatGPT provides intelligent responses." ] # List of generated sentences to be validated generated_sentences = [    "The sky is blue.",    "ChatGPT is a popular programming language.",    "Python provides intelligent responses." ] # Validate generated sentences against grounding data valid_sentences = [sentence for sentence in generated_sentences if sentence in grounding_data] # Output valid sentences print("Valid Sentences:") for sentence in valid_sentences:    print("- " + sentence) # Output invalid sentences invalid_sentences = list(set(generated_sentences) - set(valid_sentences)) print("\nInvalid Sentences:") for sentence in invalid_sentences:    print("- " + sentence)Output: Valid Sentences: - The sky is blue. Invalid Sentences: - ChatGPT is a popular programming language. - Python provides intelligent responses. Furthermore, in verifying research findings, validation ensures that the conclusions drawn from the research align with the collected data. This process often mirrors the research methods employed earlier.Metrics reportingThe "Grounding Defect Rate" is a crucial metric that measures the proportion of responses lacking context to the total generated outputs. Further metrics will be explored later for a more detailed assessment. For instance, the "Error Rate" is a vital metric indicating the percentage of mistranslated phrases from the translated text. Additional metrics will be introduced later for a comprehensive evaluation.A Multifaceted Approach to Mitigate Hallucination in the Large Language Model Leveraging product designThe developer needs to employ large language models so that it does not create material issues, even when it hallucinates. For example, you would not design an application that writes your annual report or news articles. Instead, opinion pieces or content summarization within a prompt can immediately lower the risk of problematic hallucination.If an app allows AI-generated outputs to be distributed, end users should be able to review and revise the content. It adds a protective layer of scrutiny and puts the responsibility into the hands of the user.Continuous improvement and loggingPersisting prompts and LLM output are essential for auditing purposes. As models evolve, you cannot count on prompting an LLM and getting the same result. However, regression testing and reviewing user input are critical as long as it adheres to data, security, and privacy practice.Prompt engineeringIt is essential to get the best possible output to use the concept of meta prompts effectively. A meta prompt is a high-level instruction given to a language model to guide its output in a specific direction. 
Rather than asking a direct question, provide context, structure, and guidance to refine the output.For example, instead of asking, "What is photosynthesis?", you can ask, "Explain photosynthesis in simple terms suitable for a 5th-grade student." This will adjust the complexity and style of the answer you get.Multi-Shot PromptsMulti-shot prompts refer to a series of prompts given to a language model, often in succession. The goal is to guide the model step-by-step toward a desired output instead of asking for a large chunk of information in a single prompt.  This approach is extremely useful when the required information is complex or extensive. Typically, these prompts are best delivered as a chat user experience, allowing the user and model to break down the requests into multiple, manageable parts.ConclusionThe issue of hallucination in Large Language Models (LLMs) presents a significant hurdle for consumers, users, and developers. While overhauling the foundational architecture of these models isn't a feasible solution for most, the good news is that there are strategies to navigate these challenges. But beyond these technical solutions, there's an ethical dimension to consider. As developers and innovators harness the power of LLMs, it's imperative to prioritize disclosure and transparency. Only through openness can we ensure that LLMs integrate seamlessly into our daily lives and gain the trust and acceptance they require to revolutionize our digital interactions truly.Author BioRyan Goodman has dedicated 20 years to the business of data and analytics, working as a practitioner, executive, and entrepreneur. He recently founded DataTools Pro after 4 years at Reliant Funding, where he served as the VP of Analytics and BI. There, he implemented a modern data stack, utilized data sciences, integrated cloud analytics, and established a governance structure. Drawing from his experiences as a customer, Ryan is now collaborating with his team to develop rapid deployment industry solutions. These solutions utilize machine learning, LLMs, and modern data platforms to significantly reduce the time to value for data and analytics teams.