In the swiftly evolving landscape of artificial intelligence, fine-tuning large language models (LLMs) like GPT-3.5 has become a critical endeavor for developers looking to tailor these powerful tools to specific applications. At our company, we embarked on a journey to fine-tune GPT-3.5, a process that presented unique challenges and invaluable lessons. Here’s a detailed account of our experience, the obstacles we faced, and the strategies that led us to success.
Our first step was selecting the appropriate version of GPT-3.5. After careful consideration, we opted for the "gpt-3.5-turbo-0613" variant, which offered a good balance between computational efficiency and context length. The model's token limit was a critical factor, since it capped how much data we could process in one go.
Initially, our approach to dataset creation was fundamentally flawed. We tried splitting single multi-turn interactions between the AI and a human into separate examples. This method backfired spectacularly, producing a model that behaved erratically and failed to grasp the context of conversations. Recognizing this mistake was a turning point. We revised our dataset so that each example encapsulated an entire conversation, which drastically improved the model's performance and alignment with our goals.
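For illustration, here is roughly what one training example looked like after the fix, with the whole conversation kept together in a single chat-format record. The dialogue is a made-up placeholder, not one of our transcripts, and it is pretty-printed here even though each example occupies a single line in the JSONL file:

```json
{"messages": [
  {"role": "system", "content": "You are a helpful support assistant."},
  {"role": "user", "content": "Hi, I can't log in to my account."},
  {"role": "assistant", "content": "Sorry to hear that. Are you seeing a specific error message?"},
  {"role": "user", "content": "It says my password is invalid."},
  {"role": "assistant", "content": "Let's reset it. I'll send a reset link to your registered email address."}
]}
```

Splitting those turns into separate single-turn records is exactly what produced the erratic behaviour described above.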
After extensive research on formatting data for fine-tuning GPT-3.5 Turbo, we decided to structure our dataset around assistant-user role interactions, the chat format that GPT-3.5 fine-tuning expects. Our data, sourced from PDF transcripts, needed significant conversion. We began by extracting the text from the PDFs into plain text files, applied pre-processing steps to clean the text, and structured the result into CSV files. The final step was converting the CSV files into JSONL, the format the fine-tuning API requires. To ensure effective training and validation, we split the data strategically: transcripts with fewer pages were set aside for validation, while the more extensive ones were used for training. This approach both optimized our dataset and improved how effectively the model learned from it.
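To make the last conversion step concrete, here is a minimal sketch of how a per-turn CSV can be turned into one whole-conversation JSONL example. The column names ("role", "text") and file names are hypothetical, not our exact schema:

```python
import csv
import json

def csv_to_jsonl(csv_path: str, jsonl_path: str) -> None:
    """Append one transcript (one CSV row per turn) as a single
    whole-conversation chat example to a JSONL file."""
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):  # assumed columns: "role", "text"
            messages.append({"role": row["role"], "content": row["text"]})
    with open(jsonl_path, "a", encoding="utf-8") as out:
        out.write(json.dumps({"messages": messages}) + "\n")

# Longer transcripts were appended to the training file,
# shorter ones to the validation file, as described above.
csv_to_jsonl("long_transcript_01.csv", "train.jsonl")
csv_to_jsonl("short_transcript_01.csv", "valid.jsonl")
```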
In our first training session we overlooked including a validation file, yet the model still trained and produced surprisingly decent results. For the next run we added a validation file, which refined our results further. Using OpenAI's Playground, we could monitor real-time metrics for both training and validation loss, which was instrumental in fine-tuning the model's performance. However, as we continued to train the model on more data, it started to deviate from expected behaviors. This anomaly led us to a crucial realization about fine-tuning LLMs: unlike smaller models, LLMs are already trained on vast datasets, and excessive fine-tuning can inadvertently overwrite that existing knowledge rather than refine it, particularly if the new data is voluminous.
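For reference, this is roughly how a fine-tuning job with both a training and a validation file is created through the OpenAI Python SDK (v1.x); the file names and epoch count here are placeholders rather than our exact settings:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the training and validation data prepared earlier
train = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
valid = client.files.create(file=open("valid.jsonl", "rb"), purpose="fine-tune")

# Start the fine-tuning job; a low epoch count limits how far the
# model drifts from its pretrained behaviour
job = client.fine_tuning.jobs.create(
    model="gpt-3.5-turbo-0613",
    training_file=train.id,
    validation_file=valid.id,
    hyperparameters={"n_epochs": 3},
)
print(job.id, job.status)
```

Supplying the validation file is what makes the validation-loss curve available alongside the training loss.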
Armed with this new understanding, we adjusted our fine-tuning strategy to focus primarily on modifying the model's tone rather than its knowledge base. This approach required much less data and aligned better with our objectives, leading to significant improvements in how the model responded.
Ultimately, we deployed our fine-tuned model on Azure. The platform posed its own set of challenges, notably stricter token limits and the absence of real-time loss metrics during fine-tuning, which meant we could not halt the process midway if the losses were excessive. Despite these hurdles we deployed successfully, and we remained mindful of the costs of fine-tuning in terms of data volume, number of epochs, and frequency of tuning. Then came the surprise: Azure cost us around $1,400 in a single month. We had deployed two models, one base and one fine-tuned, and without a single warning beforehand our budget was down by more than $2,000. This is something to be very careful about when deploying to Azure, so it is always good practice to check the Azure OpenAI pricing before deploying any model: https://azure.microsoft.com/en-us/pricing/details/cognitive-services/openai-service/#pricing
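Once the deployment existed, querying the fine-tuned model on Azure looked roughly like this with the OpenAI Python SDK's Azure client; the endpoint, API version, and deployment name below are placeholders:

```python
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com/",
    api_key="<your-azure-openai-key>",
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    model="my-finetuned-deployment",  # Azure takes the deployment name, not the model name
    messages=[{"role": "user", "content": "Summarise the key points of this transcript."}],
)
print(response.choices[0].message.content)
```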
One of the most challenging aspects of our fine-tuning journey was devising a robust evaluation method for our models. Evaluating a fine-tuned model requires a benchmark against which to measure its efficacy. We explored various frameworks such as OpenAI Evals, LangChain, and HumanEval, all of which need a ground truth for accurate assessment. However, creating a reliable ground truth proved to be a formidable task.
After thorough research and deliberation, we sought advice from OpenAI's experts. Surprisingly, they revealed that even within OpenAI, model evaluation often relies on custom chat models created with GPT-4. Armed with this insight, we pivoted our approach to prioritize the creation of a robust ground truth dataset.
We developed several custom chat models, each capable of generating ground-truth files. These files were then used to evaluate responses from our fine-tuned model against our benchmark models: the base model, Gemini, and Claude. Through meticulous evaluation, we determined that our fine-tuned model consistently outperformed the others, validating the effectiveness of our fine-tuning process.
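To give a flavour of the setup, here is a simplified sketch of a GPT-4-based judge in the spirit of what we used: it compares a candidate answer against the ground truth and returns a score. The prompt wording and the 1-to-5 scale are illustrative, not our exact configuration:

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "You are grading a candidate answer against a ground-truth answer. "
    "Reply with a single integer from 1 (completely wrong) to 5 (equivalent)."
)

def judge(question: str, ground_truth: str, candidate: str) -> int:
    """Ask a GPT-4 judge to score one candidate answer on a 1-5 scale."""
    result = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {
                "role": "user",
                "content": (
                    f"Question: {question}\n"
                    f"Ground truth: {ground_truth}\n"
                    f"Candidate answer: {candidate}"
                ),
            },
        ],
    )
    return int(result.choices[0].message.content.strip())
```

Running the same judge over answers from the fine-tuned model, the base model, Gemini, and Claude is what let us compare them on an equal footing.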
Our journey through fine-tuning GPT-3.5 Turbo has been a testament to the complexities and nuances of working with large language models. We've encountered and overcome challenges in dataset formatting, model evaluation, and training strategy, each contributing to a deeper understanding of the fine-tuning process.
One of our key realizations was the critical importance of dataset formatting. Converting our PDF transcripts into the assistant-user role format required for GPT-3.5 training was a meticulous process that significantly impacted the model's performance. Similarly, the evaluation phase posed its own set of challenges, with the need to create a reliable ground truth for accurate assessment.
Through these challenges, we've learned that fine-tuning an LLM is not just about updating its knowledge base but also about refining its tone and behavior. Real-time metrics and iterative adjustments played a crucial role in shaping our model's behavior, allowing us to achieve the desired outcomes efficiently.
As we reflect on our experience, we recognize the value of sharing our journey with the community. We hope that our insights into dataset formatting, evaluation, and training strategies will serve as a roadmap for others embarking on similar endeavors. Whether you're a seasoned AI practitioner or new to the field, these lessons can help you navigate the complexities of fine-tuning large language models and optimize your AI implementations for success.