Deploying a Fine-Tuned GPT-2 Model Using FastAPI

A comprehensive step-by-step guide for beginners

Nandini Bansal
Jan 22, 2023

What does it mean to deploy models?

Fig. Deployment Pipeline (Image by Author)

As per the Cambridge Dictionary, the word “deploy” means “to use something/someone, especially in an effective way”. Similarly, the engineering task of making a machine learning model ready for use in a production environment is known as model deployment.

Let us assume you have fine-tuned a language model for text generation and now want to show off your super-cool model to your friends. What do you do? Sharing the model weights with everyone and setting up the code on each friend’s computer would be very cumbersome: there would be dependency issues, and module installations could take a long time, making this approach infeasible. Ideally, the easiest way would be to have a shareable link that everyone can use to see the cool results of your super-cool model. This is what model deployment is all about.

The most popular way of deploying models in production is using an API. An API endpoint is defined to serve the model for a given POST request. The request body contains the required input for model inference.

If you are deploying a computer vision model, then an image will be sent as a part of the request body.

If you are deploying an NLP model, then the input text will be sent as a part of the request body.

In our case, since we are deploying a language model for text generation, we will include the prompt text in the request body.

For this article, we will use a GPT-2 base model fine-tuned on the Quotes-500K dataset for the task of automatically generating motivational quotes from a short prompt. The model was fine-tuned using Google Colab for one epoch (quite computationally expensive!) and has a perplexity of 15.18.

Perplexity is an evaluation metric for language models that tells you how well a probability model predicts a sample. A low perplexity value signifies good performance of the probability model. If you are interested in understanding the perplexity metric in detail, you may refer to this article: Perplexity in Language Models.
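Concretely, for a test sequence of N tokens, perplexity is the exponentiated average negative log-likelihood of the tokens:

perplexity = exp( -(1/N) * Σ log p(token_i | token_1 … token_i-1) )

So a perplexity of 15.18 roughly means that, on average, the model is as uncertain as if it had to choose uniformly among about 15 tokens at each step.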

You can find the scripts and colab notebook for finetuning the GPT-2 model in this GitHub repository.

Uploading the finetuned model to HuggingFace Hub

We will be uploading our fine-tuned model to the HuggingFace Hub to make model inference easy and efficient. Model inference is the process of running live data through a machine learning model to generate an output. Generally, model weights are very heavy, making it difficult to upload them to free cloud services (free-tier quotas!), and the paid ones are too heavy on the pocket! (sigh)

Enter HuggingFace Hub as the saviour!

Built and maintained by the pioneers of the NLP industry, Hugging Face, it has made the lives of AI practitioners easier by providing Git-based repositories to upload model files. Anyone can share and explore models on the HuggingFace Hub. Currently, the Hub supports three different types of repositories:

  1. Model Files
  2. Datasets
  3. Spaces (Demo apps)

These repositories provide all the facilities of a version control system making it easier to manage different versions of models. You can create different branches as well for different versions of models. Just like the GitHub repository, you have the option to make your models either public or private. Other advantages of using HuggingFace Hub include:

  • The uploaded models are supported for inference tasks in different frameworks like TensorFlow, PyTorch, etc.
  • Metadata related to models can be added to the model cards (like Readme.md in GitHub repositories)

Uploading models to HuggingFace Hub can be done in three different ways:

  1. Using Web Interface
  2. Using CLI
  3. Using Python scripts

Method one is the easiest way of doing it and is beginner-friendly as well. Hence, we will be following it. Let’s get our hands dirty!!

Create the model repository

Head over to https://huggingface.co/new to create a model repository.

  • Provide a model name — Be creative, this will be the unique identity of the model
  • Choose model owner — It can be either you or any organization you are a part of
  • Select license — You can choose the license under which you want to serve your model like Creative Commons, GPL, BSD, etc. Refer to Licenses for more information.
  • Visibility of model — Choose whether your model will be public or private
Fig. Create a new model repository (Image by Author)

Upload model weights and other files to the repository

  • Add model weights, tokenizers, etc to the repository using the “Add File” button under the “Files and versions” tab.
    – Browse and select all the files from your computer for uploading
    – Add an appropriate commit message and click on the “Commit changes” button.
Fig. Model repositories with files (Image by Author)
  • If you click on the “Use in Transformers” button on the top-right, a pop-up will open and you can copy the code snippet for importing the model using the transformers library.
Fig. Pop-up of “Use in transformers” (Image by Author)
  • You can import the model in any Python script, and the model weights will be downloaded when you execute the script. So, while building the API, we can directly use this Hub model identifier for inference instead of shipping the model weights; a sketch of the import snippet follows below.
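For reference, the copied snippet is typically along these lines; the model identifier below is a placeholder for your own repository name:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Placeholder Hub identifier: replace with the repository you created
tokenizer = AutoTokenizer.from_pretrained("your-username/your-model-name")
model = AutoModelForCausalLM.from_pretrained("your-username/your-model-name")
```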

Update the model card

The model card allows you to share information about the uploaded model weights with the public.

  • Click on the “Edit model card” button under the Model card tab.
Fig. Edit model card button UI (Image by Author)

The Markdown syntax used in a README.md can be used in the model card as well. At the beginning of the model card, metadata for the model can be specified. A sample of this can be found below:

Fig. Metadata sample (Image by Author)
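A representative sample of such metadata, placed between --- markers at the very top of the model card, looks like the following; the license and dataset values here are illustrative, not necessarily the ones used for this model:

---
language: en
license: mit
tags:
- text-generation
datasets:
- <dataset-name>
---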

You can learn more about creating good model cards from the free course offered by Hugging Face: Building a Model card.

That’s it! We are done with Step 1 of model deployment. You can find my uploaded model at Quote-Generator. If you visit this page, you will be able to see the download trend of the model as well as a Hosted Inference API widget for testing the model in the browser itself by providing input text prompts.

Fig. Download trend and Inference API (Image by Author)

As a deliverable at the end of this article, we are aiming to build something similar to “Hosted inference API” without the pretty form UI.

Creating an API using FastAPI

Understanding API

API stands for Application Programming Interface: an interface that lets your application communicate with other software with the help of some pre-defined protocols. Developers don’t have to know or understand how the software behind the API works, or how it has been implemented, in order to use it. All they need to understand is which endpoints the API exposes for different services and what input parameters need to be passed to them. Many organizations make their products available through APIs (both free and paid) that can be consumed from any development framework.

For example, applications that use Google Translate rely on an API developed by Google and offered on a subscription basis. Their developers don’t know how Google Translate works or how it has been built by Google engineers.

When it comes to integrating machine learning models into applications, APIs are the way to go. Engineering teams build APIs to serve machine learning models, and calls to these APIs are added to the main application. The response payload from these APIs is then used for decision-making or for returning output to the user.

Many frameworks can be used to build APIs. Some of the most popular frameworks include:

  • Flask
  • Django
  • FastAPI

For this project, we will be building API using FastAPI.

Introduction to FastAPI

FastAPI is a fast, high-performance, and easy way of creating APIs in Python for deployment. Developed by Sebastián Ramírez in 2018, FastAPI gained popularity among tech giants in a short span of time. Currently, organizations like Microsoft and Netflix use this framework to build APIs.

You can create a production-level API with just a few lines of code. The library is well-documented, so if you get stuck anywhere while building APIs, don’t worry! FastAPI documentation has got you covered. Another highlight of using FastAPI that distinguishes it from other API frameworks is the automatic and interactive API documentation for testing. One does not have to learn how to use Postman to test the API.

FastAPI provides you with two options to call and test your API directly from the browser: Swagger UI & ReDoc.

Going into more technical details, FastAPI works on an ASGI web server that supports asynchronous requests allowing the system to scale without much difficulty. This functionality has been built using Starlette. To know more about Starlette, you can refer to the official documentation. This is another feature that sets FastAPI apart from its WSGI counterparts which only support synchronous requests.

For data validation and management, FastAPI uses the Pydantic library, meaning the response object returned from an API request can be saved directly to the database. The validation of data is done by Pydantic automatically. You can use the built-in data validators or define your own custom validators as well. You can learn more about this from the official documentation.

Step-By-Step Guide: Creating your first FastAPI

Getting Started: Installation

To build an API using FastAPI, you need to install two modules:

Fig. 10 Module Installations (Image by Author)
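The two modules are fastapi (the framework itself) and uvicorn (the ASGI server used to run the API). Assuming a standard pip setup, the installation looks like this:

pip install fastapi
pip install uvicorn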

Getting Started: Creating Script

Fig. Folder Structure (Image by Author)

Create a file called “main.py” in the project folder. This is the Python script where we will write the API’s code.
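A minimal starting point for main.py looks like this sketch:

```python
from fastapi import FastAPI

# Create the FastAPI instance used to define endpoints and run the server
app = FastAPI()
```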

As you can see in the above code block, we have imported and created an instance of the FastAPI class. FastAPI class has all the functionalities that developers need to build an API and hence, the instance “app” of this class will be used everywhere in the script for interaction with the web server. In fact, at the time of running the API, we use this instance variable in the command.

Now we have the instance of FastAPI, what should we do next? Create endpoints for API.

API Endpoints

The most obvious question now is what is an endpoint?

In simple words, we will create a URL path and define a function that will be triggered in the backend whenever a user visits that URL. The triggered function can perform any action; as simple as sending a simple text to the web server or as complex as saving the data in a linked database.

There are multiple methods defined by the HTTP protocol that can be used to perform different actions. The most commonly used methods are:

  • GET — Read the data
  • POST — Create the data
  • PUT — Update the data
  • DELETE — Delete the data

With every endpoint, we have to specify the HTTP method that will be used to communicate with the server.

Create Testing API Endpoint

Let’s start with creating a testing endpoint first.
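A minimal testing endpoint might look like the following sketch (the returned message is illustrative):

```python
@app.get("/")
def root():
    # Returns a simple JSON response when the home page URL is visited
    return {"message": "Hello World"}
```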

In the above code block, @app.get(“/”) can be broken into the following components:

  • @app: It is used as a decorator that makes an ordinary Python function respond to a web request. Here, app is the FastAPI class instance that we created. If you are interested in understanding how decorator functions are different from ordinary Python functions, you may check out Python Decorators. The concept has been explained in detail with examples.
  • get(“/”): Here get is the method and the / inside parenthesis is the URL. Since there is nothing after /, it means the home page. You can add pathname as well if you wish after /. E.g., @app.get(‘/home’)
  • root(): Python function that will be triggered in the backend when the user visits the above URL.
  • return {key:value}: API usually returns JSON/dictionary-like object as a response.

To run this API on your local system, you need to execute:

uvicorn main:app --reload

To break this down further,

  • main: This is the name of the python file
  • app: This is the FastAPI class instance we created
  • --reload: This makes the server reload automatically while it is running whenever any code changes are made

When you execute the above command, you should be able to see something like shown below:

Fig. Running the API (Image by Author)

If you follow the URL in the output, you will be able to see something like shown below:

Fig. API Demo (Image by Author)

Create a class for the Quote Generator model

QuoteGenerator Class

To integrate our Quote Generator model into this API, we will first define a custom class whose instance methods will accept the user prompt and generate the quote using our model.
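A sketch of such a class is shown below; the Hub model identifier and the default prompts are placeholders to be replaced with your own:

```python
import random

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline


class QuoteGenerator:
    def __init__(self):
        # Text-generation pipeline; initialised later in load_generator()
        self.quote_generator = None
        # Placeholder Hub identifier: replace with your own uploaded model
        self.tokenizer = AutoTokenizer.from_pretrained("your-username/your-model-name")
        self.model = AutoModelForCausalLM.from_pretrained("your-username/your-model-name")
        # Fallback prompts used when the user does not provide one (illustrative values)
        self.default_prompts = ["life is", "happiness is", "the most important thing"]
```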

As you can see in the above code block, I have created a class called “QuoteGenerator”. In the constructor of the class, the following instance variables have been defined:

  • quote_generator: This is a transformer pipeline for the task of text generation called TextGenerationPipeline. There are many advantages of using transformer pipelines.
  • tokenizer: We are downloading the tokenizer of the model from the HuggingFace hub. You can replace this with your model name.
  • model: We are downloading the model weights from the HuggingFace hub. You can replace it with your model name.
  • default_prompts: In case the user does not provide a prompt for generating the quote, we will randomly choose a prompt from this list.

Pipeline Objects: Transformer Library

We have then defined an instance method called load_generator where we initialise our text-generation pipeline. The first parameter, ‘text-generation’, defines the task to be performed. The next parameters are the class variables self.model and self.tokenizer, which were initialized in the constructor of the class with the model weights and tokenizer respectively.
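A sketch of that method, continuing the QuoteGenerator class defined above:

```python
    def load_generator(self):
        # "text-generation" selects the TextGenerationPipeline task;
        # the model and tokenizer were initialised in the constructor
        self.quote_generator = pipeline(
            "text-generation", model=self.model, tokenizer=self.tokenizer
        )
```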

The pipeline objects of the transformers library make it very easy to infer a model for a variety of NLP tasks such as text classification, named entity recognition, summarization, etc. The steps of preprocessing the input, inference by model and postprocessing of the output have been clubbed into one single function. So you need not write any complex code for it; just use pipeline functions.

Now that our pipeline, which accepts an input prompt and generates text from it, is ready, let us write the instance methods that will utilize it.
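Continuing inside the QuoteGenerator class, here is a sketch of these two methods; the default hyperparameter values and the use of do_sample=True are illustrative assumptions:

```python
    def clean_text(self, text):
        # Basic cleanup: collapse newlines and extra whitespace
        return " ".join(text.replace("\n", " ").split()).strip()

    def generate_quote(self, prompt, min_length=10, max_length=50, top_p=0.95):
        # Fall back to a random default prompt if the user sent an empty string
        if not prompt or not prompt.strip():
            prompt_start = self.default_prompts[
                random.randint(0, len(self.default_prompts) - 1)
            ]
        else:
            prompt_start = prompt

        # Light cleanup of the prompt before feeding it to the pipeline
        prompt_start = self.clean_text(prompt_start)

        # Run the pipeline with sampling hyperparameters (values are illustrative)
        output = self.quote_generator(
            prompt_start,
            min_length=min_length,
            max_length=max_length,
            do_sample=True,
            top_p=top_p,
            top_k=50,
            temperature=0.7,
        )

        # Index 0 because a single prompt was passed (pipelines also accept batches)
        generated_text = output[0]["generated_text"]
        return self.clean_text(generated_text)
```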

As you can see in the code snippet above, there is one clean_text function that can be used to perform basic cleaning of text before we input it to the pipeline function for the text generation task.

Hyperparameters of model

The function generate_quote is the main function where our pipeline runs to generate text. To walk you through the code: the function accepts some of the pipeline’s hyperparameters as parameters, which can be passed to tweak the default values.

  • min_length: The minimum length of the text that is to be generated by the pipeline.
  • max_length: The maximum permissible length of the text to be generated by the pipeline
  • top_p: This parameter is to set the threshold for the top p decoding sampling technique.

Apart from the hyperparameters that are accepted as function parameters, some other hyperparameters are also passed to the pipeline as follows:

  • top_k: This parameter is to set the threshold for the top k decoding sampling technique.
  • temperature: This is a hyperparameter to tune the sampling technique used by the pipeline to predict the next token for a given set of tokens.

While decoding (sampling) techniques are beyond the scope of this article to explain in detail, briefly: various algorithms have been proposed for selecting the token that should follow a given set of tokens. Decoding is one of the most important steps of the text-generation task and impacts the quality of the text generated by the language model. Some of the most popular decoding algorithms are:

  • Greedy approach
  • Beam search
  • Top K
  • Top P

To learn more about decoding algorithms, please refer to this wonderful resource: Decoding Strategies that you need to know for Response Generation

Passing the input text to model

We check whether the prompt sent by the user is an empty string. If it is, we use the random.randint() function to randomly select a prompt from the default_prompts list. The prompt stored in the prompt_start variable is then passed to our pipeline, along with the hyperparameters, for prediction.

In the next step, we extract the generated text from the model output. We access index 0 because we passed only a single text input to the pipeline; transformer pipelines are capable of processing text data in batches as well. The generated text is cleaned using the clean_text function and then returned.

Integrating FastAPI and the QuoteGenerator Class

We will now integrate the code into the FastAPI script we created earlier.

Extending the same code we wrote to create the test endpoint, we will first import the QuoteGenerator class and create an instance of it. We will then call the load_generator() function on that instance, which will download the model from the HuggingFace Hub and initialize the text-generation pipeline.

Just like the root endpoint we have created using the app decorator, we will create another endpoint for our API. The only difference is that this time we will use the POST method instead of the GET method as we will be sending the user text prompt to the server.

The Python function generate_quote_for_user accepts an input parameter called prompt, which comes with the web request. We finally call the generate_quote method on the previously created QuoteGenerator instance with the user prompt, and the generated text is returned by the function.
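Putting the pieces together, main.py might look roughly like the sketch below; the quote_generator module name and the response key are assumptions, so adapt them to your own project:

```python
from fastapi import FastAPI

# Hypothetical module name: adjust it to wherever QuoteGenerator is defined
from quote_generator import QuoteGenerator

app = FastAPI()

# Download the model/tokenizer from the HuggingFace Hub and build the pipeline once at startup
quote_generator = QuoteGenerator()
quote_generator.load_generator()


@app.get("/")
def root():
    return {"message": "Quote Generator API is running"}


@app.post("/generate_quote")
def generate_quote_for_user(prompt: str):
    # A bare str parameter arrives as a query parameter in FastAPI;
    # a Pydantic model could be used instead to accept a JSON body
    generated_quote = quote_generator.generate_quote(prompt)
    return {"quote": generated_quote}
```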

Testing the API

To test your API locally, run it using the following command: uvicorn main:app --reload

Then visit the interactive docs at the URL http://127.0.0.1:8000/docs. You might have to wait a bit, as the model and tokenizer will be downloaded when the API is executed for the first time. The above URL will take you to FastAPI’s Swagger UI for testing the API within the browser itself. You should be able to see a screen like the one shown below.

Fig. 14 Swagger UI for testing FastAPI (Image by Author)

If you click on the green tab with POST and /generate_quote endpoint, it will expand and you should be able to see something like shown below.

Fig. 15 Swagger UI: Testing an endpoint (Image by Author)

You can see there is a text field to accept the text prompt. It is disabled by default; you will have to click the “Try it out” button at the top right to enable it. You can then enter a prompt in the text field and click the “Execute” button to send a request to the API.

Fig. 16 Sending request to the API with text prompt (Image by Author)

You should be able to see the output of your model below in the Responses tab.

The text generated by the API looks quite apt: “The only way to be truly happy is to have a purpose”.

Fig. 17 The response of API (Image by Author)

Conclusion

The model is not perfect, of course. If we give it an input such as “life is not a”, the text generated by the model is “Life is not a word that is spoken in a language other than english.”. Syntactically and grammatically, the sentence is correct, but it doesn’t make sense. :P

At the same time, it can generate text as deep and meaningful as this “The most important thing you can do for a child is to teach them how to be human.” for the input “the most important thing”.

That would be all for this article. I hope you enjoyed reading it and were able to take away something worthwhile from it.

Happy Learning!

Until Next Time, Take care!

~ Nandini Bansal
