Let's walk through an example of how to scale a FastAPI application together !


Here we use FastAPI to show how easy it is to integrate gibbs into an asynchronous framework, but gibbs can be used from any asynchronous Python code !

Initial application

Let's take a simple example to see how we can scale with gibbs.

Say we have developed a great ML model. To keep this example simple, here is the code of a dummy model :

import time

class MyAwesomeModel:
    def __init__(self, wait_time=0.25):
        self.w = wait_time

    def __call__(self, x):
        time.sleep(self.w)  # simulate the processing time
        return x**2

This model simply returns the squared input, after simulating a certain processing time.

Now, having a model is great, but we want to make it available to our users. To do that, we create an API using FastAPI, serving that model. Here is the code :

import time

import uvicorn
from fastapi import FastAPI

class MyAwesomeModel:
    def __init__(self, wait_time=0.25):
        self.w = wait_time

    def __call__(self, x):
        time.sleep(self.w)  # simulate the processing time
        return x**2

# Instantiate the FastAPI app and our model
app = FastAPI()
model = MyAwesomeModel()

# Define a route that will call our model and return the result
@app.get("/request")
async def simple_request(x: int):
    return {"result": model(x)}

if __name__=="__main__":
    # Run the app (listen on all interfaces, port 8000)
    uvicorn.run(app, host="0.0.0.0", port=8000)

You can run this python script and access http://localhost:8000/docs to try the route by yourself.

Great ! We are serving our awesome model !

The scaling issue

This code is great, but it does not scale.

Because our model takes 250ms to deal with every request, and requests are processed one after another, you can imagine what happens when 10 clients send one request at the same time... One of the clients will have to wait 2.5s before receiving a response !
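
To see why the requests end up queued, here is a minimal, self-contained sketch. It is not part of the application: the names `blocking_handler` and `run_concurrently` are made up for illustration, and the wait time is shortened to 50ms. It assumes the model call is blocking (like `time.sleep`), in which case it holds the event loop, so handlers launched together still run one after another :

```python
import asyncio
import time

WAIT_TIME = 0.05  # shortened stand-in for the model's 250ms (assumption)


async def blocking_handler(x):
    # time.sleep blocks the whole event loop, like a synchronous model call
    time.sleep(WAIT_TIME)
    return x**2


async def run_concurrently(n):
    # Launch n handlers "at the same time" and measure the total duration
    t0 = time.perf_counter()
    results = await asyncio.gather(*(blocking_handler(i) for i in range(n)))
    return results, time.perf_counter() - t0


if __name__ == "__main__":
    results, elapsed = asyncio.run(run_concurrently(10))
    print(results)
    print(f"elapsed: {elapsed:.3f}s")
```

The elapsed time grows linearly with the number of handlers, which is the behavior the next script measures against the real API.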

You can try this out by starting our simple app, and in another terminal, run the following script :

import multiprocessing as mp
import time

import requests

def req_process(i):
    r = requests.get(f"http://localhost:8000/request?x={i}")
    assert r.status_code == 200
    return r.json()

def time_parallel_requests(n):
    with mp.Pool(n) as p:
        t0 = time.time()
        p.map(req_process, range(n))  # send the n requests in parallel
        t1 = time.time()

    return t1 - t0

if __name__ == "__main__":
    t = time_parallel_requests(10)
    print(f"It took {t:.3f}s to process 10 requests")

This script simply runs 10 requests in parallel and prints the time needed to complete all of them. And as expected :

It took 2.532s to process 10 requests

How gibbs works

What we want is simply to have a pool of several models, and when one model is busy dealing with a request, instead of waiting for it to finish, we want to call another (idle) model.

That way, we can deal with several requests in parallel, and therefore serve several clients with low latency !

To achieve this, gibbs introduces 2 classes :

  • Hub
  • Worker

The Worker class is just a process, dealing with requests sequentially by calling the awesome model you created.

The Hub is the class that orchestrates the requests, sending each request to a worker that is currently idle.


You can see a more detailed description of how this works in Architecture.
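
To make the idea concrete, here is a toy, in-process sketch of the same pattern. It is not how gibbs is implemented: gibbs workers are separate processes, while this sketch uses asyncio tasks and a shared queue. `ToyHub`, `toy_worker` and `demo` are hypothetical names, and the processing time is shortened to 50ms :

```python
import asyncio
import time

WAIT_TIME = 0.05  # shortened stand-in for the model's processing time


async def toy_worker(queue):
    # Handle requests one at a time, like a single model instance
    while True:
        x, future = await queue.get()
        await asyncio.sleep(WAIT_TIME)  # simulated processing
        future.set_result(x**2)
        queue.task_done()


class ToyHub:
    def __init__(self, n_workers):
        self.n_workers = n_workers
        self.queue = None
        self.tasks = []

    async def start(self):
        # Idle workers wait on the queue; the first free one takes the next request
        self.queue = asyncio.Queue()
        self.tasks = [
            asyncio.create_task(toy_worker(self.queue))
            for _ in range(self.n_workers)
        ]

    async def request(self, x):
        # Enqueue the request and wait for a worker to fill in the result
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((x, future))
        return await future

    async def stop(self):
        for task in self.tasks:
            task.cancel()
        await asyncio.gather(*self.tasks, return_exceptions=True)


async def demo(n_requests=10, n_workers=4):
    hub = ToyHub(n_workers)
    await hub.start()
    t0 = time.perf_counter()
    results = await asyncio.gather(*(hub.request(i) for i in range(n_requests)))
    elapsed = time.perf_counter() - t0
    await hub.stop()
    return results, elapsed


if __name__ == "__main__":
    results, elapsed = asyncio.run(demo())
    print(results)
    print(f"elapsed: {elapsed:.3f}s")
```

With 4 toy workers, the 10 requests are handled in roughly 3 rounds of processing instead of 10.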

Use gibbs to scale up

Let's see how to modify our simple app to scale up.

We simply have to create a Hub and use it to send requests, and start a few workers with our awesome model !

import time

import uvicorn
from fastapi import FastAPI

from gibbs import Hub, Worker

class MyAwesomeModel:
    def __init__(self, wait_time=0.25):
        self.w = wait_time

    def __call__(self, x):
        time.sleep(self.w)  # simulate the processing time
        return x**2

# Instantiate the FastAPI app and the Hub
app = FastAPI()
hub = Hub()

# Define a route that will call our model and return the result
@app.get("/request")
async def simple_request(x: int):
    return {"result": await hub.request(x)}

if __name__=="__main__":
    # Start the workers (in another process)
    workers = [Worker(MyAwesomeModel) for _ in range(4)]
    for w in workers:
        w.start()

    # Run the app (listen on all interfaces, port 8000)
    uvicorn.run(app, host="0.0.0.0", port=8000)

Quite simple, right ?

Now, if we use the same script as before to run 10 requests in parallel in another terminal :

It took 0.855s to process 10 requests

The time needed to deal with 10 requests is greatly reduced, by sharing the work between the 4 workers !
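
As a rough sanity check of the two timings measured in this walk-through, here is a back-of-the-envelope estimate. It assumes every request takes exactly 250ms, that requests are spread evenly over the workers, and it ignores network and inter-process overhead. `expected_latency` is a hypothetical helper, not part of gibbs :

```python
import math


def expected_latency(n_requests, n_workers, wait_time=0.25):
    # Each worker handles its requests sequentially, so the slowest client
    # waits for ceil(n_requests / n_workers) rounds of processing
    rounds = math.ceil(n_requests / n_workers)
    return rounds * wait_time


print(expected_latency(10, 1))  # single model: 2.5
print(expected_latency(10, 4))  # 4 workers: 0.75
```

The measured 2.532s and 0.855s are both close to these estimates, the difference being the overhead that the estimate ignores.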