Make FastAPI CPU-bound Endpoints 2X Faster

FastAPI is built on top of Starlette, an ASGI framework, which makes it a natural fit for building async web services in Python.

Conventional wisdom says async code is only more efficient for IO operations like reading from a database, which covers the vast majority of web use cases. But you might be surprised to learn that in FastAPI, async functions perform better even for endpoints that do no IO at all.

Consider this application:

from fastapi import FastAPI

app = FastAPI()

@app.get("/nonasync")
def nonasync_endpoint():
    return "Hello Non-Async"

@app.get("/async")
async def async_endpoint():
    return "Hello Async"

The /nonasync endpoint is bound to a regular Python function, and the /async endpoint is bound to a coroutine function.
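If you want to see the distinction the framework acts on, inspect.iscoroutinefunction from the standard library tells the two apart (roughly the check FastAPI and Starlette perform internally; their actual helper handles a few more cases):

import inspect

async def async_endpoint():
    return "Hello Async"

def nonasync_endpoint():
    return "Hello Non-Async"

print(inspect.iscoroutinefunction(async_endpoint))     # True
print(inspect.iscoroutinefunction(nonasync_endpoint))  # False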

I'm running the Uvicorn server with its default settings (a single worker, no auto-reload):

uvicorn main:app --port 3000

And load test using Apache HTTP server benchmarking tool:

$ ab -n 1000 -c 100 http://localhost:3000/async

Requests per second:    13190.52
$ ab -n 1000 -c 100 http://localhost:3000/nonasync

Requests per second:    7103.43

I ran these tests on an Intel i9-13900K CPU.

The async endpoint is 85% faster than non-async, even though there's no IO.

Async vs non-async comparison

Why?

We keep hearing that there's no benefit to async code when there's no IO involved. Why isn't that true here?

The answer is anyio.to_thread.run_sync. When the endpoint function is not a coroutine, FastAPI executes it through Starlette's run_in_threadpool, which ultimately calls anyio.to_thread.run_sync to run the blocking endpoint in a worker thread. That keeps the event loop free to run other tasks while the blocking call executes.
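Simplified, and with details that vary by version, the dispatch inside FastAPI looks roughly like this sketch of run_endpoint_function from fastapi/routing.py (not the verbatim source, just its shape):

from starlette.concurrency import run_in_threadpool

async def run_endpoint_function(*, dependant, values, is_coroutine):
    if is_coroutine:
        # async def endpoint: awaited directly on the event loop
        return await dependant.call(**values)
    # plain def endpoint: dispatched to a worker thread via
    # starlette.concurrency.run_in_threadpool -> anyio.to_thread.run_sync
    return await run_in_threadpool(dependant.call, **values)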

Dispatching the function to a worker thread and waiting for it to finish adds overhead to every API call.
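You can isolate that dispatch overhead outside FastAPI with a small benchmark. This is a rough sketch using anyio directly (the same library Starlette delegates to); the absolute numbers depend on your machine:

import time

import anyio
import anyio.to_thread

def noop():
    pass

async def main():
    n = 10_000

    # Call the function directly on the event loop thread.
    start = time.perf_counter()
    for _ in range(n):
        noop()
    direct = time.perf_counter() - start

    # Dispatch each call to a worker thread, as FastAPI does for plain def endpoints.
    start = time.perf_counter()
    for _ in range(n):
        await anyio.to_thread.run_sync(noop)
    threaded = time.perf_counter() - start

    print(f"direct: {direct:.4f}s, via worker thread: {threaded:.4f}s")

anyio.run(main)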

Use More CPU

I added a busy loop so the endpoints behave like a real CPU-intensive task:

def cpu_intensive_task():
    # Burn CPU with pure Python work
    c = 0
    for _ in range(1_000_000):
        c += 1

@app.get("/nonasync")
def nonasync_endpoint():
    cpu_intensive_task()
    return "Hello Non-Async"

@app.get("/async")
async def async_endpoint():
    cpu_intensive_task()
    return "Hello Async"

And the result is even more surprising:

$ ab -n 1000 -c 100 http://localhost:3000/async

Requests per second:    31.47
$ ab -n 1000 -c 100 http://localhost:3000/nonasync

Requests per second:    15.56

The async endpoint handles about twice as many requests per second in this case.

Multi-worker

Out of curiosity, I ran Uvicorn with 32 workers since my CPU has 32 threads.

uvicorn main:app --port 3000 --workers 32

And the results are:

$ ab -n 1000 -c 100 http://localhost:3000/async

Requests per second:    454.22
$ ab -n 1000 -c 100 http://localhost:3000/nonasync

Requests per second:    367.90

The async endpoint is 23% faster this time. With more worker processes handling the same number of concurrent requests, each process needs far fewer threads (100 concurrent requests spread over 32 processes is only a handful per process). Fewer threads mean less context switching and less per-process overhead, which shrinks the cumulative cost of run_in_threadpool and narrows the gap.

Conclusion

Always define your FastAPI endpoints as async functions, whether they're CPU-bound or not, unless they perform blocking IO; in that case, a regular function lets FastAPI move the blocking call off the event loop.
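As a rule-of-thumb sketch (time.sleep stands in here for any blocking call that has no async driver):

import time

from fastapi import FastAPI

app = FastAPI()

@app.get("/compute")
async def compute():
    # CPU-bound, pure Python work: async def, so FastAPI awaits it directly
    # on the event loop with no thread-pool dispatch.
    return {"result": sum(range(1_000_000))}

@app.get("/report")
def report():
    # Blocking IO (time.sleep stands in for a blocking driver call): a plain
    # def, so FastAPI runs it in a worker thread and the event loop stays
    # free to serve other requests.
    time.sleep(0.1)
    return {"status": "done"}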