Creating a Transformer-Based Web Application Firewall

In this blog, we will build a Transformer-based Web Application Firewall from scratch using Docker, Nginx, FastAPI, and HuggingFace.

Firewalls are network security devices that monitor and control incoming and outgoing traffic. They are often used in cybersecurity to block, allow, or restrict access. Think of a firewall as a security guard standing at the main gate of your society: he keeps track of who enters. Interesting, right? In this post, I want to share my experience of creating a web application firewall.

Cyberattacks are unpredictable, and hackers adapt quickly to new security protocols. We need a firewall that is as smart as the attackers, something that can predict the next move and block it. And in today's world, whenever we think of prediction, the first thing that comes to mind is machine learning. So, we are going to create a transformer-based firewall for web applications.

But why transformers? Simple: they generalize rather than memorize. Self-attention helps them learn concepts more accurately, and their adaptability and proven accuracy make them a great fit for cybersecurity.

I will try to explain the reasoning behind every step and give the code as we go, so that you can focus on trying it yourself rather than just copy-pasting. If you need more explanation on any topic, feel free to mail me at priyanshi9085@gmail.com



Getting started

The basic flow of the architecture will look like:

Client Request
     ↓
Nginx (auth_request)
     ↓
ML Model (FastAPI)
     ↓
Allow (2xx) → Forward to App
Block (403) → Stop request

We will integrate Nginx as a reverse proxy and security gate to handle the traffic effectively, and map the app's port to Nginx using Docker.

Tech stack

  • Docker
  • Nginx
  • Transformers
  • Hugging Face
  • Google Colab (for training the model)
  • FastAPI
The web application we will be using is the OWASP Juice Shop. This app is intentionally vulnerable and is commonly used for learning attack and defense techniques. Have a look at its structure: OWASP Juice Shop.

So, before we start make sure you have all the following things installed on your system:
  1. Python >3.13: Download Python | Python.org
  2. Docker desktop: Windows | Docker Docs 

Setting up the App container on Docker

  1. Open the terminal and run the Juice Shop container in detached mode (-d); otherwise the container will run in the foreground and your terminal will fill up with its logs. Let me break the command down:
    1. Docker commands start with 'docker'.
    2. To run a container, we use 'run'.
    3. Detached mode is '-d'.
    4. Docker assigns random default names like 'nostalgic_curie' or 'interesting_turing', so give it a simple name using '--name {name of the container you like}'.
    5. Port mapping is done with '-p' and looks like host_port:container_port. For the app, we map the same ports: '3000:3000'.
    6. The image name for Juice Shop is bkimminich/juice-shop (format: author/image). Images are stored on Docker Hub, which is like GitHub for Docker images.
//run this in your terminal

docker run -d --name juice-shop -p 3000:3000 bkimminich/juice-shop

  2. The above command starts the Juice Shop container on port 3000. You can verify at http://localhost:3000.
This sets up the Juice Shop app, completing the first step of the project.

Training the model

If your computer has a dedicated GPU or a powerful CPU, you can train locally. But I would personally recommend Google Colab for better speed and efficiency.

Since the integration of Nginx, Docker, and a Transformer adds latency, we will use distilroberta-base for this project. You can find a dataset such as CSIC 2010 on the internet, or generate one yourself. I generated my own because CSIC 2010 is an old dataset, and we want to train the model against advanced hackers and their tactics.
Before proceeding to training, clean the data into a CSV with two columns: processed_log (the users' logs, cleaned and normalized) and classification (0 for normal, 1 for anomalous).
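As an illustration, here is a minimal cleaning sketch. The exact normalization rules (URL-decoding, lowercasing, whitespace collapsing) and the sample rows are my own assumptions; adapt them to whatever your raw logs look like:

```python
from urllib.parse import unquote_plus

import pandas as pd

def normalize_log(raw: str) -> str:
    """URL-decode, lowercase, and collapse whitespace so every
    request reaches the tokenizer in a consistent form."""
    decoded = unquote_plus(raw)
    return " ".join(decoded.lower().split())

# Tiny illustrative sample: 0 = normal, 1 = anomalous
rows = [
    ("GET /search?q=shoes Mozilla/5.0", 0),
    ("GET /search?q=%27%20UNION%20SELECT%201--%20 sqlmap/1.7", 1),
]
df = pd.DataFrame(
    [(normalize_log(raw), label) for raw, label in rows],
    columns=["processed_log", "classification"],
)
df.to_csv("dataset.csv", sep="|", index=False)  # pipe-separated, as loaded later
```

Whatever rules you choose, apply the same normalization at inference time, or the model will see inputs it was never trained on.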
I used PyTorch for training. Keep the tokenizer's max_length parameter at 256 and truncation=True.
Remember to store the artifacts and push them to the Hugging Face Hub for the next step.

# Google colab


from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer, DataCollatorWithPadding
)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score  # used in compute_metrics below
from datasets import Dataset
import numpy as np

#path where the model will be saved.
model_path = './waf_roberta'

#load tokenizer for distilroberta-base (converts text -> token)
tokenizer = AutoTokenizer.from_pretrained('distilroberta-base')

# Load the pretrained model
model = AutoModelForSequenceClassification.from_pretrained(
    'distilroberta-base',
    num_labels = 2  # for binary classification (0 and 1)
)


#Creating a function to tokenize each input sample
def tokenize_function(example):
    return tokenizer(
        example["text"], #input text column
        truncation = True, #to reduce the lengths of inputs
        padding = True,     #pad shorter tokens for batching
        max_length = 256    #max token length (important for latency)
    )

#Load the dataset (mine is pipe-separated, yours can have other separators.)
df = pd.read_csv('dataset.csv', sep = "|")

#Splitting the data for training (80%) and testing (20%)
x_train, x_test, y_train, y_test = train_test_split(
    df['processed_log'],
    df['classification'],
   test_size= 0.2,
    random_state = 42, # for reproducibility
    stratify=df['classification'] # maintain class balance
)



#Convert the data to huggingface format
df_train = Dataset.from_dict({
    "text":list(x_train), #input text
    "label":list(y_train)   #corresponding labels
}).map(tokenize_function, batched = True) #calling the tokenizer function.


df_test = Dataset.from_dict({
    "text":list(x_test),
    "label":list(y_test)
}).map(tokenize_function, batched = True)

#Set the dataset format for pytorch (required by Trainer)
df_train.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
df_test.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])



# Function to evaluate model.
def compute_metrics(eval_pred):
    logits, labels = eval_pred #model outputs and true results
    preds = np.argmax(logits, axis = 1) #choose class with highest probability
    return {
        "accuracy": accuracy_score(labels, preds), #overall correctness
        "f1_score": f1_score(labels, preds, average = 'macro')
                                            #balance between precision and recall
    }

# Handles dynamic padding during training (efficient batching)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Training configuration
training_args = TrainingArguments(
    output_dir="./waf_roberta", #Store the model in this folder
    per_device_train_batch_size=8, #batch size for training.
    per_device_eval_batch_size=8, #batch size for testing
    num_train_epochs=2, #number of passes
    learning_rate=5e-5, #learning rate
    weight_decay=0.01, #regularization to avoid overfitting
    logging_steps=10000, #frequency of logging
    eval_strategy="epoch", #evaluate after each epoch
    save_strategy="epoch", #save after each epoch
    load_best_model_at_end=True,
    metric_for_best_model="f1_score", #select the best model based on F1 score
    push_to_hub=True #ensure that you have logged in using login() to hub
)


# Trainer API simplifies training loop, evaluation, and logging
trainer = Trainer(
    model=model,                      # model to train
    args=training_args,               # training configuration
    train_dataset=df_train,           # training data
    eval_dataset=df_test,             # validation data
    compute_metrics=compute_metrics,  # evaluation metrics
    data_collator=data_collator       # padding strategy
)


# Start training process
trainer.train()


# Save trained model locally
trainer.save_model(model_path)

# Save tokenizer (needed during inference)
tokenizer.save_pretrained(model_path)

Comments are there for explanation; I hope they are enough for this project. For a deeper understanding, refer to the Transformers and PyTorch docs.

Next step: check the accuracy score.

from sklearn.metrics import classification_report, confusion_matrix

predictions = trainer.predict(df_test)
logits = predictions.predictions
y_true = predictions.label_ids
y_pred = np.argmax(logits , axis = 1)

print(classification_report(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))


In cybersecurity, if false positives are a bad user experience, then false negatives are a threat. Correct the dataset if any metric falls short of your benchmark.
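A small helper (my own sketch, not part of the project code) can turn the confusion matrix into the two rates that matter here:

```python
from sklearn.metrics import confusion_matrix

def error_rates(y_true, y_pred):
    """Break the confusion matrix into a WAF's two failure modes.
    Assumes both classes (0 and 1) appear in the data."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "false_positive_rate": fp / (fp + tn),  # benign requests blocked (bad UX)
        "false_negative_rate": fn / (fn + tp),  # attacks let through (the threat)
    }
```

Run it on y_true and y_pred from the evaluation above, and retrain or rebalance the dataset if the false-negative rate is above your tolerance.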
This step is the most important and the hardest one. Now we want the trained model to run whenever someone navigates to or within the app. Let's move to the next step: building the API with FastAPI.


API Building

Okay developers, I know you have been missing your code editor; time to open it.
Create a function that runs a log through the trained model and returns the malicious probability and a prediction (1 if the malicious probability is above 0.5, otherwise 0).

Install pytorch, transformers, fastapi and uvicorn.

# process.py



import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained(
    "priyanshisalujaaa112/waf_roberta"
)

tokenizer = AutoTokenizer.from_pretrained("priyanshisalujaaa112/waf_roberta")

# change the path to yours. Don't cheat!

model.eval()

def process(log):
    inputs = tokenizer(
        log,
        return_tensors="pt",
        truncation=True,
        padding=False,
        max_length=256
    )

    with torch.no_grad():
        outputs = model(**inputs)
        probs = torch.softmax(outputs.logits, dim=1)

    malicious_prob = probs[0][1].item()
    prediction = int(malicious_prob > 0.5)

    return {
        "prediction": prediction,
        "malicious_probability": malicious_prob
    }


Now we can write the /ml_check API, which will call process() to check each request.

# app.py


from fastapi import FastAPI, Request, Response
from process import process

app = FastAPI()

def transformer_predict(data):
    prob = process(data)
    if (prob['malicious_probability'] > 0.92):
        return 403
    else:
        return 200
   


@app.api_route("/ml_check", methods=["GET", "POST", "PUT", "DELETE", "PATCH"])
async def ml_check(request: Request):
    method = request.headers.get("x-original-method", "")
    path = request.headers.get("x-original-uri", "")
    user_agent = request.headers.get("user-agent", "")

    processed_log = f"{method} {path} {user_agent}"
    status = transformer_predict(processed_log)
    return Response(status_code=status)


Adjust the processed_log construction to match the log format your model was trained on.
This completes the API. The next step is the 'hero step' of this project: integrating Nginx with FastAPI and Docker.

Integration of Nginx

Nginx (pronounced 'engine-x', I know you were pronouncing it wrong) acts as the security gate: it forwards all traffic to the API (/ml_check) and acts on the returned status code.
So, we have two tasks:
  1. Forward the Juice Shop traffic through Nginx.
  2. Integrate the API with Nginx.
Create an nginx.conf file where we will write the Nginx configuration.

//nginx.conf

# Events block: controls how Nginx handles connections (low-level tuning)

events {
    # For now, we are not defining anything here
    # (can be used later for worker_connections, etc.)
}  


# HTTP block: contains all web server configurations
http {

    server {
        # Nginx will listen on port 80 (default HTTP port)
        listen 80;        


        # Main route: handles all incoming user requests
        location / {

            # Before forwarding request, call ML check endpoint
            # This acts like a "security gate"
            auth_request /ml_check;

            # If ML check passes → forward request to actual application (Juice Shop)
            proxy_pass http://host.docker.internal:3000;
        }


        # Internal endpoint for ML-based validation
        location /ml_check {

            # Prevent direct access from users (security measure)
            internal;

            # Forward request to FastAPI ML service
            proxy_pass http://host.docker.internal:8000/ml_check;


            # Do NOT send request body to ML model (only headers needed)
            proxy_pass_request_body off;

            # Clear Content-Length since body is removed
            proxy_set_header Content-Length "";


            # Send original request info to ML model for analysis

            # Full request URI (e.g., /login?user=admin)
            proxy_set_header X-Original-URI $request_uri;

            # HTTP method (GET, POST, etc.)
            proxy_set_header X-Original-Method $request_method;

            # User-Agent (browser/client info)
            proxy_set_header User-Agent $http_user_agent;
        }
    }
}
 

And done! The shorter the code, the deeper it is (not always). The next step is to mount this configuration file into a Docker container and run Nginx.
The command is similar to the Juice Shop one, i.e. docker run -d --name waf-nginx -p 80:80 nginx. This time the port is 80, the default HTTP port. We also want Nginx inside the container to read our nginx.conf file, so we mount it as a volume. The syntax is current_path:container_path.
//in your terminal opened in the location of nginx.conf

docker run -d --name waf-nginx -p 80:80 -v $(pwd)/nginx.conf:/etc/nginx/nginx.conf:ro nginx  
 

Testing

We have finally reached the final step of our project. Let's test what we have built.
Do these three steps every time you want to test the project:
    1. Run the juice-shop container
    2. Run the waf-nginx container
    3. Run the backend (uvicorn)
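Concretely, the three steps look something like this. The container names match the ones used earlier, and the uvicorn command assumes app.py is in your current directory (port 8000, which is where nginx.conf expects the backend):

```shell
# 1. (Re)start the Juice Shop container from earlier
docker start juice-shop

# 2. (Re)start the Nginx container with the mounted nginx.conf
docker start waf-nginx

# 3. Run the FastAPI backend on port 8000
uvicorn app:app --host 0.0.0.0 --port 8000
```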
 
Open http://localhost and you will see the app running normally. Check your backend terminal; you will see 2xx status codes.


Now change the URL to something like "http://localhost/?q=' UNION 1=1 --", or test from your terminal with 'curl "http://localhost/?q=' UNION 1=1 --"'. You will see a 403 Forbidden error.





Conclusion

Building this project gave me a deeper understanding of how modern cybersecurity systems can go beyond traditional rule-based approaches. Instead of relying only on predefined signatures, we introduced a learning-based system that can adapt to patterns in real traffic.

By combining Nginx as a reverse proxy, Docker for containerization, FastAPI for inference, and a Transformer model for classification, we created a modular and scalable Web Application Firewall. Each component has a clear responsibility — Nginx handles traffic, the ML model makes decisions, and the application serves the user.

One important realization during this project was the trade-off between security and performance. Adding an ML layer increases latency, which is why we used a lighter model like DistilRoBERTa and limited input size. In real-world systems, further optimizations such as caching, batching, or asynchronous processing would be necessary.

This project is just a starting point. There are many ways to improve it:

  • Expanding the dataset with more diverse and real-world attack patterns

  • Reducing false positives and false negatives through better training

  • Deploying the system on cloud infrastructure for scalability

  • Introducing real-time logging and monitoring dashboards

At its core, this project demonstrates how machine learning and system design can come together to build smarter security systems. As attackers evolve, our defenses must evolve too — and this is one step in that direction. You can get the complete code in my github repo: Priyanshi585-na/transformWAF.

By the way, I am ready for a star too, if you like it.

If you made it this far, I encourage you to try building it yourself, experiment with the model, and break things along the way — that’s where the real learning happens.


