Background Jobs in NestJS with BullMQ: A Complete Walkthrough
Move slow, occasionally flaky work off the request thread: retries with backoff, durable status tracking, and idempotent re-enqueueing, built on NestJS, BullMQ, and PostgreSQL.
Senior Developer

Calling a slow, occasionally unreliable upstream API (an LLM provider, a payment processor, a third-party webhook) directly inside a request handler ties that request's fate to the upstream call's fate. If it's slow, the client waits on the full round trip. If it fails, the handler has to decide on the spot whether to fail the whole request or quietly fall back to something else, which is exactly the kind of decision that's easy to get wrong under pressure and hard to notice when it goes wrong silently.
This post builds a background job pipeline in NestJS using BullMQ: a request that enqueues work and returns immediately, a worker that processes it with automatic retries, and a durable record a client can poll for the outcome. It's demonstrated on an async draft-generation endpoint — the same shape as a real LLM-backed content pipeline — with a couple of details about BullMQ's actual runtime behavior that are easy to get wrong if you're going off the docs alone. The full project, tested end-to-end against real Postgres and Redis, is linked at the end.
The shape of the problem
A synchronous version of this endpoint looks simple:
@Post('drafts')
async create(@Body() dto: CreateDraftDto) {
  const content = await this.llmProvider.generate(dto.topic); // could take 5-30s, could fail
  return this.draftsRepository.save({ topic: dto.topic, content });
}
The request now blocks for as long as the LLM call takes, and a single transient failure (a timeout, a rate limit, a brief provider outage) becomes a failed request with nothing to show for it — no record that it was attempted, nothing to retry, nothing to inspect afterward.
The fix is to decouple "accept the request" from "do the work":
@Post('drafts')
create(@Body() dto: CreateDraftDto) {
  return this.draftsService.enqueue(dto); // returns in milliseconds
}
enqueue() writes a pending row and hands the job to BullMQ. A separate worker process picks it up, retries it automatically on failure, and updates that row when it's done — successfully or not.
Two sources of truth, on purpose
This implementation keeps two records of a job's state, deliberately:
BullMQ's own state, in Redis: which attempt it's on, when it'll retry next, its position in the queue. This is operational state, and it's normal for it to get cleaned up after a job finishes (removeOnComplete/removeOnFail).
A Postgres row, written by the application: pending → processing → completed/failed, plus the result or failure reason. This is what survives a Redis flush, what a client actually polls, and what you'd query for "show me every failed draft from last week."
import { Column, CreateDateColumn, Entity, PrimaryGeneratedColumn } from 'typeorm';

export enum DraftJobStatus {
  PENDING = 'pending',
  PROCESSING = 'processing',
  COMPLETED = 'completed',
  FAILED = 'failed',
}

@Entity()
export class DraftJob {
  @PrimaryGeneratedColumn('uuid')
  id: string;

  @Column()
  topic: string;

  @Column({ type: 'varchar', default: DraftJobStatus.PENDING })
  status: DraftJobStatus;

  @Column({ type: 'text', nullable: true })
  result: string | null;

  @Column({ type: 'text', nullable: true })
  failureReason: string | null;

  @Column({ default: 0 })
  attemptsMade: number;

  @CreateDateColumn()
  createdAt: Date;

  @Column({ type: 'timestamp', nullable: true })
  completedAt: Date | null;
}
Conflating these two — treating BullMQ's Redis-backed job as the only record — means losing history the moment a job gets cleaned up, and gives clients nothing stable to poll against.
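The status lifecycle on the Postgres side can also be guarded in code. A minimal sketch, assuming a hypothetical canTransition helper (not part of the original project) that the service's mark* methods could consult before writing. Allowing processing to re-enter processing is an assumption, made because markProcessing runs again on every retry:

```typescript
// Hypothetical guard for the pending -> processing -> completed/failed lifecycle.
// The status strings mirror the DraftJobStatus enum; the helper is illustrative.
type Status = 'pending' | 'processing' | 'completed' | 'failed';

const ALLOWED_TRANSITIONS: Record<Status, readonly Status[]> = {
  pending: ['processing'],
  // Assumption: a retried job re-enters 'processing', so the self-transition is allowed.
  processing: ['processing', 'completed', 'failed'],
  completed: [], // terminal
  failed: [], // terminal
};

function canTransition(from: Status, to: Status): boolean {
  return ALLOWED_TRANSITIONS[from].includes(to);
}
```

A guard like this mostly protects against late or duplicate writes, such as a stray update trying to move a completed row back to processing.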
Wiring BullMQ into the Nest app
Before any of that, BullMQ needs a Redis connection — configured once, at the root module:
// app.module.ts
BullModule.forRootAsync({
  inject: [ConfigService],
  useFactory: (configService: ConfigService) => ({
    connection: {
      host: configService.get<string>('REDIS_HOST'),
      port: configService.get<number>('REDIS_PORT'),
    },
  }),
}),
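One thing worth checking in a factory like this: values loaded from a .env file arrive as strings, and the type parameter on configService.get<number>() is a compile-time hint rather than a conversion, unless a validation or transform step is configured. A minimal sketch of parsing the connection settings explicitly, assuming a hypothetical parseRedisConnection helper and localhost:6379 fallbacks (neither comes from the original project):

```typescript
// Hypothetical helper: turns raw environment values into a typed
// { host, port } pair. The fallback values are assumptions for local development.
interface RedisConnection {
  host: string;
  port: number;
}

function parseRedisConnection(env: Record<string, string | undefined>): RedisConnection {
  const host = env.REDIS_HOST ?? 'localhost';
  const port = Number(env.REDIS_PORT ?? '6379');
  // Reject anything that is not a usable TCP port instead of handing it to the client.
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid REDIS_PORT: ${env.REDIS_PORT}`);
  }
  return { host, port };
}
```

Failing at startup on a malformed port is usually easier to diagnose than a connection error that surfaces later from the Redis client.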
Then, inside whichever feature module actually uses a queue, that queue gets declared by name:
// drafts.module.ts
BullModule.registerQueue({ name: 'draft-generation' }),
That string, 'draft-generation', is the thread tying three separate places together: it's what registerQueue declares here, what @InjectQueue('draft-generation') asks for in the service below, and what @Processor('draft-generation') listens on in the worker. All three have to use the exact same name — there's no compiler check enforcing that, since it's just a string. The two ways this can go wrong behave very differently, and it's worth knowing which is which rather than assuming:
If @InjectQueue doesn't match anything registerQueue declared, Nest's dependency injection fails loudly at startup with an UnknownDependenciesException. I confirmed this directly, and the error message names the exact missing provider token and suggests the fix. The app won't boot at all, so this mistake gets caught immediately.
If @Processor doesn't match, while @InjectQueue/registerQueue still agree with each other, the app boots without any error, jobs enqueue successfully, and then just sit in the queue forever. I verified this too: enqueued a job against a correctly-wired producer, waited, and checked the queue directly. The job's state was still waiting, indefinitely, with nothing logged anywhere to indicate why. No worker was ever listening on that name. This is the genuinely silent failure mode worth watching for, since nothing about it looks broken until someone notices a backlog that never drains.
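One way to narrow the room for both mismatches is to keep the name in a single constant and reference it from all three places. A minimal sketch: the constant name DRAFT_QUEUE and the findQueuesWithoutProcessor helper are hypothetical, not part of the original project, and the helper only compares lists of names you hand it rather than inspecting Nest or BullMQ:

```typescript
// Single source of truth for the queue name (hypothetical constant name).
const DRAFT_QUEUE = 'draft-generation';

// Intended usage, shown as comments because the decorators need the Nest packages:
//   BullModule.registerQueue({ name: DRAFT_QUEUE })
//   @InjectQueue(DRAFT_QUEUE)
//   @Processor(DRAFT_QUEUE, { concurrency: 5 })

// Illustrative check: given the names you register and the names your
// processors listen on, list registered queues that nothing would consume.
function findQueuesWithoutProcessor(
  registered: readonly string[],
  processors: readonly string[],
): string[] {
  const listening = new Set(processors);
  return registered.filter((name) => !listening.has(name));
}
```

With a shared constant, a typo becomes a compile error on an undefined identifier instead of a backlog that never drains.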
Enqueueing a job
The job's payload type is defined right alongside the service that creates it — enqueue() and the worker (below) both depend on this shape:
export interface DraftJobData {
  draftJobId: string;
  topic: string;
  simulateFailures: number;
}

async enqueue(dto: CreateDraftDto) {
  const draftJob = await this.draftJobsRepository.save(
    this.draftJobsRepository.create({ topic: dto.topic, status: DraftJobStatus.PENDING }),
  );
  await this.draftQueue.add(
    'generate',
    { draftJobId: draftJob.id, topic: dto.topic, simulateFailures: dto.simulateFailures ?? 0 },
    {
      jobId: draftJob.id,
      attempts: 3,
      backoff: { type: 'exponential', delay: 2000 },
      removeOnComplete: { age: 3600 },
      removeOnFail: { age: 86400 },
    },
  );
  return { id: draftJob.id, status: draftJob.status };
}
A few specific choices worth calling out:
jobId: draftJob.id — using the Postgres row's own id as the BullMQ job id, rather than letting BullMQ generate one, is what makes re-enqueueing idempotent (more on this below).
backoff: { type: 'exponential', delay: 2000 } — each retry waits roughly double the previous gap, rather than hammering a struggling upstream API at a fixed interval. With attempts: 3 there are only ever two such gaps to observe, not three — I measured them directly against this exact config rather than trusting the formula from memory: ~2035ms after the first failure, ~4009ms after the second, and no fourth attempt ever fires once those three tries are exhausted. The delay keeps doubling if you raise attempts higher, but at attempts: 3 specifically, a third gap is never reachable — that ceiling comes from attempts, not from backoff alone.
removeOnComplete/removeOnFail — without these, BullMQ keeps every job's data in Redis indefinitely. These ages keep Redis from growing unbounded while still leaving failed jobs around longer (a day, vs. an hour for successes) since they're more likely to need debugging.
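The backoff arithmetic above can be written out as a small function. This is a sketch that assumes the exponential strategy doubles from the base delay, which is consistent with the roughly 2s and 4s gaps measured above. The helper name retryGapsMs is made up for illustration and is not a BullMQ API:

```typescript
// Models the doubling schedule described above: the gap before retry n
// (1-based) is baseDelayMs * 2^(n - 1). With `attempts` total tries there
// are only attempts - 1 gaps, because nothing follows the final attempt.
function retryGapsMs(attempts: number, baseDelayMs: number): number[] {
  const gaps: number[] = [];
  for (let retry = 1; retry < attempts; retry++) {
    gaps.push(baseDelayMs * 2 ** (retry - 1));
  }
  return gaps;
}

const demoGaps = retryGapsMs(3, 2000); // [2000, 4000]
const totalBackoffMs = demoGaps.reduce((sum, gap) => sum + gap, 0); // 6000
```

The total is a useful lower bound on how long a client may have to poll before a job that exhausts its retries settles, not counting the time each attempt itself takes.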
The worker
import { OnWorkerEvent, Processor, WorkerHost } from '@nestjs/bullmq';
import { Job } from 'bullmq';

@Processor('draft-generation', { concurrency: 5 })
export class DraftGenerationProcessor extends WorkerHost {
  constructor(private readonly draftsService: DraftsService) {
    super();
  }

  async process(job: Job<DraftJobData>): Promise<string> {
    const { draftJobId, topic, simulateFailures } = job.data;
    await this.draftsService.markProcessing(draftJobId);
    // Throwing here is what tells BullMQ to retry, subject to `attempts`/`backoff`.
    return generateDraftContent(topic, job.attemptsMade, simulateFailures);
  }

  @OnWorkerEvent('completed')
  async onCompleted(job: Job<DraftJobData>, result: string) {
    await this.draftsService.markCompleted(job.data.draftJobId, result, job.attemptsMade);
  }

  @OnWorkerEvent('failed')
  async onFailed(job: Job<DraftJobData> | undefined, error: Error) {
    if (!job) return;
    const maxAttempts = job.opts.attempts ?? 1;
    if (job.attemptsMade >= maxAttempts) {
      await this.draftsService.markFailed(job.data.draftJobId, job.attemptsMade, error.message);
    }
    // else: still has retries left, BullMQ will reschedule it automatically
  }
}
@Processor(...) plus extending WorkerHost is the @nestjs/bullmq pattern for defining a worker — process() is the actual job handler, and @OnWorkerEvent hooks into BullMQ's lifecycle events. (The actual source file also logs each attempt via Nest's Logger, trimmed from the snippet above for readability — the logic shown is otherwise unchanged from what's in the repo.)
The detail that's easy to get backwards: attemptsMade
BullMQ's 'failed' event fires after every failed attempt, not just the last one. If you write the failure-handling logic without checking attempt count, a job that's about to succeed on its third try gets incorrectly marked failed after its first.
The fix is the job.attemptsMade >= maxAttempts check above — but getting that comparison right depends on knowing exactly what attemptsMade contains at each point, which isn't obvious from the type signature alone. I checked this directly against a running BullMQ instance rather than assuming:
attemptLog (job.attemptsMade as seen INSIDE the processor on each run):
[ { attemptsMade: 0, opts: 3 },
{ attemptsMade: 1, opts: 3 },
{ attemptsMade: 2, opts: 3 } ]
final 'completed' event attemptsMade: 3
attemptsMade is 0 on the very first execution, not 1 — it counts completed prior attempts, not the current attempt number. Inside process(), a job configured with attempts: 3 sees 0, 1, 2 across its three tries; the 'completed'/'failed' event handlers see it afterward, already incremented to 3. Getting this backwards (checking attemptsMade <= maxAttempts instead of >=, or assuming attempt 1 reads as 1 inside the processor) produces a retry condition that's off by exactly one — either giving up one attempt early, or never giving up at all.
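The counter semantics above are easy to model without Redis. The sketch below is a toy simulation, not BullMQ code: simulateJob and isFinalFailure are hypothetical names, and the rule that an attempt fails while attemptsMade is below failuresBeforeSuccess is an assumption about how a simulated upstream would behave.

```typescript
// Toy model of the attempt counter described above; it does not use BullMQ.
interface AttemptTrace {
  seenInsideProcessor: number[]; // attemptsMade as process() would read it
  finalAttemptsMade: number; // attemptsMade as the final event handler reads it
  outcome: 'completed' | 'failed';
}

// Same comparison as the onFailed handler: only the last allowed attempt is final.
function isFinalFailure(attemptsMadeAfterFailure: number, configuredAttempts?: number): boolean {
  const maxAttempts = configuredAttempts ?? 1;
  return attemptsMadeAfterFailure >= maxAttempts;
}

function simulateJob(configuredAttempts: number, failuresBeforeSuccess: number): AttemptTrace {
  const seenInsideProcessor: number[] = [];
  let attemptsMade = 0;
  while (true) {
    seenInsideProcessor.push(attemptsMade); // 0 on the first run
    const failed = attemptsMade < failuresBeforeSuccess;
    attemptsMade += 1; // incremented before the event handlers observe it
    if (!failed) {
      return { seenInsideProcessor, finalAttemptsMade: attemptsMade, outcome: 'completed' };
    }
    if (isFinalFailure(attemptsMade, configuredAttempts)) {
      return { seenInsideProcessor, finalAttemptsMade: attemptsMade, outcome: 'failed' };
    }
  }
}
```

Under those assumptions, simulateJob(3, 2) reads 0, 1, 2 inside the processor and finishes completed with a final count of 3, while simulateJob(3, 5) reads the same three values and finishes failed at 3.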
What "idempotent" actually means here
Using the Postgres row's id as the BullMQ jobId means re-enqueueing the same id is a no-op — but it's worth being precise about what that actually does, rather than taking it on faith. I tested this directly:
await queue.add('task', { n: 1 }, { jobId: 'fixed-id' }); // creates the job
await queue.add('task', { n: 2 }, { jobId: 'fixed-id' }); // returns a Job object, but...
const stored = await queue.getJob('fixed-id');
console.log(stored?.data); // { n: 1 }: the SECOND add() never took effect
The second add() call doesn't throw, and it doesn't error — it just silently does nothing. The job already in Redis under that id keeps its original data. This holds whether the original job is still waiting, actively processing, or has already completed (as long as it hasn't been cleaned up by removeOnComplete).
That's exactly the property you want for a reconciliation or retry path elsewhere in your own backend: if something upstream of enqueue() ever calls it twice for the same logical request — a retried HTTP call, a duplicate webhook delivery, a race in a distributed system — the second call doesn't create a second job, doesn't reprocess, and doesn't overwrite the first job's data with whatever the second call happened to pass. It's a much cheaper idempotency guarantee than building your own deduplication table, but it only works because the jobId is something stable and meaningful (the Postgres row's id) rather than an auto-generated one.
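For unit tests of code that calls enqueue(), the same "first add wins" behavior can be imitated in memory. The class below is a hypothetical stand-in that reproduces only the observable result described above; it says nothing about how BullMQ implements it:

```typescript
// In-memory stand-in for the "first add wins" behaviour described above.
// It imitates the observable result only; it is not how BullMQ is implemented.
class FirstWriteWinsQueue<T> {
  private readonly jobs = new Map<string, T>();

  // Returns true if a new job was stored, false if the id already existed.
  add(jobId: string, data: T): boolean {
    if (this.jobs.has(jobId)) {
      return false; // duplicate id: keep the original payload untouched
    }
    this.jobs.set(jobId, data);
    return true;
  }

  get(jobId: string): T | undefined {
    return this.jobs.get(jobId);
  }
}

const demoQueue = new FirstWriteWinsQueue<{ n: number }>();
const firstAdd = demoQueue.add('fixed-id', { n: 1 }); // true
const secondAdd = demoQueue.add('fixed-id', { n: 2 }); // false
const storedData = demoQueue.get('fixed-id'); // { n: 1 }
```

Unlike the real add(), this stand-in reports whether the job was new, which is a deliberate difference that makes duplicate calls easy to assert on in tests.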
Setting up and running the project
Clone the repo, install dependencies, and copy the environment template:
git clone <your-repo-url>
cd bullmq-nestjs-demo
npm install
cp .env.example .env
This project needs both Postgres and Redis running locally — Postgres for the durable job records, Redis for BullMQ's own queue state. The fastest path for both is Docker:
docker run --name jobs-postgres \
  -e POSTGRES_PASSWORD=postgres -e POSTGRES_DB=jobs_demo \
  -p 5432:5432 -d postgres:16
docker run --name jobs-redis -p 6379:6379 -d redis:7
The defaults in .env.example already line up with those two containers, so you shouldn't need to edit .env if you used them as-is. Then start the API:
npm run start:dev
synchronize: true is enabled in app.module.ts for this demo, so the draft_job table is created automatically on first boot — no manual migration step needed to follow along. (Turn that off and switch to real migrations before deploying anywhere real.) The worker runs inside this same process for simplicity here; see the production notes below on why you'd typically split it out.
With the server up on http://localhost:3000, you're ready to run through the flow below.
Testing it end-to-end
This is the actual sequence run against a live server, real Postgres, and real Redis before publishing — happy path, retry-then-succeed, permanent failure, and the validation/not-found edge cases:
# Happy path: no simulated failures
curl -X POST http://localhost:3000/drafts \
  -H "Content-Type: application/json" \
  -d '{"topic":"NestJS background jobs"}'
# -> { "id": "...", "status": "pending" }
curl http://localhost:3000/drafts/<id>
# -> { "status": "completed", "attemptsMade": 1, "result": "..." }

# Forces 2 failures before success: watch attemptsMade end at 3,
# with ~2s then ~4s of exponential backoff between attempts
curl -X POST http://localhost:3000/drafts \
  -H "Content-Type: application/json" \
  -d '{"topic":"Retry demo","simulateFailures":2}'
curl http://localhost:3000/drafts/<id>
# (poll a few times over ~6-8s)
# -> { "status": "completed", "attemptsMade": 3, "result": "..." }

# More failures than the configured 3 attempts allow: exercises the
# permanent-failure path instead
curl -X POST http://localhost:3000/drafts \
  -H "Content-Type: application/json" \
  -d '{"topic":"Permanent failure demo","simulateFailures":5}'
curl http://localhost:3000/drafts/<id>
# -> { "status": "failed", "attemptsMade": 3, "failureReason": "Simulated upstream failure..." }

# Validation and not-found paths, checked the same way
curl -X POST http://localhost:3000/drafts \
  -H "Content-Type: application/json" \
  -d '{"topic":"a"}'
# -> 400 Bad Request: topic is shorter than the DTO's @MinLength(3)
curl http://localhost:3000/drafts/00000000-0000-0000-0000-000000000000
# -> 404 { "message": "Draft job not found" }
Every path above — immediate success, eventual success after retries, permanent failure after exhausting retries, a rejected validation error, and an unknown id — was checked against the actual status, attemptsMade, and HTTP status code returned by a live server, not just "the request didn't error." The duplicate-jobId idempotency behavior from the previous section was verified the same way, against this same running instance, using a small script that called queue.add() directly rather than going through the HTTP API (since triggering a raw duplicate enqueue isn't something a normal client request can do on its own).
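The "poll a few times" step in the walkthrough can be scripted on the client side. A minimal sketch, assuming a hypothetical pollUntilSettled helper: fetchStatus and sleep are injected so the code does not presume any particular HTTP client or timer, and the default interval and poll count are arbitrary choices that simply exceed the ~6-8s retry window:

```typescript
// Hypothetical client-side polling helper. `fetchStatus` and `sleep` are
// injected so the sketch does not assume any particular HTTP client or timer.
type DraftStatus = 'pending' | 'processing' | 'completed' | 'failed';

interface DraftView {
  status: DraftStatus;
  attemptsMade: number;
}

async function pollUntilSettled(
  fetchStatus: () => Promise<DraftView>,
  sleep: (ms: number) => Promise<void>,
  intervalMs = 1000,
  maxPolls = 15,
): Promise<DraftView> {
  for (let poll = 0; poll < maxPolls; poll++) {
    const view = await fetchStatus();
    if (view.status === 'completed' || view.status === 'failed') {
      return view; // terminal state: stop polling
    }
    await sleep(intervalMs);
  }
  throw new Error(`Draft job did not settle after ${maxPolls} polls`);
}
```

In a real client, fetchStatus would wrap a GET to /drafts/<id>, and sleep would typically be (ms) => new Promise((resolve) => setTimeout(resolve, ms)).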
Production notes
A few things were deliberately simplified for this demo and are worth tightening before shipping:
Run the worker as a separate process from the API. Here, the processor lives inside the same NestJS app as the controller — fine for a demo, but in production a traffic spike on the HTTP side will compete for CPU with job processing unless they're split into independent, independently-scalable deployments.
Distinguish retryable from non-retryable errors. A timeout or a 429 should retry. A 401 from a bad API key will fail identically on every attempt; throw new UnrecoverableError(...) (exported by BullMQ) skips the remaining retries instead of wasting them on a guaranteed failure.
Size concurrency to what the upstream and your database can actually sustain. concurrency: 5 here is a demo default, not a number derived from real capacity.
Monitor queue depth, not just individual job outcomes. A queue that's silently backing up faster than it's draining is a different failure mode than any single job failing, and won't show up by looking at one job's status at a time.
There's no authentication on these endpoints. This demo is scoped to the queueing mechanics, not access control: anyone who can reach POST /drafts can enqueue work, and anyone who knows (or guesses) a job id can read its result. If this sits behind a real API, put it behind the same kind of auth boundary as anything else you wouldn't want publicly writable (a prior post on this blog covers building TOTP-based 2FA in NestJS, if that's useful context for the kind of guard logic involved).
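The retryable versus non-retryable split from the notes above can start as a small classifier. This is a sketch, not a recommendation for any specific provider: the 429, timeout, and 401 cases come from the text, while treating 408 and 5xx as retryable and every other status as permanent is an assumption, and classifyUpstreamStatus is a hypothetical name:

```typescript
// Hypothetical classifier for upstream HTTP failures. The 429, timeout, and
// 401 cases come from the notes above; treating 408 and 5xx as retryable and
// everything else as permanent is an assumption to adjust per provider.
type RetryDecision = 'retry' | 'give-up';

function classifyUpstreamStatus(status: number | 'timeout'): RetryDecision {
  if (status === 'timeout') return 'retry';
  if (status === 429 || status === 408) return 'retry';
  if (status >= 500 && status <= 599) return 'retry';
  return 'give-up'; // e.g. 400, 401, 403: the same request will fail again
}
```

Inside process(), a 'give-up' result is the point where throwing UnrecoverableError would fit, while a 'retry' result can be rethrown as an ordinary error so the configured attempts and backoff apply.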
Source code
The complete, tested implementation — NestJS module, the worker, entity, DTOs, and a README with the full setup and curl walkthrough — is available as a standalone repository: bullmq-nestjs-demo. Clone it, point it at your own Postgres and Redis, and the enqueue → retry → status-poll flow above works out of the box.