COBOL.fyi

Leading Hub for COBOL and Mainframes

How to Build a Strangler Fig Pattern Around a COBOL Batch Job

Strangling a COBOL batch job is much harder than replacing a CICS transaction because batch systems have no clean API boundary, so teams must first map the job's dependencies and then build their own routing and validation layers.

Saturday, February 21, 2026 · 8 min read

If you have strangled a CICS online transaction before, you know the playbook. Stand up a service behind an API gateway, route traffic incrementally, retire the old program once the new one handles 100% of requests. Clean seams exist because CICS transactions already behave like request-response endpoints. Batch jobs offer no such luxury.

A COBOL batch job is orchestrated by JCL (Job Control Language, the z/OS scripting layer that sequences programs, allocates files, and controls condition codes). JCL is the API boundary, but it is an implicit one. There is no URL, no message queue, no service contract. The "interface" is a chain of EXEC PGM= steps reading and writing VSAM and QSAM datasets (indexed and sequential file formats native to the mainframe). Programs communicate through shared files, shared copybooks (reusable COBOL data structure definitions), and sometimes through CALL statements that reference other programs by a name stored in a variable, which static analysis misses entirely.

This is why batch jobs are harder to strangle than online programs. The natural API boundary that makes the strangler fig pattern elegant for CICS does not exist here. You have to build it.

What the Strangler Fig Pattern Actually Means for Batch COBOL

Martin Fowler coined the term in 2004, inspired by strangler figs he observed in Queensland rain forests. The pattern describes gradual replacement of a legacy system: new functionality grows around the old system, intercepts its inputs and outputs, and eventually replaces it entirely (Martin Fowler, StranglerFigApplication). The key principle is that you never do a big-bang rewrite. You build transitional architecture that lets both systems coexist, then retire the old one piece by piece.

For a CICS transaction, the "seam" is obvious: it already responds to a terminal or network request. For a batch job, the seam is buried. A nightly settlement run might execute 14 JCL steps, each invoking a different COBOL program, each reading output files from the previous step, some updating shared VSAM clusters that other jobs in the schedule also touch. There is no single request to intercept. There is a pipeline.

The batch-specific version of the strangler fig requires you to treat each JCL step as a potential extraction point. You replace steps individually, not entire jobs. And you must solve a problem that CICS stranglers rarely face: keeping file-based data flows intact while one step runs modern code and the next step still expects a VSAM KSDS as input.

IN-COM Data Systems describes the prerequisites well: before any extraction, you need a "deep, documented understanding of legacy system operations, vulnerabilities, and component separability" (IN-COM, Strangler Fig Pattern in COBOL System Modernization). For batch, that means mapping every file dependency, every inter-program call, and every implicit assumption about execution order.

Step 1: Map the Job Before You Touch It

Most failed batch modernizations skip this step or do it superficially. A job that "just runs the nightly settlement" turns out to have 30 programs, 47 datasets, and a web of copybook dependencies that cross four application domains.

Cross-reference analysis. Start with the JCL itself. Every EXEC PGM= step names a load module. Every DD statement names a dataset. Build a graph: nodes are programs and datasets, edges are reads and writes. Tools like SMART TS XL or IBM Application Discovery can automate this, but even a manual pass through the JCL gives you the skeleton.
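A first pass at that graph can be scripted. The sketch below is illustrative only: `parse_jcl` and its DISP heuristics (DISP=SHR counts as a read, DISP=(NEW,...) as a write) are assumptions, and real JCL with continuations, PROCs, and symbolic parameters needs a proper scanner.

```python
import re
from collections import defaultdict

def parse_jcl(jcl_text):
    """Build a {program: {"reads": set, "writes": set}} map from JCL.

    Toy parser: associates each DD statement with the most recent
    EXEC PGM= step and classifies DISP=(NEW,...) as a write.
    """
    graph = defaultdict(lambda: {"reads": set(), "writes": set()})
    current = None
    for line in jcl_text.splitlines():
        m = re.search(r"EXEC\s+PGM=(\w+)", line)
        if m:
            current = m.group(1)
            continue
        m = re.search(r"DD\s+DSN=([\w.]+),DISP=(\S+)", line)
        if m and current:
            dsn, disp = m.group(1), m.group(2)
            key = "writes" if disp.startswith("(NEW") else "reads"
            graph[current][key].add(dsn)
    return dict(graph)
```

Run it over the settlement JCL shown later in this article and it yields LEGACYPGM reading PROD.SETTLE.INPUT and writing PROD.SETTLE.OUTPUT: the skeleton of the dependency map.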

Control flow mapping. JCL condition codes (COND parameters and IF/THEN/ELSE constructs) determine which steps execute. A step with COND=(4,LT) will be skipped if any prior step returned a code greater than 4: the test asks whether 4 is less than the prior return code, and a true test bypasses the step. These conditional paths create hidden branches in your batch pipeline that you must understand before you extract any step.
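The bypass rule is easy to get backwards: COND=(4,LT) bypasses the step when 4 is less than some prior step's return code. A minimal sketch of the evaluation, assuming the simple COND=(code,operator) form without step qualifiers:

```python
import operator

# JCL compares the COND code against each prior step's return code;
# the step is BYPASSED when the comparison is true for any prior step.
OPS = {"GT": operator.gt, "GE": operator.ge, "EQ": operator.eq,
       "NE": operator.ne, "LT": operator.lt, "LE": operator.le}

def step_bypassed(cond_code, cond_op, prior_return_codes):
    """True if a step coded COND=(cond_code,cond_op) is skipped."""
    return any(OPS[cond_op](cond_code, rc) for rc in prior_return_codes)

# COND=(4,LT): skipped only when some prior step returned more than 4.
# step_bypassed(4, "LT", [0, 8]) -> True   (4 < 8)
# step_bypassed(4, "LT", [0, 4]) -> False  (4 is not < 4)
```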

VSAM and QSAM file sharing. VSAM (Virtual Storage Access Method) files are the mainframe equivalent of a database table, and QSAM (Queued Sequential Access Method) files are flat sequential datasets. The critical question: which files are shared across jobs? A VSAM KSDS updated by your nightly job might also be read by a morning reporting job and an online CICS inquiry. Extracting the step that writes that file means your modern replacement must produce byte-compatible output.

Shared copybooks. COBOL programs define record layouts in copybooks. If five programs across three jobs all COPY the same member, changing the record layout in the modern replacement will break all of them. Catalog every copybook reference before you extract anything.

Implicit program calls. The statement CALL WS-PROGRAM-NAME, where WS-PROGRAM-NAME is a working storage variable populated at runtime, defeats static analysis. You need runtime trace data (SMF type 30 records or similar) to discover which programs actually get called. AWS Transform analyzes SMF data alongside source code to identify active batch jobs, MIPS consumption, and unused programs, which helps separate the living code from the dead.

The output of this step is a dependency map. Every program, every file, every copybook, every conditional path. If you cannot draw this map, you are not ready to strangle anything.

Step 2: Build the Intercept Layer

The strangler fig needs a point of interception. For batch, that point is the JCL step itself. Replace the EXEC PGM= directive for a target step with a routing program that decides at runtime whether to call the legacy COBOL program or the modern replacement.

This routing program is your API facade. It reads a feature flag (stored in a control file, a Db2 table, or an environment variable passed through JCL symbolic parameters), then either CALLs the original program or invokes the modern service via an HTTP call, MQ message, or whatever integration pattern your target architecture uses.

Here is what the JCL transformation looks like:

Before (original JCL):

```
//SETTLE   JOB  (ACCT),'NIGHTLY SETTLEMENT',CLASS=A
//STEP1    EXEC PGM=LEGACYPGM
//INPUT    DD   DSN=PROD.SETTLE.INPUT,DISP=SHR
//OUTPUT   DD   DSN=PROD.SETTLE.OUTPUT,DISP=(NEW,CATLG),
//              SPACE=(CYL,(50,10)),DCB=(RECFM=FB,LRECL=200)
//SYSOUT   DD   SYSOUT=*
```

After (with routing program):

```
//SETTLE   JOB  (ACCT),'NIGHTLY SETTLEMENT',CLASS=A
//STEP1    EXEC PGM=ROUTERPGM,
//              PARM='STEP=SETTLE01,FLAG=PROD.FEATURE.FLAGS'
//INPUT    DD   DSN=PROD.SETTLE.INPUT,DISP=SHR
//OUTPUT   DD   DSN=PROD.SETTLE.OUTPUT,DISP=(NEW,CATLG),
//              SPACE=(CYL,(50,10)),DCB=(RECFM=FB,LRECL=200)
//FLAGS    DD   DSN=PROD.FEATURE.FLAGS,DISP=SHR
//SYSOUT   DD   SYSOUT=*
```

ROUTERPGM pseudocode (COBOL; the SELECT/FD entries for FLAGS-FILE and the INPUT-RECORD/OUTPUT-RECORD definitions are omitted for brevity):

```
       IDENTIFICATION DIVISION.
       PROGRAM-ID. ROUTERPGM.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  WS-FLAG-RECORD.
           05  WS-STEP-ID            PIC X(10).
           05  WS-USE-MODERN         PIC X(1).
               88  ROUTE-MODERN      VALUE 'Y'.
               88  ROUTE-LEGACY      VALUE 'N'.
       01  WS-RETURN-CODE            PIC S9(4) COMP.
       PROCEDURE DIVISION.
      *    Read the feature flag for this step.
           OPEN INPUT FLAGS-FILE
           READ FLAGS-FILE INTO WS-FLAG-RECORD
           CLOSE FLAGS-FILE
      *    Route to the modern service or the legacy program.
           EVALUATE TRUE
               WHEN ROUTE-MODERN
                   CALL 'MODERNAPI' USING INPUT-RECORD
                                          OUTPUT-RECORD
               WHEN ROUTE-LEGACY
                   CALL 'LEGACYPGM' USING INPUT-RECORD
                                          OUTPUT-RECORD
           END-EVALUATE
      *    The called program sets RETURN-CODE; preserve it for the step.
           MOVE RETURN-CODE TO WS-RETURN-CODE
           STOP RUN.
```

The routing program preserves the original DD statements. Downstream steps still read the same OUTPUT dataset. The only change visible to the rest of the job stream is the EXEC PGM= name and the addition of the FLAGS DD.

This pattern aligns with what IN-COM describes as the "API facade layer" that "acts as a controlled entry point intercepting calls to legacy COBOL logic and redirecting to modernized services" (IN-COM, Strangler Fig Pattern in COBOL System Modernization). The difference in the batch context is that your facade is a COBOL program itself, not an API gateway appliance.

If your modern replacement is an off-mainframe service, ROUTERPGM will need to make an outbound call. IBM z/OS Connect or a lightweight MQ bridge can handle this. The routing program writes input to an MQ queue, the off-platform service processes it and writes the result back, and ROUTERPGM writes the output dataset in the format downstream steps expect. Batch the records into chunks; an HTTP round-trip per record will blow your batch window.
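The chunking wrapper is straightforward to sketch. Here `send_batch` is a hypothetical stand-in for whatever MQ put or HTTP POST your bridge actually uses; the point is only that it is invoked once per chunk, not once per record.

```python
from typing import Callable, Iterable, List

def process_in_chunks(records: Iterable[bytes],
                      send_batch: Callable[[List[bytes]], List[bytes]],
                      chunk_size: int = 500) -> List[bytes]:
    """Group records into chunks so the off-platform service is called
    once per chunk. `send_batch` is an assumed integration hook, not a
    real API; it must return one output record per input record.
    """
    out: List[bytes] = []
    chunk: List[bytes] = []
    for rec in records:
        chunk.append(rec)
        if len(chunk) == chunk_size:
            out.extend(send_batch(chunk))
            chunk = []
    if chunk:                      # flush the final partial chunk
        out.extend(send_batch(chunk))
    return out
```

With a chunk size of 500, a million-record file costs 2,000 round trips instead of 1,000,000; tune the size against your message-size limits and batch window.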

Step 3: Coexistence and Data Synchronization

The dual-running phase is where most complexity lives. Some steps run modern code, others run legacy COBOL. Both need consistent data.

One-way sync (legacy to modern). Legacy VSAM files remain the system of record. Change data capture (CDC) replicates updates to the modern datastore. Tools like IBM InfoSphere CDC or Precisely Connect can stream VSAM changes to Kafka, which feeds your cloud-side database. The modern service reads from its own store but writes results back in VSAM format for downstream legacy steps. Use this when most of the job stream is still legacy.

One-way sync (modern to legacy). Once most steps have been migrated, flip the direction. The modern datastore becomes the system of record. A sync process writes back to VSAM for the remaining legacy steps. This is the endgame position before full decommission.

Bi-directional sync. Avoid this if you can. Bi-directional replication between VSAM and a cloud database introduces conflict resolution problems that are hard to reason about in a batch context. If you must do it, use a last-writer-wins strategy with timestamps and accept occasional update losses during transition. IN-COM notes that data sync layers should be "performance-tested for peak workloads to avoid latency spikes" (IN-COM, Strangler Fig Pattern in COBOL System Modernization). In batch, peak workload is the entire batch window. Test under full volume.
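If you do accept last-writer-wins, the merge itself is simple; the hard part is trusting the timestamps. A minimal sketch, assuming each record is keyed and carries a comparable `updated_at` field (both assumptions, not a prescription):

```python
def lww_merge(legacy: dict, modern: dict) -> dict:
    """Merge two keyed record sets, keeping the later 'updated_at'
    version of each record. Ties deterministically favor the modern
    side so repeated merges are stable.
    """
    merged = dict(legacy)
    for key, rec in modern.items():
        if key not in merged or rec["updated_at"] >= merged[key]["updated_at"]:
            merged[key] = rec
    return merged
```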

The critical rule: never let the modern service and the legacy program write to the same file simultaneously. VSAM record-level sharing (RLS) supports concurrent access, but mixing a legacy batch update with a modern service writing through a CDC bridge is a recipe for integrity failures. Serialize access by step.

For teams evaluating how to prioritize which components to migrate first, the data sync burden should be a key input. Steps that write to heavily shared datasets are expensive to strangle. Start with steps that write to datasets consumed by only one or two downstream programs.

Step 4: Behavioral Equivalence Testing

This is the step most teams skip, and it is the step that causes the most production incidents.

Batch COBOL programs produce deterministic output for a given input. The same input file, processed by the same program, with the same reference data, will produce byte-for-byte identical output every time. Your modern replacement must do the same. Not "functionally equivalent." Byte-for-byte identical. The downstream step that reads your output expects records at specific offsets, in specific EBCDIC encoding, with specific packed decimal formats.

Golden master testing. Capture a production input file and its corresponding output. This is your golden master. Run the same input through your modern service. Compare the output byte by byte. Any difference is a bug. This technique was recommended in a discussion on r/softwarearchitecture, where practitioners advised: "Cover the flows with e2e and integration tests, then use strangler fig pattern to migrate it piece by piece." The golden master is your e2e test for batch.
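The comparison itself should be deliberately dumb. A sketch for fixed-length (RECFM=FB) output, assuming both files were transferred in binary so no codepage conversion has touched the bytes:

```python
def compare_golden(master_path: str, candidate_path: str, recl: int = 200):
    """Byte-for-byte compare of two fixed-length record files.

    Returns a list of (record_number, byte_offset) pairs marking the
    first differing byte in each mismatched record; empty means the
    candidate matches the golden master exactly.
    """
    diffs = []
    with open(master_path, "rb") as m, open(candidate_path, "rb") as c:
        recno = 0
        while True:
            a, b = m.read(recl), c.read(recl)
            if not a and not b:
                break
            recno += 1
            if a != b:
                offset = next((i for i, (x, y) in enumerate(zip(a, b))
                               if x != y), min(len(a), len(b)))
                diffs.append((recno, offset))
    return diffs
```

Reporting the record number and offset matters: the offset maps straight back to a field in the copybook, which tells you which piece of logic diverged.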

Character encoding traps. COBOL on z/OS uses EBCDIC. Your modern service almost certainly uses ASCII or UTF-8. A field that reads "SMITH" in EBCDIC is hex E2D4C9E3C8. In ASCII it is hex 534D495448. If your modern service writes ASCII to a file that a downstream COBOL program reads as EBCDIC, every character will be wrong. Handle the conversion explicitly and test it obsessively.
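Python's standard codecs make this concrete. This assumes cp037 (US/Canada EBCDIC) is the right code page for your shop; many sites use a different one, so verify before relying on it:

```python
# Python ships EBCDIC codecs in the stdlib; cp037 is a common z/OS code page.
name = "SMITH"
ebcdic = name.encode("cp037")
assert ebcdic.hex().upper() == "E2D4C9E3C8"   # matches the hex above

# Round-trip is safe when the conversion is explicit:
assert ebcdic.decode("cp037") == name

# The failure mode: ASCII bytes read as EBCDIC decode without error into
# different characters, so the corruption is silent.
assert name.encode("ascii").decode("cp037") != name
```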

Packed decimal precision. COBOL COMP-3 fields store numbers in packed decimal format. A PIC S9(7)V99 COMP-3 field occupies 5 bytes with exact decimal precision. IEEE binary floating point cannot represent most decimal fractions exactly. If your modern service uses a float or double to represent currency, you will get rounding differences on roughly 1 in 10,000 records. Use a decimal type (BigDecimal in Java, decimal in C#, Decimal in Python).

Parallel run strategy. Run both the legacy program and the modern service on the same input. Compare outputs automatically. Do this for at least two full business cycles. IN-COM recommends "automated regression testing, golden master comparisons, transaction mirroring" as core validation techniques (IN-COM, Strangler Fig Pattern in COBOL System Modernization).

Do not skip parallel runs to save time. Every week of parallel running that catches a discrepancy saves you a weekend of emergency production fixes.

Step 5: Decommission Without Regret

Decommission is a decision, not an event. You need clear criteria and a monitoring period.

Criteria for retirement. The modern service has processed 100% of production volume through the routing program for at least 30 consecutive days with zero discrepancies in golden master testing. No manual overrides to the feature flag have been needed. Operations staff confirm that monitoring, alerting, and runbook procedures are in place for the modern path.

The 30/60/90 monitoring window. After flipping the flag permanently to the modern path:

  • Day 1 to 30: Keep the legacy load module in production libraries. The routing program stays in the JCL. If anything fails, flip the flag back in under a minute.
  • Day 31 to 60: Remove the routing program. Replace with a direct EXEC PGM= to the modern wrapper. Archive the legacy source and load modules.
  • Day 61 to 90: Decommission VSAM datasets used only for legacy processing. Update disaster recovery runbooks. Close the change ticket.

Resist the temptation to decommission immediately after parallel runs succeed. Batch jobs encounter edge cases on specific calendar boundaries (month-end, quarter-end, leap years, holidays that shift processing dates). Your 30-day parallel run might not have covered all of them.

Where This Pattern Breaks Down

The strangler fig is not universal. Several batch-specific scenarios resist it.

Implicit timing dependencies. Some batch jobs depend on running at a specific point in a schedule, not just on their input data. A job that must run after the general ledger close but before the overnight interest calculation has a temporal coupling that is invisible in the JCL. Mainframe job schedulers (CA-7, TWS, Control-M) encode these dependencies. Your modern orchestrator must replicate them exactly.

Shared file mutations within a job stream. If steps 3, 7, and 12 of a 15-step job all update the same VSAM file, you cannot strangle step 7 in isolation. The modern replacement must read the file as step 3 left it and leave it in the state step 12 expects. In practice, you often have to strangle steps 3, 7, and 12 together as a unit.

Undocumented business logic. A COBOL program written in 1987 contains business rules that no living employee understands. Nobody knows why it multiplies the settlement amount by 1.00375 on the third Tuesday of each quarter, but it does, and the auditors expect it. AWS Transform uses agentic AI to extract business logic from COBOL source code into natural language specifications, which helps. But AI extraction requires human validation, and if nobody understands the rule, validation is hollow.

Performance-critical batch windows. A job that must complete in a 4-hour overnight window leaves no room for the overhead of a routing program, CDC sync, and parallel runs. If your batch window is already at 95% capacity, the strangler fig may be too expensive in elapsed time. You may need to expand the batch window or accept a riskier cutover.

Here is the honest observation about strangler fig migrations for COBOL batch: most teams never finish. They strangle the first three or four high-value steps, declare success, and live with a hybrid architecture for years. The remaining steps are low-risk, low-change programs that cost more to migrate than to maintain. That hybrid state becomes the permanent architecture.

This is not failure. The point of the strangler fig was never to eliminate every line of COBOL. It was to reduce risk, unlock business agility where it matters most, and stop the bleeding on the components that change frequently. If you strangle the 20% of batch steps that cause 80% of your maintenance burden, you have succeeded. The remaining COBOL programs will run for another decade, quietly, reliably, and cheaply. That is what "done" actually looks like.

#cobol #migration #strangler-fig