Or: How a production bug reminded me that architecture is really about contracts.

Recently I found myself investigating a production issue that, at first glance, didn’t make much sense.

A system was downloading PDF files, storing them, and passing them along to another component for processing. It had been doing this successfully hundreds of times a day.

Until it wasn’t.

Occasionally a PDF would arrive truncated.

Not corrupted in the traditional sense.

Not unreadable because of random bits.

Just… incomplete.

The first instinct was to look for an obvious failure.

Did the download fail?

No.

Did the write fail?

No.

Did the upload fail?

No.

Every step reported success.

Yet the consumer occasionally received a file that was missing tens of thousands of bytes.

The software wasn’t lying.

Our assumptions were.

What Does “Success” Actually Mean?

One of the easiest traps in software engineering is assuming an API guarantees more than it actually does.

A successful write operation tells you something important:

The operating system accepted your request to write data, often into its own buffers rather than onto the disk itself.

That’s a real guarantee.

But notice what it doesn’t necessarily guarantee.

It doesn’t tell you:

  • every byte you intended to write is actually present in the file,
  • another process won’t observe a partially written file,
  • the file is complete,
  • the contents are valid,
  • or that another component will read exactly what you expected it to read.
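The first item on that list is easy to show in code. This is a minimal sketch, assuming Python and a raw file descriptor; `write_all` is a hypothetical helper, not the code from the incident. `os.write` returns the number of bytes the operating system actually accepted, which may be fewer than the number you passed in, so the caller has to check the count and keep going.

```python
import os


def write_all(fd: int, data: bytes) -> int:
    """Call os.write repeatedly until every byte of data has been accepted.

    Hypothetical helper for illustration: a single os.write call may accept
    only part of the buffer and still count as a success.
    """
    view = memoryview(data)
    total = 0
    while total < len(view):
        written = os.write(fd, view[total:])
        if written == 0:
            # Nothing was accepted; stop instead of looping forever.
            raise OSError("write made no progress")
        total += written
    return total
```

Even with this loop in place, the remaining items on the list are untouched: the bytes are accepted, but nothing yet says the file is complete, valid, or safe for another process to read.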

Those are different guarantees.

Some require additional synchronization.

Some require architectural patterns.

Some require explicit verification.

And some simply require everyone involved to agree on what “finished” actually means.

The Bug Wasn’t Really About Files

As I dug deeper, it became clear this wasn’t fundamentally a file-system problem.

It was an architectural one.

One process assumed:

“If the file exists, it’s ready.”

Another process assumed:

“If the write returned success, the file is complete.”

Neither assumption was unreasonable.

Neither assumption was actually guaranteed.

Between those two assumptions lived the bug.
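To make that gap concrete, here is a small sketch in Python. It is an illustration under my own assumptions, not the code from the incident, and the names `observe_partial_file` and `report.pdf` are made up. It plays both roles in one process: the producer has written only half of the payload when the consumer checks whether the file exists and reads it.

```python
import os
import tempfile


def observe_partial_file(payload: bytes) -> tuple[bool, int]:
    """Return (file_exists, bytes_seen) as observed halfway through a write.

    Hypothetical demonstration: the producer writes the payload in two
    halves, and the consumer looks at the file between them.
    """
    half = len(payload) // 2
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "report.pdf")
        with open(path, "wb") as producer:
            producer.write(payload[:half])
            producer.flush()  # hand the first half to the operating system

            # Consumer's view: "if the file exists, it's ready."
            exists = os.path.exists(path)
            with open(path, "rb") as consumer:
                bytes_seen = len(consumer.read())

            producer.write(payload[half:])
        return exists, bytes_seen
```

Every call in the sketch succeeds, and the consumer still walks away with half a file. In a real system the two halves live in different processes, which is why the truncation only shows up occasionally.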

This is something I see repeatedly across software systems.

The code often behaves exactly as each individual component was designed.

The failure happens between components.

Making the Contract Explicit

The eventual solution wasn’t particularly exotic.

Instead of assuming the file was ready because it existed, the producer and consumer established a stronger contract.

The producer would:

  • write to a temporary file,
  • ensure the expected number of bytes had been written,
  • flush and sync the data to disk if durability mattered,
  • atomically rename the temporary file into its final location on the same file system,
  • and so make the file visible to consumers only once it was complete.
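The producer’s steps can be sketched roughly like this. It is a minimal Python illustration, not the production code: `publish_atomically` is a hypothetical name, and it assumes the temporary file lives in the same directory as the final file (so the rename stays on one file system) and that consumers ignore names ending in `.tmp`.

```python
import os
import tempfile


def publish_atomically(data: bytes, final_path: str) -> None:
    """Write data to a temporary file, verify it, then rename it into place.

    Hypothetical helper for illustration. Consumers are assumed to skip
    files ending in ".tmp", so they only ever see the finished file.
    """
    directory = os.path.dirname(os.path.abspath(final_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's buffer to the OS
            os.fsync(tmp.fileno())  # ask the OS to persist it (durability)

        # Check the expected number of bytes before anyone can see the file.
        actual_size = os.path.getsize(tmp_path)
        if actual_size != len(data):
            raise OSError(f"expected {len(data)} bytes, found {actual_size}")

        # os.replace swaps the name in one step on the same file system.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

The design choice that matters is the order: verification happens before the rename, so the final name never points at a half-written file.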

Now “file exists” actually meant something.

The architecture enforced the contract instead of relying on everyone to make the same assumption.

That’s a subtle but important distinction.

This Pattern Is Everywhere

Once you start looking for implicit contracts, you see them everywhere.

“We got an HTTP 200.”

Did the business operation succeed?

“The message was sent.”

Did anyone process it?

“The cache contains a value.”

Is it still correct?

“The deployment finished.”

Is the system healthy?

“The write succeeded.”

Is the file actually complete?

Many production incidents don’t happen because components violate their contracts.

They happen because we quietly assume contracts that were never promised.

Architecture Is About Contracts

People often think of architecture as boxes and arrows.

Databases.

Queues.

Microservices.

Cloud providers.

Those things matter.

But I’ve increasingly come to think that architecture is really about something else.

It’s about defining the contracts between those pieces.

More importantly, it’s about making implicit contracts explicit.

When those contracts remain assumptions, bugs hide in the gaps.

When they’re explicit, they become things we can reason about, verify, test, and enforce.

The architecture itself becomes more resilient.

A Lesson Worth Remembering

One of the recurring lessons I’ve learned over the years is this:

APIs report what they guarantee, not necessarily what we hope they guarantee.

As engineers, it’s our job to understand that distinction.

As architects, it’s our job to design systems that don’t depend on assumptions masquerading as guarantees.

Because good architecture isn’t just about organizing components.

It’s about making implicit contracts explicit.

And perhaps that’s one of the quiet ways software becomes more survivable.


Takeaways

  • Success is not correctness.
  • Acceptance is not completion.
  • Every API has guarantees. Bugs often begin where our assumptions exceed them.
  • Good architecture makes implicit contracts explicit.