Sometimes I just want the real number: how long does a task take when I press Enter? No demos, no ideal conditions — just wall-clock and a fair comparison.

So I ran a simple experiment. The job was straightforward: read a source file, read a static-analysis report, decide whether each finding is real or a false positive, and return results as newline-delimited JSON (NDJSON). No rewrites. No fixes. Just validation.


Why This Project Exists

The goal is simple: find inefficiencies in source code — the habits and patterns that waste time, energy, memory, or money. This work builds on the soon-to-be-published Software Inefficiency Catalog from the Sustainable IT Manifesto Foundation. The catalog names the patterns; this project operationalizes them.

AST-based static analysis is powerful but noisy. It flags patterns without understanding context. A classic example: a rule that flags string concatenation in loops might misinterpret an integer increment (x += 1) as string concatenation. The pattern looks similar, but the intent is different — a false positive. Enough of those, and teams stop trusting the results.
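
To show how that happens, here is a deliberately naive sketch of such a rule, written with Python's ast module. It is illustrative only, not the actual ECO-PY-000 check: without type information, a numeric += and a string += produce the same node shape, so both get flagged:

import ast

SOURCE = """
for i in range(10):
    x += 1          # numeric increment
    s += str(i)     # genuine string concatenation
"""

class NaiveConcatInLoop(ast.NodeVisitor):
    """Flags every += with a plus operator inside a for-loop, with no type info."""

    def __init__(self):
        self.findings = []

    def visit_For(self, node):
        for child in ast.walk(node):
            if isinstance(child, ast.AugAssign) and isinstance(child.op, ast.Add):
                # x += 1 and s += str(i) have the same node shape here,
                # so both get reported -- one of them wrongly.
                self.findings.append((child.lineno, child.col_offset))
        self.generic_visit(node)

checker = NaiveConcatInLoop()
checker.visit(ast.parse(SOURCE))
for line, col in checker.findings:
    print(f"example.py:{line}:{col}: warning [ECO-PY-000] string concatenation in loop")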

To fix that, I added a lightweight LLM validation layer.
The AST scanner catches everything; the LLM filters out the noise. It reads the file, checks each reported issue, and decides whether it’s real or not — outputting just a yes/no (ok) and a one-line reason (y) in NDJSON format.

No rewrites. No refactors. Just validation.


What This Catches (and Filters)

  • Keeps: real issues such as broad except Exception, missing upload limits (MAX_CONTENT_LENGTH), and true string concatenation in loops.
  • Filters: look-alikes and false alarms, like numeric increments mistaken for string operations, or loops that appear N+1 but actually batch.
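
To make both buckets concrete, here is a small, self-contained toy example; the helper functions are stand-ins of my own, not code from the project:

rows = [{"id": i} for i in range(3)]

def render(row):
    return f"<li>{row['id']}</li>"

def insert_bulk_assets(items):
    print(f"inserted {len(items)} rows in one call")

# Kept: real string concatenation in a loop (true positive)
html = ""
for row in rows:
    html += render(row)

# Kept: broad exception handling (true positive)
try:
    int("not a number")
except Exception:
    pass

# Filtered: numeric increment that only looks like string concat (false positive)
count = 0
for row in rows:
    count += 1

# Filtered: reported as "possible N+1", but it is one batched call (false positive)
insert_bulk_assets(rows)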

The Flow

+----------------+         +------------------+         +----------------+
|  Source Files  | ----->  |  AST Analyzer    | ----->  |  Findings File |
+----------------+         +------------------+         +----------------+
                                    |
                                    v
                        +------------------------+
                        | LLM Validator (NDJSON) |
                        +------------------------+
                                    |
                                    v
                         +----------------------+
                         | CI / Dashboard / DB  |
                         +----------------------+

The AST does the wide sweep; the LLM applies judgment. The output rolls back into CI or a dashboard — cleaner, higher trust, and actionable.
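
As a sketch of that last step (the file name and exit-code policy are my own assumptions, not part of the pipeline), a CI gate could consume the validator's NDJSON like this:

import json
import sys

# Read the validator's NDJSON and fail the build if any confirmed findings remain.
# "validated.ndjson" is an assumed file name for the LLM's output.
confirmed = []
with open("validated.ndjson", encoding="utf-8") as fh:
    for line in fh:
        line = line.strip()
        if not line:
            continue
        finding = json.loads(line)
        if finding.get("ok") == 1:
            confirmed.append(finding)

for f in confirmed:
    print(f"{f['f']}:{f['l']}:{f['c']}: {f['i']} ({f['y']})")

sys.exit(1 if confirmed else 0)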


Example: False Positive Correction

Before (AST finding):

runtime_analysis.py:46:12: warning [ECO-PY-000] string concatenation in loop

Actual code:

for i in range(10):
    x += 1

LLM output:

{"f":"runtime_analysis.py","l":46,"c":12,"i":"string concatenation in loop","ok":0,"y":"numeric increment, not string concat"}

That’s the idea: the model keeps the focus on real inefficiencies and filters out the noise.


The Prompt I Used (Exact)

**Purpose:** Validate whether each issue reported by an AST or static analysis tool is a true or false positive by inspecting the code directly, and output the results as newline-delimited JSON with short keys for maximum portability and compactness.

---

**Prompt:**

I am uploading two files:

1. A **source code file** (any language — `.py`, `.js`, `.java`, `.go`, `.cs`, etc.)
2. An **AST or static analysis report** (e.g., `analysis.txt`)

Each line of the analysis report includes:

* Filename
* Line number
* Offset / column
* Potential issue description

Please read both files, cross-check each reported issue, and validate whether the analysis finding is correct or a false positive.

---

**Output format:**
Return results as **newline-delimited JSON (NDJSON)** using the following short keys:

| Key  | Meaning                                                                 |
| ---- | ----------------------------------------------------------------------- |
| `f`  | Filename (string)                                                       |
| `l`  | Line number (int)                                                       |
| `c`  | Offset / column (int)                                                   |
| `i`  | Issue description (string)                                              |
| `ok` | Whether the analysis is correct (1 = true positive, 0 = false positive) |
| `y`  | Short explanation (“why”)                                               |

**Example Output (NDJSON):**

{"f":"runtime_analysis.py","l":1,"c":0,"i":"MAX_CONTENT_LENGTH not set","ok":1,"y":"No reference found in file"}
{"f":"runtime_analysis.py","l":25,"c":4,"i":"broad except Exception","ok":1,"y":"Found at line 25"}
{"f":"runtime_analysis.py","l":98,"c":22,"i":"possible N+1","ok":0,"y":"Batched insert via insert_bulk_assets"}
{"f":"runtime_analysis.py","l":46,"c":12,"i":"string concat in loop","ok":0,"y":"Numeric increment, not string concat"}

---

**Rules:**

1. Validate only the accuracy of the analysis — no fixes or rewrites.
2. Mark `"ok":1` only if the issue is actually present and correctly identified.
3. Mark `"ok":0` if the analysis misinterpreted the code (false positive).
4. Provide a concise reason in `"y"`.
5. Assume code compiles; analyze statically only.
6. Return only NDJSON — no extra commentary, markdown, or explanations.
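
If you want to reproduce this with a local model, a minimal sketch of the call might look like the following, assuming an OpenAI-compatible endpoint such as LM Studio's default local server; the base_url, model name, and prompt file path are placeholders, not my exact setup:

from pathlib import Path
from openai import OpenAI

# Assumptions: LM Studio's default local endpoint, a model name from the results
# table below, and a "validate_prompt.md" file holding the prompt above.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

prompt = Path("validate_prompt.md").read_text(encoding="utf-8")
source = Path("runtime_analysis.py").read_text(encoding="utf-8")
report = Path("analysis.txt").read_text(encoding="utf-8")

response = client.chat.completions.create(
    model="qwen/qwen3-coder-30b",
    temperature=0,
    messages=[
        {"role": "system", "content": prompt},
        {"role": "user", "content": (
            "SOURCE FILE (runtime_analysis.py):\n" + source
            + "\n\nANALYSIS REPORT (analysis.txt):\n" + report
        )},
    ],
)

print(response.choices[0].message.content)  # NDJSON, one finding per line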

The Report (analysis.txt)

runtime_analysis.py:1:0: warning [ECO-PY-000] MAX_CONTENT_LENGTH not set; uploads may be unbounded
runtime_analysis.py:25:4: warning [ECO-PY-000] broad except Exception (narrow to specific exceptions)
runtime_analysis.py:98:22: warning [ECO-PY-000] possible N+1 query pattern in loop; batch or prefetch instead
runtime_analysis.py:46:12: warning [ECO-PY-000] string concatenation in loop (use list + ''.join(...) or io.StringIO)
runtime_analysis.py:51:8: warning [ECO-PY-000] string concatenation in loop (use list + ''.join(...) or io.StringIO)
runtime_analysis.py:124:20: warning [ECO-PY-000] string concatenation in loop (use list + ''.join(...) or io.StringIO)
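
Each line follows the usual file:line:col: severity [code] message shape. If you want to post-process the report outside the LLM, a small parser could look like this; the regex and field names are mine, not part of the project:

import re

# Matches lines like: runtime_analysis.py:46:12: warning [ECO-PY-000] message...
FINDING_RE = re.compile(
    r"^(?P<file>[^:]+):(?P<line>\d+):(?P<col>\d+):\s+"
    r"(?P<severity>\w+)\s+\[(?P<code>[^\]]+)\]\s+(?P<message>.*)$"
)

with open("analysis.txt", encoding="utf-8") as fh:
    for raw in fh:
        m = FINDING_RE.match(raw.strip())
        if m:
            print(m.group("file"), m.group("line"), m.group("col"), m.group("message"))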

Hardware and Test Setup

All local models ran on a GEEKOM A9 Max with an AMD Ryzen AI 9 HX 370 and 96 GB RAM.
Only ChatGPT 5 and Claude 4.5 were run remotely in the cloud.


Results: Time to Complete

| LLM | Time (s) |
| --- | --- |
| Claude 4.5 | 10.95 |
| ChatGPT 5 | 20.99 |
| qwen3-coder-30b-a3b-instruct_pruned_reap-15b-a3b | 22.240 |
| ijohn07/deepseek-coder-v2-lite-instruct | 24.643 |
| qwen3-coder-reap-25b-a3b_moe | 25.195 |
| lmstudio-community/deepseek-coder-v2-lite-instruct | 26.468 |
| qwen3-coder-30b-a3b-instruct-q2ks-mixed-autoround | 27.388 |
| unsloth/qwen3-coder-30b-a3b-instruct | 27.827 |
| qwen/qwen3-coder-30b | 29.312 |
| Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf | 41.657 |
| Qwen2.5-Coder-14B-Instruct-Q4_K_M.gguf | 85.833 |
| strand-rust-coder-14b-v1 | 104.348 |

Claude 4.5 was fastest overall. Most local models completed in the 20–30 second range — which is entirely practical for code validation.
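
For context, the numbers above are plain wall-clock measurements. A minimal way to time a single run looks like this; the workload below is a stand-in for the actual validation call:

import time

def timed(label, fn, *args, **kwargs):
    """Print wall-clock seconds for a single run of fn."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f} s")
    return result

# Stand-in workload; in the real runs, fn would be the chat-completion
# call that returns the NDJSON validation output.
timed("dummy run", lambda: sum(i * i for i in range(1_000_000)))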


Power and Environmental Footprint

On the Ryzen AI 9 HX 370 (28–54 W under load), a 25-second run uses about 0.0004 kWh — roughly what a small LED bulb consumes in a few minutes. Even with overhead, it’s under one watt-hour per run.

Cloud inference, by contrast, runs across data-center infrastructure. A single request can use 0.01–0.05 kWh, depending on model size and efficiency.

Quick CO₂ Compare (per run)

Assumptions: local ≈ 0.0004 kWh; cloud ≈ 0.01–0.05 kWh; emission factor 0.4 kg CO₂e/kWh.

| Scenario | Energy (kWh) | CO₂e (g) |
| --- | --- | --- |
| Local (Ryzen AI 9 HX 370, ~25 s) | 0.0004 | 0.16 |
| Cloud (small request, low end) | 0.0100 | 4 |
| Cloud (small request, high end) | 0.0500 | 20 |

Takeaway: a local run emits tenths of a gram; a cloud run, a few to twenty grams. Both drop if powered by renewables, but the scale difference remains.
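
The table follows directly from the stated assumptions; as a quick sanity check in Python:

# Reproduce the per-run CO2e figures from the stated assumptions.
EMISSION_FACTOR = 0.4  # kg CO2e per kWh (assumed grid average)

scenarios = {
    "Local (Ryzen AI 9 HX 370, ~25 s)": 0.0004,  # ~54 W x 25 s is roughly 0.0004 kWh
    "Cloud (small request, low end)": 0.0100,
    "Cloud (small request, high end)": 0.0500,
}

for name, kwh in scenarios.items():
    grams = kwh * EMISSION_FACTOR * 1000  # kg CO2e -> g CO2e
    print(f"{name}: {kwh:.4f} kWh -> {grams:g} g CO2e")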


Why I’m Testing Local vs. Cloud

This project looks beyond speed. It’s about understanding how running LLMs locally compares in three dimensions:

  • Speed: how long the validation takes in real time
  • Isolation: whether it can run without depending on a network
  • Environmental footprint: how much power (and therefore carbon) it uses per run

The difference isn’t just academic — it’s practical. Local inference gives control and transparency; cloud inference gives convenience and scale.


“Good Enough” Beats “Fastest”

Claude 4.5 finished in about eleven seconds. The Qwen and DeepSeek models finished in the 20–30 second range.
For this job — checking a handful of AST findings — that’s good enough.

The difference between eleven seconds and twenty-five seconds isn’t what matters.
What matters is that the tool finishes before focus breaks, runs locally if needed, and doesn’t waste unnecessary power doing it.

Benchmarks are interesting, but sustainability and practicality matter more.
For this kind of structured reasoning, “good enough” is both fast and sustainable — and that’s the sweet spot I’m after.


Next Steps

The next step is to refine what the model sees.
Right now, each validation run includes the entire file, even if the AST only flagged a few lines. The next version will instead send contextual chunks — just the code surrounding each reported issue.

That change will test how much time (and energy) can be saved by trimming unnecessary input.
Less data means fewer tokens to process, which should mean faster runs — and less power use per validation.
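
A rough sketch of that chunking, with the window size and function name as placeholders of my own:

from pathlib import Path

def context_chunk(path, line, radius=10):
    """Return only the lines around a flagged finding (line is 1-based)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    start = max(0, line - 1 - radius)
    end = min(len(lines), line + radius)
    numbered = [f"{n + 1}: {text}" for n, text in enumerate(lines[start:end], start=start)]
    return "\n".join(numbered)

# e.g. send only ~20 lines around the finding at runtime_analysis.py:46
print(context_chunk("runtime_analysis.py", line=46))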

If the results hold, this could make LLM-backed static analysis both leaner and greener, closing the loop between accuracy, performance, and sustainability.



Connecting Back to the Sustainable IT Manifesto Foundation

This whole project — the AST scans, the LLM validation layer, the efficiency testing, and the timing and power measurements — fits neatly into the broader mission of the Sustainable IT Manifesto Foundation and Making Software Greener.

The Foundation’s work is built on a simple progression:

Aware → Conscious → Enabled → Empowered

This experiment touches all four:

Aware

Understanding where inefficiencies hide in source code — even the small ones — is the first step. AST tools raise awareness by surfacing everything they can find, even if they’re a bit noisy.

Conscious

Adding an LLM to validate AST findings makes the process more intentional. Instead of drowning in false positives, we start to see the real patterns — the ones that matter.

Enabled

Once the noise is filtered out, teams can actually act on the findings. They get clear, trustworthy signals about which inefficiencies to fix and why. NDJSON isn’t glamorous, but it makes the results easy to plug into CI, dashboards, or developer workflows.

Empowered

Over time, this pipeline becomes more than a tool — it becomes a capability. Teams can catch inefficient patterns earlier, reduce waste in their systems, and improve their software’s environmental footprint in a practical, day-to-day way.

This is the heart of sustainable IT:
small, concrete steps that add up to lasting change.

By combining AST analysis, LLM reasoning, and lightweight power measurements, we’re not just making code cleaner — we’re building habits and systems that help people write greener software without slowing them down.

And as the Software Inefficiency Catalog rolls out, this kind of validation pipeline becomes a natural extension of the Foundation’s mission: making sustainability accessible, actionable, and part of the everyday workflow.
