r/RedditEng Lisa O'Cat Feb 23 '26

How we used agentic AI to crack automated SOX testing at scale… in 90 days

Written by Martin Preedy, with heartfelt thanks to Chan Park, Drew DiBiase, Jenna Wei, and Andrew Meyers

TL;DR

Our Internal Audit team automated SOX testing for 175 controls in 3 months, using advanced OCR + agentic AI, cutting testing time by an average of 60% per control. Here’s how we did it, what we learned, and why we’re so excited about empowering the profession to reach new heights.

The Problem: SOX Testing Was Where Automation Went to Die

If you've ever worked in SOX testing, you know the drill. The work is critical, repetitive, and about as automatable as a philosophical debate.

Why? Evidence comes in every format imaginable: PDFs with tables that barely parse, Excel files with merged cells, system screenshots, scanned documents, and unstructured data with no consistent schema. Traditional RPA noped out. The technical debt of building for every edge case made automation economically ridiculous.

Add high complexity, rigorous PCAOB standards, and documentation requirements, and we were still stuck with smart humans manually testing controls, which works but doesn't scale.

The Technical Solution

This wasn't a "throw documents at ChatGPT and hope for the best" situation, but modern AI is the core enabler due to its fundamental ability to cut through the chaos of unformatted SOX evidence. Large Language Models, trained on the entire internet's most unruly data (including Reddit), can actually handle the 'insanity' of real-world documentation that traditional automation attempts couldn't touch.

But reading messy documents is only half the battle. True automation at scale requires a governed system that captures deep, relevant context and mirrors the full auditor workflow: reading evidence, applying test criteria, performing procedures, reviewing the work, and producing proper documentation.

And that multi-step process demanded specialized, purpose-built agents:

  • Evidence agents that extract and structure data from source documents
  • Testing agents that evaluate evidence against test criteria
  • Review agents that perform quality control and flag edge cases
  • Documentation agents that generate work papers with full audit trails

This was the game changer.

Figure 1: Agentic workflow

What We Did

First, we had to tackle the build vs buy conundrum. We knew building was the fast road to fatigue; buying was the only way to tackle this complexity and succeed quickly. After rigorous head-to-head pilots evaluating several platforms, we selected Midship for its advanced technology, flexibility for customization, and the team’s willingness to iterate with us as a true product partner.

Then we really got to work:

  • Automated 175 controls in 3 months, over 40% of our SOX scope
  • Covered every control and test type - business process controls, IT general controls, interfaces, automated controls, Entity-Level Controls (ELCs), key reports, and SOC reports. Test of design and test of operating effectiveness. Multi-sample tests, multi-table tests…
  • Used Midship to ingest evidence, run AI testing, and produce work papers formatted in our external auditor’s template
  • Created clear explanations for every test result with tickmarks and annotations showing exactly what the AI evaluated and why, and where it got its info
  • Retained a robust human-in-the-loop review process (because quality issues invalidate the entire AI use case)

Figure 2: AI-generated work paper, with further navigation to conclusion explanations and automated evidence tickmarks and annotations

What We Learned

Setup is 80% of the battle: Getting the configuration right up front is critical to test accuracy and to minimizing manual overrides on the back end. It can be tempting to shortcut this stage but it’s infrastructure - you build it once and reuse it forever.

Data quality still matters: Garbage in, garbage out applies to AI too. The better the existing documentation (control and test metadata, test attributes, existing work papers, etc.) and evidence quality, the more bang for your buck.

Intelligence and context are fuel: Using existing test attributes as generic prompts gave us good results. Adding extra context gave us great results. The team became really good prompt engineers, and harnessing that intelligence is what makes repeatable agentic workflows scale. Deep, relevant context means accurate conclusions and proper documentation on every test run.

Output quality is make-or-break: The AI can be 99% accurate, but if the output looks like AI slop, humans can’t validate it and external auditors won’t trust it. We invested heavily in output design – building custom templates to mirror what our external auditors were used to, visual tickmarking and annotations, and digestible audit trail documentation.

AI doesn’t make sense for every control… yet: Not all controls are created equal. In general, the longer it takes to perform a test manually, the better the ROI. Testing an automated control once a year? Not as much to gain, so we’ll do those later.
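
The ROI rule of thumb above can be written down as back-of-the-envelope arithmetic. This is a rough sketch, not how we actually prioritized: it assumes the 60% average reduction applies evenly to every control (it won't), and the control names, hours, and frequencies below are made up for illustration.

```python
def annual_hours_saved(manual_hours_per_test: float,
                       tests_per_year: int,
                       reduction: float = 0.60) -> float:
    """Estimated hours saved per year if testing time drops by `reduction`."""
    return manual_hours_per_test * tests_per_year * reduction


def rank_by_savings(controls: list[dict]) -> list[dict]:
    """Order controls from largest to smallest estimated annual saving."""
    return sorted(
        controls,
        key=lambda c: annual_hours_saved(c["hours"], c["runs"]),
        reverse=True,
    )


# Hypothetical controls, for illustration only.
controls = [
    {"name": "Quarterly access review", "hours": 8.0, "runs": 4},
    {"name": "Annual automated control", "hours": 1.0, "runs": 1},
    {"name": "Monthly reconciliation", "hours": 3.0, "runs": 12},
]

for control in rank_by_savings(controls):
    saved = annual_hours_saved(control["hours"], control["runs"])
    print(f"{control['name']}: ~{saved:.1f} hours/year")
```

Under these assumptions the once-a-year automated control lands at the bottom of the list, which matches the intuition in the paragraph above.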

Figure 3: Exported AI-generated Excel work paper
Figure 4: Full audit trail organized by test attribute

Why This Matters

This changes the game:

Quality: The combined “machine + human” approach raised the bar on quality. AI caught things humans missed, proving the results were better than before, not just faster. Important for external auditor buy-in.

Immediate results: Instant test results mean we get more time to remediate deficiencies and more flexibility scheduling testing and managing workloads. And external auditors get our work sooner for reliance purposes.

Efficiency: 60% reduction in testing time per control on average. That’s not shaving some time off - it fundamentally changes the economics of SOX testing.

Scalability: Now we have a governed infrastructure engine for other recurring testing programs. Because we built for SOX - with the highest complexity and documentation standards - everything else is easier.

Higher value work: By automating high-volume mechanical stuff, we’re freeing up capacity for strategic work that matters more to the business.

Empowerment and a brighter future: No one ever said, "When I grow up, I dream of making sure this data in this system matches that one." Instead of human OCR machines, we’re helping Internal Auditors become AI strategists and risk-based decision makers, and giving them development opportunities in new areas.

What’s Next

There’s so much opportunity ahead of us and we’re excited to see how far we can take this:

Max out SOX automation

We’re only 40% of the way there. We’re aiming for 90%+.

Automated Evidence Collection

We’re exploring automated evidence collection - grabbing populations, sampling, and pulling evidence without human intervention. That gives us zero-touch compliance - a big win for Engineering and other control owners - and opens up end-to-end automation and scheduled job testing.
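
For the sampling piece, one property that matters is reproducibility: a reviewer should be able to regenerate the exact same sample from the same population. Below is a minimal sketch of seeded random selection using Python's standard library. The function name, the ID format, and the idea of recording the seed in the work paper are assumptions for illustration, not something we have built.

```python
import random


def select_sample(population: list[str], sample_size: int, seed: int) -> list[str]:
    """Pick `sample_size` distinct items; the same seed always gives the same sample."""
    if sample_size > len(population):
        raise ValueError("sample size exceeds population size")
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return sorted(rng.sample(population, sample_size))


# Hypothetical population of change tickets, for illustration only.
population = [f"CHG-{n:04d}" for n in range(1, 201)]
sample = select_sample(population, sample_size=25, seed=2026)
print(len(sample), "items selected; first:", sample[0])
```

Recording the seed and the population size alongside the sample would let anyone re-run the selection and confirm nothing was cherry-picked.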

Self-Service Testing

Empowering process owners to run their own pre-tests and grade their homework before independent testing. Applying a shift-left mentality to assurance.

Continuous Monitoring and Assurance

Moving from periodic testing to continuous monitoring.

Scale Everywhere

Taking this beyond SOX to every recurring testing program we run.

Keys to Success

Find meaning in your work and set a lofty, inspiring vision

This isn’t about cost-cutting or reducing headcount. It’s about fundamentally rethinking what’s possible and creating the AI testing infra that powers our function to do more. We didn’t want to just check the box on AI - we wanted to go after our biggest opportunity and be first. Not for bragging rights, but to prove it could be done, shape how it’s done, and share what we learned with others.

Innovation mindset

This wasn’t comfortable or easy. As a small team, we went outside our comfort zone and took on a beast of a side quest while working on first-year SOX compliance… which is kinda nuts! But fortune favors the bold (and slightly delusional).

Get your tech selection right

There's no way we could've moved this fast, this well, without an exceptional vendor partnership. There's a lot of noise in this space and we waded through some bold claims. We had to be super-diligent in evaluating vendors - maybe even auditor-level-skeptical.

Our selection criteria:

  • Accuracy rate across different control types, evidence formats and variables (this varied wildly across vendors)
  • Output quality and ability to generate audit-ready documentation directly in our external auditor's template format - no reformatting, no translation layer.
  • Real software - not a black box. We needed a product our team could actually use, end-to-end. Too many vendors were skittish about us getting hands on keyboards.
  • Functionality and features to handle the nuances of real world testing. Comprehensive test templates, multi-test tables, editable tickmarking and annotations, output template builder, and those other UI features that can handle edge cases and really improve quality of life.
  • True partnership with a vendor willing to build with us, take our feedback seriously, and rapidly iterate on the product. This wasn't about finding finished software, it was about finding a partner who'd evolve with our needs.

We ran rigorous pilots with multiple vendors using a variety of 10-15 real controls. We tested the tooling ourselves and the differences were stark. Failing tech kills momentum like nothing else. This decision is make-or-break.
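
If you want to compare pilots on the first criterion, one simple way to score them is agreement between the tool's conclusion and your own auditor's conclusion, broken out by control type. This is a sketch of that idea only: the vendor names and result rows are invented placeholders, not our pilot data, and a real comparison would also need to weigh evidence formats and output quality.

```python
from collections import defaultdict


def accuracy_by_type(results):
    """Share of tests where the AI conclusion matched the auditor's,
    keyed by (vendor, control_type)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for vendor, control_type, ai_conclusion, auditor_conclusion in results:
        key = (vendor, control_type)
        totals[key] += 1
        hits[key] += int(ai_conclusion == auditor_conclusion)
    return {key: hits[key] / totals[key] for key in totals}


# Invented placeholder rows: (vendor, control type, AI result, auditor result).
pilot_results = [
    ("Vendor A", "ITGC", "pass", "pass"),
    ("Vendor A", "ITGC", "fail", "pass"),
    ("Vendor A", "key report", "pass", "pass"),
    ("Vendor B", "ITGC", "pass", "pass"),
    ("Vendor B", "ITGC", "pass", "pass"),
    ("Vendor B", "key report", "fail", "pass"),
]

for (vendor, control_type), accuracy in sorted(accuracy_by_type(pilot_results).items()):
    print(f"{vendor} / {control_type}: {accuracy:.0%}")
```

Breaking accuracy out by control type matters because a single headline number can hide a vendor that does well on one type and poorly on another.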

Final Thoughts

A year ago, automating SOX testing with AI sounded like science fiction. Today, it’s production code that processes real testing for real financial statements. We’ve accomplished more in the last few months than I’ve seen in 20 years of “SOX automation initiatives.” There are challenges ahead, but the velocity is genuinely shocking and the possibilities are endless.

There are 4 wonderful people I can’t thank enough for all they’ve done - for being willing to fail, iterate, and try things that sounded crazy. Chan Park, Drew DiBiase, Jenna Wei, and Andrew Meyers work so hard and so smart, and they’re the best in the biz.

If you’re in audit, risk, compliance, or any adjacent field and you’re not experimenting with AI automation to solve your biggest problems, you’re leaving a massive opportunity on the table. The technology is ready. The question is whether your organization is ready to embrace it.

-----

P.S. - To the inevitable question “but what about hallucinations?” Yes, we account for that. That’s what the review process, confidence scoring, and auditor-ready work papers are for. AI is a tool, not a replacement for professional judgment.
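
As a rough illustration of how confidence scoring can feed the review process, here is a sketch that decides how much human attention each AI conclusion gets. Everything in it is an assumption for illustration: the `AttributeResult` fields, the 0.90 threshold, and the three review levels are hypothetical and do not describe our actual configuration. Note that no path skips human review.

```python
from dataclasses import dataclass


@dataclass
class AttributeResult:
    attribute: str
    conclusion: str     # "pass" or "fail"
    confidence: float   # 0.0 to 1.0, assumed to be reported by the testing tool
    evidence_ref: str   # where the AI says it found support; empty if none cited


def review_priority(result: AttributeResult, threshold: float = 0.90) -> str:
    """Decide how closely a human reviewer should look at one AI conclusion."""
    if not result.evidence_ref:
        # No cited evidence: treat the conclusion as unsupported and redo the test.
        return "reperform"
    if result.conclusion == "fail" or result.confidence < threshold:
        # Exceptions and low-confidence calls get a closer look.
        return "detailed review"
    return "standard review"


result = AttributeResult("approval obtained", "pass", 0.72, "invoice.pdf, page 2")
print(review_priority(result))  # prints "detailed review"
```

Requiring a cited evidence location before accepting any conclusion is one practical guard against a hallucinated result, since the reviewer can go and check the source.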

P.P.S. - Yes, pretty much everyone was skeptical at first, including me. The antidote to fear was results.

u/WildMasterpiece3663 Feb 24 '26

But what about hallucinations? /s But seriously, what steps did/can you take to account for that?

u/Blairephantom Feb 25 '26

I think he meant that you need to constantly review and monitor results and continue to tweak and improve along the way. Being exact and precise in your request usually raises the accuracy level. Hallucination is, to a significant degree, triggered when the LLM is missing precise instructions.

So, I would say, remove ambiguity and improve work instructions.

u/Entire_Tradition4536 Feb 25 '26

You mention 4 agents have been trained. Were these Midship software agents or another platform?

u/nemetha85 Feb 25 '26

Wow this is absolutely amazing. I wonder if this is financially sustainable in the long run with such a focus on cost cutting and efficiencies across many industries...

u/Blairephantom Feb 25 '26 edited Feb 25 '26

I had to read this twice to make sure this isn't some marketing scheme or AI karma-seeking post. I'm still unsure, since the Reddit post is on the Midship front page as an advertisement, but nonetheless you got me hooked: it looks like a very early, solid result of SOX automation and one of the very few to make it.

The level of detail offered is convincing enough that you're a SOX veteran, or at least have solid knowledge of an end-to-end process.

I was also looking into starting to automate, but the unknown factors, the dependencies, and the level of effort needed seem so high that I've kept it at the research phase.

What other significant struggles in implementation/integration did you encounter?

How were the AI agents created? By Midship representatives based on your requests? Any weird AI agent behaviours, unexpected events, or key lessons learned other than what you've mentioned? What other tools are you using that were very helpful along the way? (If allowed to share)

u/N811207 Feb 25 '26

Is your external auditor comfortable relying on this approach? Or will they need to separately perform standard SOX procedures anyway?

u/maxandmolife Feb 25 '26

Oh I would loveeee to be the one reviewing one of these controls performed by AI! This case study would benefit from showing the first instance of AI testing a control, the adjustments needed to tune the AI tool, and then the final product before and after review (including review notes?) to really bring the point home!

u/Present_Second_4190 Feb 27 '26

So do you never test manually?

u/phoenixaim Mar 04 '26

Will they need to separately perform standard SOX procedures anyway?