Building a faster YARA engine in pure Go

by Sansec Forensics Team

Published in Threat Research, February 18, 2026

We built a pure Go YARA engine that's 6.8x faster for text-based scanning, with no C dependencies. It now processes over 57,000 scans per day in production, and we're open-sourcing it today.

eComscan scan times before and after Yargo

YARA is the industry standard for pattern matching in malware detection. Maintained by VirusTotal, it powers threat detection at nearly every security vendor. At Sansec, we rely on YARA for eComscan and our global threat monitor, scanning hundreds of thousands of stores daily.

But YARA was primarily designed for binary malware analysis, and wrapping its C library in Go was painful. We scan text files: PHP, JavaScript, HTML templates. So we built Yargo, a pure Go YARA engine optimized for source code.

How YARA works

A YARA rule defines string patterns and a condition that determines when the rule matches:

rule php_backdoor {
    strings:
        $assert = "assert"
        $serialize = "serialize"
        $session = "session"
    condition:
        all of them
}

Under the hood, YARA extracts short byte sequences ("atoms") from each pattern and loads them into an Aho-Corasick automaton, a state machine that can match thousands of patterns in a single pass over the input. When an atom matches, YARA verifies the full pattern at that position.

Aho-Corasick automaton for the 4-byte atoms asse, seri, and sess. Teal nodes are match states, dashed arrows are failure links.

The structure starts as a trie: a tree where each edge represents one byte, and paths from root to node spell out the patterns. To search, you walk the trie byte by byte through the input.

The failure links (dashed arrows) are what make it fast: when the next byte doesn't continue the current path, the automaton doesn't restart from the root. Instead, it jumps to the state for the longest suffix of what it has matched so far that is also a prefix of some pattern. The scan never moves backwards in the input, so each byte is read only once.

This two-phase approach (cheap pre-filter, expensive verification) is what makes YARA fast enough to scan thousands of files against thousands of rules.
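
To make the two phases concrete, here is a minimal Go sketch of an Aho-Corasick pre-filter followed by full-pattern verification. It is an illustration, not code from YARA or Yargo: the names (node, build, scan, matchPatterns) are ours, and it simply takes the first 4 bytes of each pattern as its atom, whereas a real engine picks the best-scoring atom.

```go
package main

import (
	"bytes"
	"fmt"
)

// node is one automaton state; edges are keyed by input byte.
type node struct {
	next map[byte]*node
	fail *node    // failure link: longest proper suffix that is also a trie prefix
	out  []string // atoms that end at this state
}

func newNode() *node { return &node{next: map[byte]*node{}} }

// build inserts every atom into a trie, then adds failure links breadth-first.
func build(atoms []string) *node {
	root := newNode()
	for _, a := range atoms {
		n := root
		for i := 0; i < len(a); i++ {
			if n.next[a[i]] == nil {
				n.next[a[i]] = newNode()
			}
			n = n.next[a[i]]
		}
		n.out = append(n.out, a)
	}
	var queue []*node
	for _, child := range root.next {
		child.fail = root
		queue = append(queue, child)
	}
	for len(queue) > 0 {
		n := queue[0]
		queue = queue[1:]
		for c, child := range n.next {
			f := n.fail
			for f != nil && f.next[c] == nil {
				f = f.fail
			}
			if f == nil {
				child.fail = root
			} else {
				child.fail = f.next[c]
			}
			// Inherit atoms that end at the failure target.
			child.out = append(child.out, child.fail.out...)
			queue = append(queue, child)
		}
	}
	return root
}

// hit records an atom and the offset just past its last byte.
type hit struct {
	atom string
	end  int
}

// scan walks data once, following failure links instead of rewinding the input.
func scan(root *node, data []byte) []hit {
	var hits []hit
	n := root
	for i, c := range data {
		for n != root && n.next[c] == nil {
			n = n.fail
		}
		if nx := n.next[c]; nx != nil {
			n = nx
		}
		for _, a := range n.out {
			hits = append(hits, hit{a, i + 1})
		}
	}
	return hits
}

// matchPatterns runs the cheap atom pre-filter, then verifies the full
// pattern at every atom hit. Assumption: the atom is the pattern's prefix.
func matchPatterns(data []byte, patterns []string) map[string]bool {
	byAtom := map[string][]string{}
	var atoms []string
	for _, p := range patterns {
		if len(p) == 0 {
			continue
		}
		n := 4
		if len(p) < n {
			n = len(p)
		}
		if _, seen := byAtom[p[:n]]; !seen {
			atoms = append(atoms, p[:n])
		}
		byAtom[p[:n]] = append(byAtom[p[:n]], p)
	}
	matched := map[string]bool{}
	for _, h := range scan(build(atoms), data) {
		start := h.end - len(h.atom)
		for _, p := range byAtom[h.atom] {
			if bytes.HasPrefix(data[start:], []byte(p)) {
				matched[p] = true
			}
		}
	}
	return matched
}

func main() {
	data := []byte(`<?php session_start(); $x = unserialize($_POST['a']); assess($x); assert($x);`)
	fmt.Println(len(scan(build([]string{"asse", "seri", "sess"}), data)), "atom hits")
	fmt.Println(len(matchPatterns(data, []string{"assert", "serialize", "session"})), "patterns verified")
}
```

The word "assess" in the sample triggers both the asse and sess atoms without containing any full pattern, which is exactly the kind of false atom hit the verification phase has to filter out.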

Low-hanging fruit

Before deciding to build a new engine, we first looked at optimizing our malware signature database.

We maintain a large list of burner domains (disposable domains used by attackers to host malware or exfiltrate sensitive data), currently over 12,000 entries. These were originally written using word boundary regexes (\b[domain]\b) instead of YARA's more efficient fullword modifier.

Switching to fullword string matches eliminated thousands of expensive regex verifications per scan. The performance improvement was significant, but with the signatures optimized, the bottleneck shifted to the engine itself.
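
The difference is easy to see in a small Go sketch. A fullword-style check is a literal search plus two byte comparisons at the edges, while the regex route has to run a regex engine for every candidate. This is an illustration only: the domain is hypothetical, the function names are ours, and Go's regexp package stands in for YARA's regex engine. Note also that \b treats underscore as a word character, whereas this check, like YARA's documented fullword behavior, only looks at alphanumerics.

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
)

// isAlnum reports whether b is an ASCII letter or digit.
func isAlnum(b byte) bool {
	return (b >= '0' && b <= '9') || (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z')
}

// containsFullword reports whether word occurs in data delimited by
// non-alphanumeric bytes (or the start/end of data).
func containsFullword(data, word []byte) bool {
	if len(word) == 0 {
		return false
	}
	for off := 0; ; {
		i := bytes.Index(data[off:], word)
		if i < 0 {
			return false
		}
		start := off + i
		end := start + len(word)
		leftOK := start == 0 || !isAlnum(data[start-1])
		rightOK := end == len(data) || !isAlnum(data[end])
		if leftOK && rightOK {
			return true
		}
		off = start + 1 // keep looking after a rejected occurrence
	}
}

// containsRegex is the word-boundary regex equivalent of the check above.
func containsRegex(data []byte, word string) bool {
	re := regexp.MustCompile(`\b` + regexp.QuoteMeta(word) + `\b`)
	return re.Match(data)
}

func main() {
	domain := "evil-cdn.example" // hypothetical burner domain
	page := []byte(`<script src="https://evil-cdn.example/x.js"></script>`)
	fmt.Println(containsFullword(page, []byte(domain)), containsRegex(page, domain))
}
```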

Outgrowing go-yara

We use Go for most of our projects, so for years we relied on go-yara, the Go bindings for the libyara C library. Two issues drove us to build an alternative.

First, cgo. It requires a C compiler, pkg-config, and a pre-installed libyara. Cross-compilation is painful enough that the go-yara docs dedicate a separate guide to it. Linking through cgo also makes fully static binaries hard to produce, which undercuts one of Go's biggest advantages. We even maintained an ancient build server just to stay compatible with merchants running older kernels.

Second, YARA's internals are optimized for binary malware analysis, not source code scanning.

Introducing Yargo

Yargo is a pure Go implementation of the YARA features we actually need. It follows the same architecture:

  1. Parse: goyacc-based LALR(1) parser turns YARA rules into an AST
  2. Compile: extract atoms from patterns, build the AC automaton
  3. Scan: AC pre-filter, regex verification, condition evaluation

The core improvements are all about making Aho-Corasick work better for text files.

Full string literals in the automaton

YARA truncates every atom to 4 bytes. A 19-byte pattern like eval(base64_decode( gets reduced to a single 4-byte substring, whichever scores highest. But any 4-byte sequence will match far more often than the full string.

Yargo puts entire string literals into the AC automaton, so that same pattern enters as all 19 bytes and only matches when the actual obfuscated code is present. This comes at the cost of a larger automaton, but the extra memory usage is on the order of megabytes.
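
A rough way to see the effect is to count how often each automaton entry would fire on ordinary PHP. In the sketch below, the atom "deco" is a hypothetical choice (we don't claim it is the atom YARA would actually select), and the sample string is made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// candidates counts how many times an automaton entry occurs in data.
// Each occurrence would trigger one full verification in the scanner.
func candidates(data, entry string) int {
	return strings.Count(data, entry)
}

func main() {
	// Made-up sample: three harmless decode calls and one suspicious eval.
	sample := "$a = base64_decode($x); $b = json_decode($y); " +
		"$c = urldecode($z); eval(base64_decode($p));"

	fmt.Println("4-byte atom:  ", candidates(sample, "deco"))
	fmt.Println("full literal: ", candidates(sample, "eval(base64_decode("))
}
```

With the short atom, every harmless decode call costs a verification. With the full literal in the automaton, only the real occurrence does.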

Smarter regex atoms

YARA can use regex atoms as short as 1 byte. Depending on the rules and the input, this can lead to a lot of unnecessary verifications.

Yargo requires a minimum atom length of 3 bytes. With 256^3 (about 16.8 million) possible values, the chance of a false atom match drops dramatically. We had to adjust a small number of signatures to accommodate this, but the performance gain was well worth it.
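
The arithmetic behind that claim can be sketched in a few lines of Go. The model assumes uniformly random bytes, which source code is not, so the numbers only indicate the order of magnitude. The function names are ours.

```go
package main

import "fmt"

// atomSpace returns the number of distinct n-byte atoms (256^n), for n <= 7.
func atomSpace(n int) uint64 {
	return uint64(1) << (8 * uint(n))
}

// expectedChanceHits estimates how many positions in a file of size bytes
// match one fixed n-byte atom by chance, assuming uniformly random bytes.
func expectedChanceHits(size, n int) float64 {
	if size < n {
		return 0
	}
	return float64(size-n+1) / float64(atomSpace(n))
}

func main() {
	const size = 1 << 20 // a 1 MiB file
	for n := 1; n <= 4; n++ {
		fmt.Printf("atom length %d: %d values, %.4f chance hits per MiB\n",
			n, atomSpace(n), expectedChanceHits(size, n))
	}
}
```

Under this model a 1-byte atom fires by chance thousands of times per megabyte, while a 3-byte atom fires well under once.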

Scoring atoms for source code

YARA scores atom quality generically: common bytes like 0x00 and 0x20 get penalized.

Yargo's quality function is tuned for web source code:

  • Common PHP/JS tokens (return, function, var, ();) are banned from atom selection entirely
  • Alphabetic bytes (very common in source code) score lower than non-ASCII bytes
  • The heuristic picks atoms that discriminate well in the kind of files we actually scan
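
As a rough sketch of what such a quality function could look like, the Go code below slides a fixed-size window over a literal, skips windows that overlap a banned token, and scores the rest byte by byte. The weights, the banned-token rule (a window is rejected if it contains a banned token or is contained in one), and all names are our assumptions for illustration, not Yargo's actual heuristic.

```go
package main

import (
	"fmt"
	"strings"
)

// banned lists tokens too common in PHP/JS to be useful as atoms.
var banned = []string{"return", "function", "var", "();"}

// isBanned rejects a window that contains a banned token or is part of one.
func isBanned(w string) bool {
	for _, t := range banned {
		if strings.Contains(t, w) || strings.Contains(w, t) {
			return true
		}
	}
	return false
}

// byteScore uses hypothetical weights: rarer byte classes score higher.
func byteScore(b byte) int {
	switch {
	case b == 0x00 || b == 0x20:
		return 0 // NUL and space carry no information
	case b >= 0x80:
		return 4 // non-ASCII is rare in source code
	case (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z'):
		return 1 // letters are everywhere
	case b >= '0' && b <= '9':
		return 2
	default:
		return 3 // ASCII punctuation and control bytes
	}
}

// bestAtom returns the highest-scoring n-byte window of literal.
// On ties the earliest window wins. ok is false if every window is banned.
func bestAtom(literal string, n int) (atom string, ok bool) {
	bestScore := -1
	for i := 0; n > 0 && i+n <= len(literal); i++ {
		w := literal[i : i+n]
		if isBanned(w) {
			continue
		}
		score := 0
		for j := 0; j < n; j++ {
			score += byteScore(w[j])
		}
		if score > bestScore {
			atom, bestScore = w, score
		}
	}
	return atom, bestScore >= 0
}

func main() {
	atom, ok := bestAtom("return eval($_POST", 4)
	fmt.Printf("%q %v\n", atom, ok)
}
```

With these weights the selected window lands on the punctuation-heavy part of the literal rather than on "retu" or "eval", which is the behavior the list above describes.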

Real-world byte frequencies

The AC pre-filter uses byte frequency data generated from real-world data. This means the scanner knows which bytes are actually rare in ecommerce codebases, enabling more effective skip-ahead during scanning.
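
One common way to exploit byte frequencies, sketched below in Go, is to search for the rarest byte of a pattern first and only compare the rest of the pattern around it. We don't know that Yargo's pre-filter works this way. The functions byteFreq, rarestIndex, and findAll are hypothetical, and the tiny corpus stands in for real frequency data.

```go
package main

import (
	"bytes"
	"fmt"
)

// byteFreq counts how often each byte value occurs in a corpus.
func byteFreq(corpus [][]byte) [256]int {
	var freq [256]int
	for _, doc := range corpus {
		for _, b := range doc {
			freq[b]++
		}
	}
	return freq
}

// rarestIndex returns the position of the atom byte that is least
// frequent in the corpus (the first such position on ties).
func rarestIndex(atom []byte, freq *[256]int) int {
	best := 0
	for i, b := range atom {
		if freq[b] < freq[atom[best]] {
			best = i
		}
	}
	return best
}

// findAll returns the start offsets of atom in data. It jumps from one
// occurrence of the rarest byte to the next and compares only there.
func findAll(data, atom []byte, freq *[256]int) []int {
	if len(atom) == 0 {
		return nil
	}
	r := rarestIndex(atom, freq)
	var hits []int
	for pos := r; pos < len(data); pos++ {
		i := bytes.IndexByte(data[pos:], atom[r])
		if i < 0 {
			break
		}
		pos += i
		start := pos - r
		if start+len(atom) <= len(data) && bytes.Equal(data[start:start+len(atom)], atom) {
			hits = append(hits, start)
		}
	}
	return hits
}

func main() {
	freq := byteFreq([][]byte{[]byte("function foo() { return bar; }")})
	fmt.Println(findAll([]byte("x = eval($y); eval($z);"), []byte("eval("), &freq))
}
```

The rarer the chosen byte is in real files, the larger the jumps between candidate positions, which is why frequency data from representative code matters.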

Performance

eComscan runs over 57,000 scans per day. Yargo has been processing all of them since early February with zero issues.

The signature optimization alone cut median scan times nearly in half. Deploying Yargo then made scans another 6.8x faster: average scan time dropped from 12.5 minutes to under 2 minutes, and the median scan now completes in under 1 minute.

In its first two weeks of production, Yargo has saved over 116,000 CPU-hours compared to the old engine, equivalent to over 13 years of continuous computation.

Future work

Yargo currently implements the subset of YARA that we need for our use cases. We're considering making it more backward compatible with YARA, so it can serve as a drop-in replacement for go-yara in other projects.

Yargo is available at github.com/sansecio/yargo under the MIT license.
