DLP Test-File Generation

The airs runtime dlp generate command builds DLP test corpora: clean carrier files plus "dirty" copies with synthetic sensitive data embedded using several hiding techniques. Use it to measure how well a content scanner detects sensitive data across file formats and channels.

Synthetic data only

Every embedded value comes from a reserved / documented test range (reserved-range SSNs, Luhn-valid test PANs, example.com emails, 555-01xx phones, AWS …EXAMPLE keys). No real PII is ever produced.
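As a sanity check on the "Luhn-valid test PANs" claim, a scanner harness can verify card-like values itself. The following is a minimal sketch of the standard Luhn checksum. The function name `luhn_valid` is hypothetical, and the number in the usage note is the widely published 4111 1111 1111 1111 test card number, used for illustration only and not taken from this tool's output.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum.

    Non-digit characters (spaces, dashes) are ignored.
    """
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0
```

For example, `luhn_valid("4111 1111 1111 1111")` passes the check, while changing the final digit to `2` makes it fail.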

Usage

airs runtime dlp generate [options]

| Option | Default | Meaning |
| --- | --- | --- |
| `--types <list>` | `all` | Comma-separated list of `pdf,png,jpeg,svg,docx` (or `all`) |
| `--count <n>` | `1` | Clean files per type |
| `--out <dir>` | `./temp` | Output base directory |
| `--techniques <list>` | `all` | `all` or comma-separated list of technique ids |
| `--seed <n>` | random | Seed for reproducible payloads |
| `--output <fmt>` | `pretty` | `pretty` or `json` summary |

Auth: none — purely local file generation.
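When the generator is driven from a test harness, it can help to assemble the command line in one place. The sketch below builds an argument list from the options in the table above. The function name `build_generate_argv` and its parameter names are hypothetical, and only the documented flags are used. Passing every flag explicitly, including those left at their defaults, is a choice made here for clarity.

```python
from typing import List, Optional, Sequence


def build_generate_argv(
    types: Sequence[str] = ("all",),
    count: int = 1,
    out: str = "./temp",
    techniques: Sequence[str] = ("all",),
    seed: Optional[int] = None,
    output: str = "pretty",
) -> List[str]:
    """Build an argv list for `airs runtime dlp generate` (documented flags only)."""
    if count < 1:
        raise ValueError("count must be at least 1")
    if output not in ("pretty", "json"):
        raise ValueError("output must be 'pretty' or 'json'")
    argv = [
        "airs", "runtime", "dlp", "generate",
        "--types", ",".join(types),
        "--count", str(count),
        "--out", out,
        "--techniques", ",".join(techniques),
        "--output", output,
    ]
    # Omit --seed to keep the documented default (a random seed).
    if seed is not None:
        argv += ["--seed", str(seed)]
    return argv
```

Assuming `airs` is on the `PATH`, the result can be passed to `subprocess.run(argv, check=True)`.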

Output

<out>/
  clean/<type>/<base>.<ext>                 # benign carriers (true-negative controls)
  dirty/<type>/<base>__<technique>.<ext>    # one per (clean file × technique)
  manifest.json                             # dirty file -> technique + embedded values
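The layout above can also be read back from a path alone, which is handy for pairing each dirty file with its clean control. This is a minimal sketch under two assumptions that the layout does not state: technique ids never contain a double underscore, and `<base>` may. The names `CorpusFile` and `classify` are hypothetical.

```python
from pathlib import Path
from typing import NamedTuple, Optional


class CorpusFile(NamedTuple):
    kind: str                 # "clean" or "dirty"
    file_type: str            # e.g. "png"
    base: str                 # carrier base name
    technique: Optional[str]  # None for clean files


def classify(out_dir: str, path: str) -> CorpusFile:
    """Classify a corpus file from its location under the output directory."""
    relative = Path(path).relative_to(out_dir)
    # Expect exactly <kind>/<type>/<file>; anything else raises ValueError.
    kind, file_type, name = relative.parts
    stem = Path(name).stem
    if kind == "clean":
        return CorpusFile(kind, file_type, stem, None)
    if kind == "dirty":
        # Split on the LAST "__" so a base name containing "__" survives.
        base, separator, technique = stem.rpartition("__")
        if not separator:
            raise ValueError(f"dirty file has no technique suffix: {path}")
        return CorpusFile(kind, file_type, base, technique)
    raise ValueError(f"unexpected top-level directory: {kind}")
```

With a hypothetical base name `sample-1`, `classify("temp", "temp/dirty/png/sample-1__stego-lsb.png")` yields the technique `stego-lsb`, and the matching control is `temp/clean/png/sample-1.png`.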

Use manifest.json to score scanner hits/misses: it lists, per dirty file, the technique and the exact synthetic values embedded.
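A minimal scoring sketch follows. The exact schema of manifest.json is not specified here, so the code assumes each entry is a mapping with hypothetical keys `path`, `technique`, and `values`; adjust these to the real file. The scanner results are likewise assumed to be a mapping from file path to the set of strings the scanner reported.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping, Tuple


def score_by_technique(
    entries: Iterable[Mapping[str, object]],
    findings: Mapping[str, Iterable[str]],
) -> Dict[str, Tuple[int, int]]:
    """Return {technique: (values_detected, values_embedded)}.

    `entries` uses the ASSUMED keys "path", "technique", "values".
    `findings` maps a file path to the strings the scanner reported.
    """
    totals: Dict[str, List[int]] = defaultdict(lambda: [0, 0])
    for entry in entries:
        expected = set(entry["values"])
        reported = set(findings.get(entry["path"], ()))
        hits = expected & reported  # embedded values the scanner found
        totals[entry["technique"]][0] += len(hits)
        totals[entry["technique"]][1] += len(expected)
    return {tech: (hit, total) for tech, (hit, total) in totals.items()}
```

Dividing the first number by the second gives a per-technique detection rate. Files under clean/ can be scored separately for false positives, since any finding there is a false alarm.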

Techniques

| Format | Technique ids |
| --- | --- |
| PDF | `meta`, `hidden-text`, `trailer`, `visible`, `visible-samecolor` |
| PNG | `text-chunks`, `trailer`, `stego-lsb`, `visible` |
| JPEG | `exif`, `com`, `trailer`, `visible` |
| SVG | `meta`, `hidden-text`, `comment`, `visible` |
| DOCX | `core-props`, `hidden-run`, `visible`, `visible-samecolor` |

`visible`: rendered text with foreground ≠ background (genuinely visible and OCR-able).

`visible-samecolor` (PDF and DOCX only): rendered body text drawn in the same color as its background. The text is present and extractable, but camouflaged from the eye.
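Because each clean file yields one dirty file per technique, the technique lists also determine how large a corpus will be. The sketch below encodes the per-format technique ids as a dictionary and computes the expected file counts. The names `TECHNIQUES` and `expected_counts` are hypothetical, and the sketch assumes that a requested technique is skipped for formats that do not list it; the tool's actual handling of that case is not documented here.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Technique ids per format, as listed in the Techniques section.
TECHNIQUES: Dict[str, List[str]] = {
    "pdf": ["meta", "hidden-text", "trailer", "visible", "visible-samecolor"],
    "png": ["text-chunks", "trailer", "stego-lsb", "visible"],
    "jpeg": ["exif", "com", "trailer", "visible"],
    "svg": ["meta", "hidden-text", "comment", "visible"],
    "docx": ["core-props", "hidden-run", "visible", "visible-samecolor"],
}


def expected_counts(
    types: Sequence[str],
    count: int,
    techniques: Optional[Sequence[str]] = None,
) -> Tuple[int, int]:
    """Return (clean_files, dirty_files) for a run.

    `techniques=None` means all techniques. ASSUMPTION: a technique that a
    format does not support is skipped for that format.
    """
    clean = 0
    dirty = 0
    for file_type in types:
        supported = TECHNIQUES[file_type]
        if techniques is None:
            selected = supported
        else:
            selected = [t for t in supported if t in techniques]
        clean += count
        dirty += count * len(selected)
    return clean, dirty
```

Under these assumptions, a full run with the default count of 1 produces 5 clean and 21 dirty files, and `--types png,jpeg,svg --count 3` produces 9 clean and 36 dirty files.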

Examples

# Full corpus into ./temp
airs runtime dlp generate

# Images only, 3 each, reproducible
airs runtime dlp generate --types png,jpeg,svg --count 3 --seed 42

# Just PNG LSB steganography, JSON summary
airs runtime dlp generate --types png --techniques stego-lsb --output json