LANL Dataset Replay
The `seerflow.lanl` module ships a parser, converter, host-mapper, and validator for the LANL Unified Host and Network Dataset v2. It exists for one reason: to let you verify that the correlation engine actually catches the red-team activity the dataset labels as malicious.
If you contribute to the detection or correlation code, run LANL validation before opening a PR. It is the closest thing Seerflow has to an end-to-end ground-truth benchmark.
What the dataset is
Three event types plus one label file:
| File | Content |
|---|---|
| `auth.txt` / `redauth.txt` | Authentication events (success / failure) |
| `proc.txt` / `redproc.txt` | Process start / stop events |
| `flows.txt` / `redflows.txt` | Network flow events |
| `redteam.txt` | Hand-labeled red-team compromise events (ground truth) |
The dataset is gzip-compressed CSV. Records use anonymized identifiers (U13, C457, etc.) for users and hosts.
```python
from seerflow.lanl import (
    parse_auth_line,
    parse_flow_line,
    parse_proc_line,
    parse_redteam_line,
    convert_auth_record,
    convert_flow_record,
    convert_proc_record,
    host_to_ip,
    run_validation,
)
```

Parsing

```python
import gzip
from pathlib import Path

from seerflow.lanl import parse_auth_line

# Stream the compressed file line by line instead of loading it whole.
with gzip.open(Path("auth.txt.gz"), "rt") as fh:
    for line in fh:
        rec = parse_auth_line(line)  # rec is a frozen AuthRecord
```

The `parse_*_line` functions return frozen, slotted dataclasses that are safe to share across threads and to use as dict or set keys. The parsers allocate nothing beyond the record itself, so streaming a multi-GB file uses constant memory as long as the caller does not retain the records.
Conversion to SeerflowEvent
```python
from seerflow.lanl import convert_auth_record

event = convert_auth_record(rec)
# event is a SeerflowEvent ready to feed into the pipeline
```

The converter maps LANL’s anonymized identifiers to deterministic UUID5 entity IDs via `host_to_ip`, so the same `C457` always resolves to the same entity across runs.
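The determinism comes from how UUID5 works: it hashes a namespace plus a name, so equal inputs always give equal IDs. The sketch below shows that property with the standard library only. It does not reproduce `host_to_ip` or the converter's actual namespace; `SKETCH_NAMESPACE` and `entity_id_sketch` are hypothetical names.

```python
import uuid

# Hypothetical namespace for illustration; the converter's real namespace
# is not documented here.
SKETCH_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "lanl.example.invalid")


def entity_id_sketch(identifier: str) -> uuid.UUID:
    """Map an anonymized identifier such as 'C457' to a stable UUID5."""
    return uuid.uuid5(SKETCH_NAMESPACE, identifier)


first = entity_id_sketch("C457")
second = entity_id_sketch("C457")   # same input, same ID
other = entity_id_sketch("C458")    # different input, different ID
```

Because no random state or counter is involved, IDs produced in separate processes or on separate days can be compared directly.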
Validation
```python
from seerflow.lanl import run_validation

result = run_validation(
    auth_path="data/auth.txt.gz",
    proc_path="data/proc.txt.gz",
    flow_path="data/flows.txt.gz",
    redteam_path="data/redteam.txt.gz",
)
print(result.precision, result.recall, result.f1)
print(result.detected_compromise_count, "/", result.total_compromise_count)
```

`ValidationResult` aggregates precision, recall, F1, and per-entity detection traces against the red-team labels. Use it to flag regressions in the correlation engine or in Sigma rules.
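For readers who want the arithmetic behind those three numbers, the following sketch computes them from two sets of entity identifiers. Matching at the entity level is an assumption; the real validator may match on events or time windows. `score_sketch` and the sample identifiers are hypothetical.

```python
def score_sketch(detected: set[str], labeled: set[str]) -> tuple[float, float, float]:
    """Return (precision, recall, f1) for detected vs. ground-truth entities."""
    true_positives = len(detected & labeled)
    # Guard each division so empty inputs give 0.0 instead of raising.
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(labeled) if labeled else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Four labeled compromises; the engine flags three hosts, two of them correct.
labeled = {"C457", "C17", "C92", "C300"}
detected = {"C457", "C17", "C5"}

precision, recall, f1 = score_sketch(detected, labeled)
# precision = 2/3, recall = 2/4, f1 = 4/7
```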
There is no top-level seerflow lanl command (the dataset workflow is for contributors, not operators). Drive it from a Python script in scripts/ or a notebook.
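One way such a script could gate a PR is to compare the metrics against a stored baseline. The sketch below is an assumption about workflow, not part of Seerflow: `find_regressions`, the baseline values, and the tolerance are hypothetical, and a `SimpleNamespace` stands in for the `ValidationResult` so the example runs without the dataset. Only the attribute names `precision`, `recall`, and `f1` come from the usage shown above.

```python
from types import SimpleNamespace


def find_regressions(result, baseline: dict[str, float], tolerance: float = 0.01) -> list[str]:
    """List every metric that fell more than `tolerance` below its baseline."""
    failures = []
    for name, expected in baseline.items():
        actual = getattr(result, name)
        if actual < expected - tolerance:
            failures.append(f"{name} dropped: {actual:.3f} < {expected:.3f}")
    return failures


# Stand-in for the object returned by run_validation(...).
fake_result = SimpleNamespace(precision=0.90, recall=0.70, f1=0.7875)

# Hypothetical numbers recorded from a previous run on the main branch.
baseline = {"precision": 0.88, "recall": 0.75, "f1": 0.78}

failures = find_regressions(fake_result, baseline)
for message in failures:
    print(message)
```

In a real script, the stand-in would be replaced by the result of `run_validation`, and a non-empty `failures` list would translate into a non-zero exit code.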
See also
- `src/seerflow/lanl/parser.py`: record types
- `src/seerflow/lanl/converter.py`: `SeerflowEvent` mapping
- `src/seerflow/lanl/validator.py`: metric computation
- LANL paper: Turcotte et al., Unified Host and Network Data Set, 2017