AI-First Cyber Threat Intelligence Pipeline
State machine
Each article walks an explicit sequence of states (fetched, titled, classified, sourced, enriched, vetted, stored), observable end-to-end.
Schema-first
Actors, malware, CVEs, indicators, MITRE ATT&CK and victimology come out as typed JSON.
Story-aware
Daily clusters, cross-day similarity links, and multi-day super-clusters reconstruct evolving narratives.
LLM-adjudicated
An adjudicator agent breaks the tie between "duplicate" and "ongoing story" when deterministic similarity is ambiguous.
7-signal
Source diversity, cluster size, CVE/actor/IOC density, temporal persistence and content category combine into a single 0–100 score, with no extra API calls.
Operable
Per-call cost tracking, structured run/step logs, a manual reprocess path, and dashboards built around the same data model.
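To make the schema-first idea concrete, here is a minimal sketch of what a typed enrichment record could look like. The class and field names (`Enrichment`, `Cve`, `Indicator`, `parse_enrichment`) are assumptions for illustration, not the pipeline's actual schema, and the sample values are placeholders.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Cve:
    id: str
    cvss: Optional[float] = None  # base score, when the article states one


@dataclass
class Indicator:
    type: str   # e.g. "ipv4", "domain", "sha256"
    value: str


@dataclass
class Enrichment:
    # All field names here are hypothetical.
    actors: List[str] = field(default_factory=list)
    malware: List[str] = field(default_factory=list)
    cves: List[Cve] = field(default_factory=list)
    indicators: List[Indicator] = field(default_factory=list)
    attack_techniques: List[str] = field(default_factory=list)
    victim_sectors: List[str] = field(default_factory=list)
    victim_countries: List[str] = field(default_factory=list)


def parse_enrichment(raw: str) -> Enrichment:
    """Parse model output into typed objects; unknown keys raise TypeError."""
    data = json.loads(raw)
    data["cves"] = [Cve(**c) for c in data.get("cves", [])]
    data["indicators"] = [Indicator(**i) for i in data.get("indicators", [])]
    return Enrichment(**data)


# Placeholder values only; none of this is real intelligence.
sample = json.dumps({
    "actors": ["ExampleActor"],
    "cves": [{"id": "CVE-0000-0000", "cvss": 9.8}],
    "indicators": [{"type": "ipv4", "value": "192.0.2.1"}],
    "attack_techniques": ["T1566"],
    "victim_sectors": ["finance"],
})
record = parse_enrichment(sample)
```

Rejecting unknown keys at parse time is one way to keep free-form prose from leaking into downstream stages; `asdict(record)` turns the record back into plain JSON-ready data.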
Problem
Cybersecurity teams operate on top of a constant stream of unstructured reporting: vendor blogs, advisories, news, intel feeds. Raw data is noisy, duplicated, and inconsistent. Generic summarization loses the entities and relationships that make intelligence actionable, and naïve deduplication either fragments stories into isolated articles or collapses unrelated events into a single noisy bucket. Producing structured, reliable, queryable intelligence at scale takes more than summarization or deduplication.
Approach
I designed an end-to-end system built around two explicit phases. The first is a per-item state machine that walks each article through filtering, source resolution, schema-first enrichment, and vetting. The second is a nightly post-processing chain that normalizes entities, rebuilds daily clusters, links them across time into multi-day stories, lets an LLM adjudicator settle gray-zone “duplicate vs ongoing” calls, scores the resulting stories by importance, and pushes the intelligence into a central store and downstream sinks. I keep specialist responsibilities (title rewriting, relevance filtering, source resolution, structured extraction, secondary vetting) as separate agents so each stays small, testable, and replaceable.
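The per-item phase can be sketched as a small state machine. This is an illustration under assumptions, not the production code: the state names follow the sequence listed at the top of the page, while the `REJECTED` state, the `Article` class, and the choice to reject at the classification step are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class State(Enum):
    FETCHED = "fetched"
    TITLED = "titled"
    CLASSIFIED = "classified"
    SOURCED = "sourced"
    ENRICHED = "enriched"
    VETTED = "vetted"
    STORED = "stored"
    REJECTED = "rejected"  # hypothetical terminal state for irrelevant items


# Happy-path order; REJECTED is a side exit, not part of the sequence.
ORDER = [State.FETCHED, State.TITLED, State.CLASSIFIED, State.SOURCED,
         State.ENRICHED, State.VETTED, State.STORED]
TERMINAL = (State.STORED, State.REJECTED)


@dataclass
class Article:
    url: str
    state: State = State.FETCHED
    # A real system would persist each transition; here it is kept in memory.
    history: List[State] = field(default_factory=lambda: [State.FETCHED])


def advance(article: Article, relevant: bool = True) -> Article:
    """Move one step forward; exit early if classification says irrelevant."""
    if article.state in TERMINAL:
        return article
    nxt = ORDER[ORDER.index(article.state) + 1]
    if nxt is State.CLASSIFIED and not relevant:
        nxt = State.REJECTED  # stop before any enrichment work happens
    article.state = nxt
    article.history.append(nxt)
    return article


def run(article: Article, relevant: bool = True) -> Article:
    """Drive a single article until it reaches a terminal state."""
    while article.state not in TERMINAL:
        advance(article, relevant)
    return article


kept = run(Article("https://example.com/advisory"))
dropped = run(Article("https://example.com/off-topic"), relevant=False)
```

Because every transition lands in `history`, an article that stops partway can be resumed from its last recorded state, and re-driving a single URL is just another call to `run`.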
Key Design Decisions
- Per-item state machine with persisted state transitions makes the pipeline resumable, auditable, and easy to re-drive on a single URL.
- Early relevance rejection happens before any expensive enrichment runs, keeping spend tied to articles that matter.
- Schema-first enrichment emits typed entities (threat actors, malware, CVEs with CVSS, IOCs, MITRE ATT&CK, victimology) instead of free-form prose. Every downstream stage consumes structure.
- Entity normalization applies curated alias mappings so the same actor, malware family, or product surfaces under one canonical name across sources.
- Daily clustering + temporal links + super-clusters reconstruct multi-day campaigns instead of leaving analysts with isolated articles.
- LLM adjudicator on gray-zone temporal links settles borderline “duplicate vs ongoing story” decisions that deterministic similarity cannot resolve; the verdict, confidence, and rationale are persisted alongside the link.
- Local 7-signal importance scoring (source diversity, cluster size, CVE/actor/IOC presence, temporal persistence, content category) ranks stories without any extra API call.
- Operational discipline: per-call cost tracking, structured run/step logs, soft-deletes, idempotent migrations and a single manual reprocess path that mirrors the nightly chain.
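The local importance score from the list above could be computed along these lines. This is a minimal sketch: the seven signals come from the description, but the weights, saturation caps, category values, and the `Story` fields are all hypothetical, chosen only so that the result lands in the 0–100 range.

```python
from dataclasses import dataclass

# Hypothetical weights; they sum to 100 so a saturated story scores 100.
WEIGHTS = {
    "source_diversity": 25,
    "cluster_size": 15,
    "cve": 15,
    "actor": 15,
    "ioc": 10,
    "persistence": 10,
    "category": 10,
}

# Hypothetical category values in [0, 1]; unknown categories score 0.
CATEGORY_VALUE = {"exploit": 1.0, "breach": 0.8, "research": 0.5, "opinion": 0.1}


@dataclass
class Story:
    distinct_sources: int
    article_count: int
    has_cve: bool
    has_actor: bool
    has_ioc: bool
    days_active: int
    category: str


def _cap(value: int, ceiling: int) -> float:
    """Scale a count into [0, 1], saturating at `ceiling`."""
    return min(max(value, 0), ceiling) / ceiling


def importance(story: Story) -> int:
    """Combine seven locally available signals into a 0-100 score."""
    signals = {
        "source_diversity": _cap(story.distinct_sources, 5),
        "cluster_size": _cap(story.article_count, 10),
        "cve": float(story.has_cve),
        "actor": float(story.has_actor),
        "ioc": float(story.has_ioc),
        "persistence": _cap(story.days_active, 7),
        "category": CATEGORY_VALUE.get(story.category, 0.0),
    }
    return round(sum(WEIGHTS[name] * value for name, value in signals.items()))


saturated = Story(5, 10, True, True, True, 7, "exploit")
empty = Story(0, 0, False, False, False, 0, "unknown")
```

Every input is already present on the stored cluster, which is what lets the ranking run without any extra API call.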
Outcome
The pipeline turns raw reporting into a queryable intelligence layer: ranked, deduplicated, and stitched into evolving stories that analysts can drill into by actor, malware, CVE, sector, or geography. IOC, CVE and morning-briefing exports feed downstream consumers.