Stage: clustering

Groups ranked papers into thematic clusters for synthesis and report structure.

Input / output

| Direction | Type | Details |
| --- | --- | --- |
| Input (`data`) | `list[RankedPaper]` | From `relevance_scoring` |
| Input (artifact) | `paper_embeddings` | Reused from ranking when available |
| Output (`data`) | `list[PaperCluster]` | Thematic groups |
| Artifacts written | `paper_clusters` | Read by `synthesis`, `gap_analysis`, `report_generation` |

Primary algorithm: HDBSCAN on paper embedding vectors when sentence-transformers is available and embeddings exist.

Fallback: single keyword-based theme group when embeddings are unavailable or clustering produces no clusters.

Each cluster includes a theme label, summary, and list of member paper_ids.
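The primary path and the fallback described above can be sketched roughly as follows. This is a minimal illustration, not the stage's actual code. The `PaperCluster` fields, the `cluster_papers` and `keyword_theme` helpers, the `(paper_id, title)` input shape, and the choice of scikit-learn's `HDBSCAN` class are all assumptions. Dropping noise points (label `-1`) is also an assumption, since the source does not say how they are handled.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class PaperCluster:
    # Hypothetical shape: a theme label, a summary, and member paper ids.
    theme: str
    summary: str
    paper_ids: list[str]


def keyword_theme(titles: Sequence[str]) -> str:
    """Crude keyword label: the most frequent word longer than 3 characters."""
    words = [w for t in titles for w in t.lower().split() if len(w) > 3]
    return Counter(words).most_common(1)[0][0] if words else "general"


def make_cluster(members: Sequence[tuple[str, str]]) -> PaperCluster:
    return PaperCluster(
        theme=keyword_theme([title for _, title in members]),
        summary=f"{len(members)} papers",
        paper_ids=[pid for pid, _ in members],
    )


def cluster_papers(
    papers: Sequence[tuple[str, str]],  # (paper_id, title) pairs, assumed shape
    embeddings: Optional[Sequence[Sequence[float]]] = None,
    min_cluster_size: int = 2,
    min_samples: Optional[int] = None,
) -> list[PaperCluster]:
    if not papers:
        return []  # empty input yields an empty cluster list

    groups: dict[int, list[tuple[str, str]]] = {}
    if embeddings is not None and len(embeddings) == len(papers):
        try:
            # scikit-learn >= 1.3 ships HDBSCAN; the real stage may use another library.
            from sklearn.cluster import HDBSCAN

            labels = HDBSCAN(
                min_cluster_size=min_cluster_size, min_samples=min_samples
            ).fit_predict(embeddings)
        except (ImportError, ValueError):
            labels = []  # treat as "no clusters" and fall through to the fallback
        for paper, label in zip(papers, labels):
            if label >= 0:  # -1 marks noise; dropped here (assumption)
                groups.setdefault(int(label), []).append(paper)

    if not groups:
        # Fallback: a single keyword-based theme group holding every paper.
        return [make_cluster(list(papers))]
    return [make_cluster(groups[k]) for k in sorted(groups)]
```

Calling `cluster_papers(papers)` without embeddings always takes the fallback branch, which makes that path easy to exercise in isolation.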

| Key | Purpose |
| --- | --- |
| `clustering.min_cluster_size` | HDBSCAN minimum cluster size |
| `clustering.min_samples` | HDBSCAN density parameter |
| `embedding.*` | Embedding model (if re-computing) |
| `ranking.*` | Used by `ensure_ranked_papers` adapter |
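As an illustration of how the dotted keys above might map onto a nested configuration, here is a small sketch. The values, the `embedding.model` key, and the `get_setting` helper are hypothetical. The source does not state defaults for the clustering keys or say how the configuration is stored.

```python
from typing import Any

# Illustrative values only; none of these are documented defaults.
config: dict[str, Any] = {
    "clustering": {"min_cluster_size": 3, "min_samples": 2},
    "embedding": {"model": "example-embedding-model"},  # hypothetical key
}


def get_setting(cfg: dict[str, Any], dotted_key: str, default: Any = None) -> Any:
    """Resolve a dotted key such as 'clustering.min_samples' in a nested dict."""
    node: Any = cfg
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


min_cluster_size = get_setting(config, "clustering.min_cluster_size")
min_samples = get_setting(config, "clustering.min_samples")
```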

No — HDBSCAN + keyword fallback.

Stage timeout: set by `pipeline.stage_timeout_seconds` (default 300 s).

Empty input returns an empty cluster list. On failure, the stage returns the prior data unchanged.
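The empty-input and failure behaviour, together with the stage timeout, could be wrapped along these lines. This is a sketch under assumptions: `run_clustering_stage` is a hypothetical name, the thread-based timeout is one possible mechanism, and treating a timeout as a failure that returns the prior data is a guess the source does not confirm.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable


def run_clustering_stage(
    stage_fn: Callable[[list[Any]], list[Any]],
    data: list[Any],
    timeout_seconds: float = 300.0,  # mirrors pipeline.stage_timeout_seconds
) -> list[Any]:
    if not data:
        return []  # empty input: empty cluster list

    executor = ThreadPoolExecutor(max_workers=1)
    try:
        # result() raises on timeout or re-raises any exception from stage_fn.
        return executor.submit(stage_fn, data).result(timeout=timeout_seconds)
    except Exception:
        # On failure (including timeout, by assumption) pass the prior data through.
        return data
    finally:
        # Do not block on a worker that is still running after a timeout.
        executor.shutdown(wait=False)
```

A thread cannot be forcibly cancelled in Python, so a timed-out worker keeps running in the background. A process-based executor would be the alternative if hard cancellation matters.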