Skip to content

Pipeline Stages

The research pipeline runs eleven stages in fixed order. Each stage implements the PipelineStage protocol (name + async run(ctx, data) → StageResult).

Built by build_pipeline() in src/retrieval/orchestrator.py. Registry keys in src/core/registry.py.

Stage order

query_understanding → query_expansion → retrieval → deduplication → ranking →
relevance_scoring → clustering → synthesis → gap_analysis → citation_export → report_generation

Summary table

# Stage Class LLM Timeout Deep dive
1 query_understanding QueryUnderstandingStage No 300 s
2 query_expansion QueryExpansionStage Optional 300 s
3 retrieval RetrievalStage No 300 s
4 deduplication DeduplicationStage No 300 s
5 ranking RankingStage No 300 s
6 relevance_scoring RelevanceScoringStage No 300 s
7 clustering ClusteringStage No 300 s
8 synthesis SynthesisStage Optional (two-pass) 600 s
9 gap_analysis GapAnalysisStage Optional 300 s
10 citation_export CitationExportStage No 300 s
11 report_generation ReportGenerationStage No 300 s

Default timeouts from pipeline.stage_timeout_seconds (300) and pipeline.synthesis_timeout_seconds (600).

Enable/disable

Each stage can be toggled via pipeline.enabled_stages.{stage_name} in YAML or environment. Disabled stages are skipped; downstream stages receive the last enabled stage's output unchanged. See Stage toggles.

Execution behavior

Mechanism Behavior
Sequential data chain Each stage output → next stage input
Artifact store Cross-stage shared state on PipelineContext
LLM resolution resolve_effective_settings() before first stage
Failure handling continue_on_stage_failure: true (default) — heuristic recovery
Timeout recovery src/core/stage_recovery.py — synthesis and gap_analysis have dedicated fallbacks
Progress events PipelineEventBus emits stage start/complete for stderr progress reporter

Stage recovery

Stage Recovery on timeout/failure
synthesis Heuristic extraction + synthesis from ranked papers
gap_analysis Heuristic gaps from synthesis fields + clusters
All others Return prior data unchanged

Data-flow diagram

flowchart LR
  Q[query: str] --> QU[query_understanding]
  QU -->|QueryUnderstandingResult| QE[query_expansion]
  QE -->|ExpandedQuerySet| RT[retrieval]
  RT -->|list RetrievedPaper| DD[deduplication]
  DD -->|list RetrievedPaper| RK[ranking]
  RK -->|list RankedPaper| RS[relevance_scoring]
  RS -->|list RankedPaper| CL[clustering]
  CL -->|list PaperCluster| SY[synthesis]
  SY -->|SynthesisResult| GA[gap_analysis]
  GA -->|GapAnalysisResult| CE[citation_export]
  CE -->|dict exports| RG[report_generation]
  RG -->|EnhancedResearchReport| OUT[output]

Config keys by stage

Stage Primary config sections
query_understanding
query_expansion query_expansion.*
retrieval retrieval.*, retrieval.providers.*
deduplication deduplication.*, embedding.*
ranking ranking.*, embedding.*
relevance_scoring relevance_scoring.*
clustering clustering.*, embedding.*, ranking.*
synthesis synthesis.*, llm.*
gap_analysis synthesis.llm_enabled (gates LLM — no separate gap flag)
citation_export
report_generation relevance_scoring.* (executive summary embedding floor)