Pipeline Stages¶
The research pipeline runs eleven stages in fixed order. Each stage implements the PipelineStage protocol (name + async run(ctx, data) → StageResult).
Built by build_pipeline() in src/retrieval/orchestrator.py. Registry keys in src/core/registry.py.
Stage order¶
query_understanding → query_expansion → retrieval → deduplication → ranking →
relevance_scoring → clustering → synthesis → gap_analysis → citation_export → report_generation
Summary table¶
| # | Stage | Class | LLM | Timeout | Deep dive |
|---|---|---|---|---|---|
| 1 | query_understanding | QueryUnderstandingStage |
No | 300 s | → |
| 2 | query_expansion | QueryExpansionStage |
Optional | 300 s | → |
| 3 | retrieval | RetrievalStage |
No | 300 s | → |
| 4 | deduplication | DeduplicationStage |
No | 300 s | → |
| 5 | ranking | RankingStage |
No | 300 s | → |
| 6 | relevance_scoring | RelevanceScoringStage |
No | 300 s | → |
| 7 | clustering | ClusteringStage |
No | 300 s | → |
| 8 | synthesis | SynthesisStage |
Optional (two-pass) | 600 s | → |
| 9 | gap_analysis | GapAnalysisStage |
Optional | 300 s | → |
| 10 | citation_export | CitationExportStage |
No | 300 s | → |
| 11 | report_generation | ReportGenerationStage |
No | 300 s | → |
Default timeouts from pipeline.stage_timeout_seconds (300) and pipeline.synthesis_timeout_seconds (600).
Enable/disable¶
Each stage can be toggled via pipeline.enabled_stages.{stage_name} in YAML or environment. Disabled stages are skipped; downstream stages receive the last enabled stage's output unchanged. See Stage toggles.
Execution behavior¶
| Mechanism | Behavior |
|---|---|
Sequential data chain |
Each stage output → next stage input |
| Artifact store | Cross-stage shared state on PipelineContext |
| LLM resolution | resolve_effective_settings() before first stage |
| Failure handling | continue_on_stage_failure: true (default) — heuristic recovery |
| Timeout recovery | src/core/stage_recovery.py — synthesis and gap_analysis have dedicated fallbacks |
| Progress events | PipelineEventBus emits stage start/complete for stderr progress reporter |
Stage recovery¶
| Stage | Recovery on timeout/failure |
|---|---|
| synthesis | Heuristic extraction + synthesis from ranked papers |
| gap_analysis | Heuristic gaps from synthesis fields + clusters |
| All others | Return prior data unchanged |
Data-flow diagram¶
flowchart LR
Q[query: str] --> QU[query_understanding]
QU -->|QueryUnderstandingResult| QE[query_expansion]
QE -->|ExpandedQuerySet| RT[retrieval]
RT -->|list RetrievedPaper| DD[deduplication]
DD -->|list RetrievedPaper| RK[ranking]
RK -->|list RankedPaper| RS[relevance_scoring]
RS -->|list RankedPaper| CL[clustering]
CL -->|list PaperCluster| SY[synthesis]
SY -->|SynthesisResult| GA[gap_analysis]
GA -->|GapAnalysisResult| CE[citation_export]
CE -->|dict exports| RG[report_generation]
RG -->|EnhancedResearchReport| OUT[output]
Config keys by stage¶
| Stage | Primary config sections |
|---|---|
| query_understanding | — |
| query_expansion | query_expansion.* |
| retrieval | retrieval.*, retrieval.providers.* |
| deduplication | deduplication.*, embedding.* |
| ranking | ranking.*, embedding.* |
| relevance_scoring | relevance_scoring.* |
| clustering | clustering.*, embedding.*, ranking.* |
| synthesis | synthesis.*, llm.* |
| gap_analysis | synthesis.llm_enabled (gates LLM — no separate gap flag) |
| citation_export | — |
| report_generation | relevance_scoring.* (executive summary embedding floor) |