Stage: retrieval
Searches enabled scholarly providers concurrently and merges results.
| Class | RetrievalStage |
| Module | src/retrieval/retrieval_stage.py |
| Registry key | retrieval |
Input / output
| Direction | Type | Details |
|---|---|---|
| Input (data) | `ExpandedQuerySet` | Original + variants + sub-questions searched |
| Input (artifact) | `cached_papers` | Session cache bypass; skips provider calls |
| Output (data) | `list[RetrievedPaper]` | Merged, deduplicated at provider level |
| Artifacts written | `retrieved_papers` | Used by synthesis recovery |
Behavior
- If the `cached_papers` artifact exists, validates and returns the cached papers immediately (cache hit).
- Otherwise, builds the query list from `ExpandedQuerySet` (original + variants + sub-questions).
- For each enabled provider in config, searches all queries concurrently (bounded by `concurrency_limit`).
- Provider failures are collected as warnings; partial results continue.
- All provider results are merged into a single list.
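The steps above can be sketched with `asyncio`. This is a minimal illustration, not the actual `RetrievalStage` code: `StubProvider`, `retrieve`, the `RetrievedPaper` fields, and the provider names are all hypothetical, and validation of cached papers is omitted.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class RetrievedPaper:
    # Hypothetical minimal shape; the real class likely has more fields.
    title: str
    provider: str


class StubProvider:
    """Hypothetical provider; a real one would call a scholarly API."""

    def __init__(self, name: str, fail: bool = False):
        self.name = name
        self.fail = fail

    async def search(self, query: str, limit: int) -> list[RetrievedPaper]:
        await asyncio.sleep(0)  # stand-in for network I/O
        if self.fail:
            raise RuntimeError(f"{self.name} unavailable")
        return [RetrievedPaper(f"{query} #{i}", self.name) for i in range(limit)]


async def retrieve(providers, queries, per_provider_limit, concurrency_limit,
                   cached_papers=None):
    # Cache hit: return cached papers without calling any provider.
    if cached_papers is not None:
        return list(cached_papers), []

    semaphore = asyncio.Semaphore(concurrency_limit)

    async def bounded_search(provider, query):
        async with semaphore:  # at most concurrency_limit requests in flight
            return await provider.search(query, per_provider_limit)

    jobs = [(p, q) for p in providers for q in queries]
    results = await asyncio.gather(
        *(bounded_search(p, q) for p, q in jobs), return_exceptions=True
    )

    papers, warning_msgs = [], []
    for (provider, query), result in zip(jobs, results):
        if isinstance(result, Exception):
            # Failures become warnings; results from other searches are kept.
            warning_msgs.append(f"{provider.name} failed on {query!r}: {result}")
        else:
            papers.extend(result)  # merge into a single list
    return papers, warning_msgs


papers, warning_msgs = asyncio.run(
    retrieve(
        providers=[StubProvider("alpha"), StubProvider("beta", fail=True)],
        queries=["graph neural networks", "GNN survey"],
        per_provider_limit=2,
        concurrency_limit=3,
    )
)
```

Here the failing `beta` provider contributes two warnings (one per query), while the four papers from `alpha` are still returned. Using `return_exceptions=True` is one way to keep a single provider error from cancelling the other searches.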
Per-provider limit caveat
The stage always passes `settings.retrieval.per_provider_limit` to `provider.search()`. Per-provider limit values set in YAML are ignored at the stage level.
Configuration
| Key | Purpose |
|---|---|
| `retrieval.concurrency_limit` | Max concurrent provider requests |
| `retrieval.per_provider_limit` | Papers per provider per query |
| `retrieval.providers.{name}.enabled` | Enable/disable each provider |
See Provider matrix for live vs stub providers.
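As a rough illustration of how these keys interact, the sketch below models them as dataclasses and computes which providers would be searched and with what limit. The class names, the `search_plan` helper, the default values, the provider names, and the per-provider `limit` field are assumptions made for the example; only the three documented keys come from this page.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProviderSettings:
    enabled: bool = True
    # Hypothetical per-provider value; per the caveat, the stage does not read it.
    limit: Optional[int] = None


@dataclass
class RetrievalSettings:
    concurrency_limit: int = 5    # assumed default, not from the source
    per_provider_limit: int = 20  # assumed default, not from the source
    providers: dict[str, ProviderSettings] = field(default_factory=dict)


def search_plan(settings: RetrievalSettings) -> dict[str, int]:
    """Map each enabled provider to the limit the stage would pass to search()."""
    return {
        name: settings.per_provider_limit  # always the stage-wide value
        for name, cfg in settings.providers.items()
        if cfg.enabled
    }


settings = RetrievalSettings(
    per_provider_limit=10,
    providers={
        "alpha": ProviderSettings(enabled=True, limit=50),
        "beta": ProviderSettings(enabled=False),
        "gamma": ProviderSettings(),
    },
)
plan = search_plan(settings)
print(plan)  # {'alpha': 10, 'gamma': 10}
```

The disabled provider is skipped, and `alpha` receives the stage-wide limit of 10 even though its own (hypothetical) `limit` is 50.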
LLM
No.
Timeout
`pipeline.stage_timeout_seconds` (default 300 s).
Recovery
On total failure, returns an empty list with `partial=True` and a warning.
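A minimal sketch of this recovery path, assuming "total failure" means every provider that was tried failed. `StageResult`, `finalize`, and all field names other than `partial` are hypothetical. The source does not say whether `partial` is also set when only some providers fail; this sketch leaves it `False` in that case.

```python
from dataclasses import dataclass, field


@dataclass
class StageResult:
    # Hypothetical result container; only `partial` is named in the docs.
    data: list = field(default_factory=list)
    partial: bool = False
    warnings: list[str] = field(default_factory=list)


def finalize(papers: list, provider_errors: dict[str, str],
             providers_tried: int) -> StageResult:
    # One warning per failed provider.
    warnings = [f"{name}: {err}" for name, err in provider_errors.items()]
    total_failure = providers_tried > 0 and len(provider_errors) == providers_tried
    if total_failure:
        # Total failure: empty list, flagged partial, with a warning.
        warnings.append("retrieval failed for all providers")
        return StageResult(data=[], partial=True, warnings=warnings)
    # Assumption: when some providers succeed, partial stays False.
    return StageResult(data=papers, partial=False, warnings=warnings)


failed = finalize([], {"alpha": "timeout", "beta": "HTTP 500"}, providers_tried=2)
ok = finalize(["paper-1"], {"beta": "HTTP 500"}, providers_tried=2)
```

Returning a flagged empty result instead of raising lets downstream stages decide how to proceed with no papers.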
Metrics
- `papers_found`
- `providers_failed`
- `cache_hit` (when applicable)