Provider Matrix
Summary of all seven registered retrieval providers. HTTP details are in per-provider pages.
Source: src/retrieval/providers/registry.py, _PROVIDER_CLASSES.
Status overview
| Key | Status | Default enabled | Auth | Notes |
|---|---|---|---|---|
| openalex | Live | Yes | None | Used in CLI and full pipeline |
| semantic_scholar | Live | Yes | S2_API_KEY (optional) | Used in CLI and full pipeline |
| arxiv | Live | No | None | Full pipeline / API only |
| crossref | Live | No | Mailto in User-Agent (recommended) | Full pipeline / API only |
| pubmed | Stub | No | — | NotImplementedError if enabled |
| core | Stub | No | — | NotImplementedError if enabled |
| dblp | Stub | No | — | NotImplementedError if enabled |
Stub providers
PubMed, CORE, and DBLP are registered but not implemented. Enabling them raises NotImplementedError in search(), which _safe_search() catches — you get a warning and empty results for that provider, not a crash.
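The stub behaviour described above can be illustrated with a minimal sketch. The names `StubProvider` and `safe_search`, the `search(query, limit)` signature, and the logger name are assumptions made for illustration; the project's actual `_safe_search()` may differ in signature and in how it reports the warning.

```python
import asyncio
import logging

logger = logging.getLogger("retrieval")  # hypothetical logger name


class StubProvider:
    """Hypothetical stand-in for a registered but unimplemented provider."""

    name = "pubmed"

    async def search(self, query: str, limit: int) -> list[dict]:
        raise NotImplementedError(f"{self.name} provider is not implemented")


async def safe_search(provider, query: str, limit: int = 8) -> list[dict]:
    """Run provider.search(); on any failure, warn and return no papers."""
    try:
        return await provider.search(query, limit)
    except Exception as exc:  # NotImplementedError is a subclass of Exception
        logger.warning("provider %s failed: %r", provider.name, exc)
        return []


papers = asyncio.run(safe_search(StubProvider(), "graph neural networks"))
print(papers)  # []
```

The caller sees an ordinary empty result, so an enabled stub costs one warning per search rather than aborting the run.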
Enable snippets
OpenAlex + Semantic Scholar (defaults — no change needed):
# config/providers.yaml
providers:
  openalex:
    enabled: true
  semantic_scholar:
    enabled: true
Add arXiv and CrossRef:
providers:
  arxiv:
    enabled: true
  crossref:
    enabled: true
Or via environment variables:
RA_RETRIEVAL__PROVIDERS__ARXIV__ENABLED=true
RA_RETRIEVAL__PROVIDERS__CROSSREF__ENABLED=true
RA_CROSSREF_MAILTO=you@example.com
Do not enable stubs:
# Will warn and return no papers from this provider
pubmed:
  enabled: true  # not recommended until implemented
HTTP summary
| Provider | Method | Base endpoint | Search timeout | 429 handling |
|---|---|---|---|---|
| OpenAlex | GET | https://api.openalex.org/works | 60s | Retry only |
| Semantic Scholar | GET | https://api.semanticscholar.org/graph/v1/paper/search/bulk | 60s | Sleep Retry-After |
| arXiv | GET | https://export.arxiv.org/api/query | 60s | Retry only |
| CrossRef | GET | https://api.crossref.org/works | 60s | Sleep Retry-After |
All live providers use 3 retries with exponential backoff. The health ping uses a 15s timeout.
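The two 429 strategies in the table can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the `RateLimited` exception, the 1-second base delay, the doubling factor, and reading "3 retries" as one initial attempt plus three retries are all hypothetical, and Retry-After is assumed to be given in seconds.

```python
import time
from typing import Callable, Optional

MAX_RETRIES = 3  # from the summary above; interpreted as retries after the first attempt


class RateLimited(Exception):
    """Hypothetical error carrying the Retry-After value of a 429 response."""

    def __init__(self, retry_after: Optional[float] = None):
        super().__init__("HTTP 429")
        self.retry_after = retry_after


def backoff_delay(attempt: int, retry_after: Optional[float] = None,
                  base: float = 1.0) -> float:
    """Delay before retry number `attempt` (0-based).

    Use the server's Retry-After hint when one is passed in; otherwise
    fall back to exponential backoff: base * 2**attempt.
    """
    if retry_after is not None:
        return retry_after
    return base * (2 ** attempt)


def call_with_retries(request: Callable[[], str], honour_retry_after: bool,
                      sleep: Callable[[float], None] = time.sleep) -> str:
    """Call request(); retry up to MAX_RETRIES times after RateLimited."""
    for attempt in range(MAX_RETRIES + 1):
        try:
            return request()
        except RateLimited as exc:
            if attempt == MAX_RETRIES:
                raise  # retries exhausted
            hint = exc.retry_after if honour_retry_after else None
            sleep(backoff_delay(attempt, hint))
    raise AssertionError("unreachable")


# Demo: a request that is rate-limited twice, then succeeds.
calls = {"n": 0}
slept: list[float] = []


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] <= 2:
        raise RateLimited(retry_after=5.0)
    return "ok"


result = call_with_retries(flaky, honour_retry_after=True, sleep=slept.append)
print(result, slept)  # ok [5.0, 5.0]
```

With `honour_retry_after=False` (the "Retry only" rows), the same two failures would instead wait 1.0 and 2.0 seconds under the assumed base delay.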
Config keys used at runtime
| Key | Default | Used? |
|---|---|---|
| retrieval.concurrency_limit | 4 | Yes: caps parallel variant searches |
| retrieval.per_provider_limit | 8 | Yes: results per provider per variant |
| retrieval.providers.<name>.enabled | see table | Yes (full pipeline only) |
| retrieval.providers.<name>.limit | 8 | No: overridden by per_provider_limit |
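To make the two active limits concrete, here is a small sketch of one way a concurrency cap and a per-provider result limit could interact. Using `asyncio.Semaphore` is an assumption about the mechanism; `search_variant`, `run_all`, and the fake 20-hit response are hypothetical.

```python
import asyncio

CONCURRENCY_LIMIT = 4   # retrieval.concurrency_limit default
PER_PROVIDER_LIMIT = 8  # retrieval.per_provider_limit default


async def search_variant(variant: str, sem: asyncio.Semaphore,
                         stats: dict) -> list[str]:
    """Hypothetical variant search; the semaphore caps how many run at once."""
    async with sem:
        stats["running"] += 1
        stats["peak"] = max(stats["peak"], stats["running"])
        await asyncio.sleep(0.01)  # stands in for the provider HTTP calls
        stats["running"] -= 1
        # Pretend the provider returned 20 hits; keep only the configured number.
        hits = [f"{variant}-paper-{i}" for i in range(20)]
        return hits[:PER_PROVIDER_LIMIT]


async def run_all(variants: list[str]) -> tuple[list[list[str]], int]:
    sem = asyncio.Semaphore(CONCURRENCY_LIMIT)
    stats = {"running": 0, "peak": 0}
    results = await asyncio.gather(
        *(search_variant(v, sem, stats) for v in variants)
    )
    return results, stats["peak"]


results, peak = asyncio.run(run_all([f"v{i}" for i in range(10)]))
print(peak)             # 4
print(len(results[0]))  # 8
```

Ten variants are submitted, but no more than four searches are in flight at any moment, and each returns at most eight results.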
Graceful degradation
flowchart TD
Q[Query variant] --> GATHER[asyncio.gather enabled providers]
GATHER --> P1[Provider search]
P1 -->|success| MERGE[Merge papers]
P1 -->|exception| W[Warning + empty list]
W --> MERGE
MERGE --> DEDUP[Deduplication]
Partial retrieval (some providers failed) still proceeds through the pipeline. Check stderr warnings or enable debug dumps to inspect per-provider failures.
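The flow in the diagram can be sketched in a few lines. Everything here is illustrative: the provider functions, the `guarded` wrapper, and the `failures` dictionary (a stand-in for a debug dump) are hypothetical names, and the real pipeline's merge step may carry richer paper objects than plain IDs.

```python
import asyncio
import sys


async def openalex(query: str) -> list[str]:
    return ["W1", "W2"]  # hypothetical paper IDs


async def arxiv(query: str) -> list[str]:
    raise TimeoutError("search timed out")  # simulated provider failure


async def guarded(name, search, query, failures):
    """Turn a provider exception into a stderr warning plus an empty list."""
    try:
        return await search(query)
    except Exception as exc:
        failures[name] = repr(exc)  # kept so failures can be inspected later
        print(f"warning: {name} failed: {exc!r}", file=sys.stderr)
        return []


async def retrieve(query, providers):
    failures = {}
    # Fan out to every enabled provider for this query variant.
    per_provider = await asyncio.gather(
        *(guarded(name, fn, query, failures) for name, fn in providers.items())
    )
    # Merge step: flatten in provider order; deduplication would follow.
    merged = [paper for papers in per_provider for paper in papers]
    return merged, failures


merged, failures = asyncio.run(
    retrieve("transformers", {"openalex": openalex, "arxiv": arxiv})
)
print(merged)          # ['W1', 'W2']
print(list(failures))  # ['arxiv']
```

Because each provider call is wrapped before it reaches `asyncio.gather`, one failing provider never cancels the others, and the merged list simply contains fewer papers.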
See also: Retrieval overview, Environment variables.