OpenAlex

OpenAlex is the primary bibliographic index provider. It is enabled by default and is used in both the CLI shortcut path and the full pipeline.

Implementation: src/retrieval/providers/openalex.py

HTTP API

| Attribute | Value |
| --- | --- |
| Search URL | GET https://api.openalex.org/works |
| Query params | search=<query>, per-page=<limit> |
| Authentication | None required |
| Timeout | 60s (search), 15s (health) |
| Retries | 3 with exponential backoff |

Example request

GET https://api.openalex.org/works?search=transformer+attention&per-page=8

No API key or mailto is required. OpenAlex is suitable for broad scholarly coverage including DOIs, venues, and citation counts.
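
As a minimal sketch of how the example request above can be assembled, the snippet below builds the search URL with the standard library only. The function name build_search_url and the constant OPENALEX_WORKS_URL are illustrative and are not taken from src/retrieval/providers/openalex.py.

```python
from urllib.parse import urlencode

# Endpoint from the HTTP API table above.
OPENALEX_WORKS_URL = "https://api.openalex.org/works"


def build_search_url(query: str, limit: int = 10) -> str:
    """Return the works search URL for a free-text query (hypothetical helper)."""
    # urlencode escapes spaces as '+', matching the example request.
    params = {"search": query, "per-page": limit}
    return f"{OPENALEX_WORKS_URL}?{urlencode(params)}"


url = build_search_url("transformer attention", limit=8)
print(url)  # https://api.openalex.org/works?search=transformer+attention&per-page=8
```

No authentication header or query parameter is added, consistent with the note that no API key or mailto is required.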

Normalization

OpenAlex work JSON maps to RetrievedPaper:

| OpenAlex field | RetrievedPaper field |
| --- | --- |
| display_name | title |
| abstract_inverted_index | abstract (reconstructed) |
| publication_year | year |
| primary_location.source.display_name | venue |
| id or landing page URL | url |
| doi | doi |
| cited_by_count | citation_count |

Full raw JSON is preserved in raw_metadata.
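
The mapping above can be sketched roughly as follows. The RetrievedPaper dataclass here is a stand-in whose fields are inferred from the table; the real class may differ. The helper names reconstruct_abstract and normalize_work are hypothetical, and reading the landing page from primary_location.landing_page_url is an assumption. OpenAlex delivers abstracts as an inverted index (word to list of positions), so the text is rebuilt by sorting words by position.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class RetrievedPaper:
    """Stand-in for the project's RetrievedPaper; fields inferred from the table."""
    title: str
    abstract: Optional[str]
    year: Optional[int]
    venue: Optional[str]
    url: Optional[str]
    doi: Optional[str]
    citation_count: int
    raw_metadata: dict = field(default_factory=dict)


def reconstruct_abstract(inverted_index: Optional[dict]) -> Optional[str]:
    """Rebuild abstract text from a word -> positions mapping."""
    if not inverted_index:
        return None
    placed = [(pos, word) for word, positions in inverted_index.items() for pos in positions]
    return " ".join(word for _, word in sorted(placed))


def normalize_work(work: dict) -> RetrievedPaper:
    """Map one OpenAlex work object to a RetrievedPaper (illustrative only)."""
    location = work.get("primary_location") or {}
    source = location.get("source") or {}
    return RetrievedPaper(
        title=work.get("display_name") or "",
        abstract=reconstruct_abstract(work.get("abstract_inverted_index")),
        year=work.get("publication_year"),
        venue=source.get("display_name"),
        # Assumed precedence: landing page first, OpenAlex ID as the fallback.
        url=location.get("landing_page_url") or work.get("id"),
        doi=work.get("doi"),
        citation_count=work.get("cited_by_count") or 0,
        raw_metadata=work,  # full raw JSON is preserved
    )
```

The `or {}` guards matter because primary_location and its source can be null in OpenAlex responses, not just absent.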

Configuration

Default: enabled in config/default.yaml and config/providers.yaml.

retrieval:
  providers:
    openalex:
      enabled: true

Environment variables:

RA_RETRIEVAL__PROVIDERS__OPENALEX__ENABLED=true
RA_RETRIEVAL__PER_PROVIDER_LIMIT=10
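
The variable names suggest an RA_ prefix with a double underscore separating nesting levels, so that each variable addresses one key in the YAML tree. That reading is an assumption, and the project's actual settings loader is not shown here. Under that assumption, a minimal sketch of the mapping (env_to_nested is a hypothetical helper) looks like this:

```python
def env_to_nested(environ: dict, prefix: str = "RA_", delimiter: str = "__") -> dict:
    """Fold prefixed environment variables into a nested dict (illustrative only)."""
    config: dict = {}
    for key, value in environ.items():
        if not key.startswith(prefix):
            continue
        # RA_RETRIEVAL__PROVIDERS__OPENALEX__ENABLED -> retrieval.providers.openalex.enabled
        path = key[len(prefix):].lower().split(delimiter)
        node = config
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = value  # values stay strings; type coercion is left to the loader
    return config


overrides = env_to_nested({
    "RA_RETRIEVAL__PROVIDERS__OPENALEX__ENABLED": "true",
    "RA_RETRIEVAL__PER_PROVIDER_LIMIT": "10",
})
print(overrides)
```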

Health check

_ping() issues a minimal per-page=1 query against the works endpoint. It is used by setup health checks when retrieval providers are validated.
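
A health check of this kind might look like the sketch below. It is not the project's _ping() implementation: the function name ping, the injectable open_url parameter (added so the check can be exercised without network access), and the choice to treat any OSError as unhealthy are all assumptions. The 15-second timeout comes from the HTTP API table.

```python
import urllib.request

HEALTH_URL = "https://api.openalex.org/works?per-page=1"
HEALTH_TIMEOUT_S = 15.0  # health timeout from the HTTP API table


def ping(open_url=urllib.request.urlopen, timeout: float = HEALTH_TIMEOUT_S) -> bool:
    """Return True if the works endpoint answers a one-result query with HTTP 200."""
    try:
        with open_url(HEALTH_URL, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False
```

Calling ping() with no arguments performs a real request; passing a stub for open_url keeps unit tests offline.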

Operational notes

  • There is no explicit 429 handler; rate-limited requests rely on the general retry with exponential backoff
  • Rate limits are generous for polite use; avoid tight loops in batch scripts
  • OpenAlex IDs (https://openalex.org/W...) are used as paper URLs when no landing page exists
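
The retry behaviour described above can be sketched as a generic wrapper. This is an illustration, not the provider's code: the name with_retries, the one-second base delay, the doubling factor, and the reading of "3 retries" as three additional attempts after the first are all assumptions. Because urllib raises HTTPError (an OSError subclass) for a 429 response, a rate-limited request is retried like any other failure, with no special case.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on OSError with exponentially growing delays."""
    for attempt in range(retries + 1):
        try:
            return call()
        except OSError:
            if attempt == retries:
                raise  # retries exhausted; surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s with the defaults
    raise AssertionError("unreachable")
```

Injecting sleep makes the backoff schedule observable in tests without waiting. In a batch script, spacing out calls between queries is still preferable to leaning on retries.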

See also: Provider matrix, Retrieval overview.