arXiv

arXiv covers preprints and e-prints and is especially strong for CS, ML, and physics. The provider is disabled by default; enable it when using the full pipeline or the API (the CLI batch shortcut does not enable it).

Implementation: src/retrieval/providers/arxiv.py

HTTP API

| Attribute | Value |
| --- | --- |
| Search URL | GET https://export.arxiv.org/api/query |
| Query params | search_query, start=0, max_results |
| Response format | Atom XML |
| Authentication | None |
| Timeout | 60s (search), 15s (health) |
| Retries | 3 with exponential backoff |
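
The timeout and retry values in the table can be pictured with a minimal Python sketch. The helper names `backoff_delays` and `fetch_with_retries`, the 1-second base delay, the doubling factor, and the reading of "3 retries" as three attempts after the first one are all assumptions for illustration; the provider in src/retrieval/providers/arxiv.py may be structured differently.

```python
import time
import urllib.error
import urllib.request

SEARCH_URL = "https://export.arxiv.org/api/query"
SEARCH_TIMEOUT_S = 60  # search timeout from the table
MAX_RETRIES = 3        # retry count from the table


def backoff_delays(retries=MAX_RETRIES, base=1.0, factor=2.0):
    """Wait before each retry: base, base*factor, ... (assumed schedule)."""
    return [base * factor ** i for i in range(retries)]


def fetch_with_retries(url, timeout=SEARCH_TIMEOUT_S,
                       opener=urllib.request.urlopen, sleep=time.sleep):
    """GET `url` and return the body, retrying on network errors.

    `opener` and `sleep` are injectable so the retry logic can be
    exercised without touching the network (hypothetical design choice).
    """
    delays = backoff_delays()
    for attempt in range(len(delays) + 1):
        try:
            with opener(url, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, TimeoutError):
            if attempt == len(delays):
                raise  # retries exhausted, surface the last error
            sleep(delays[attempt])
```

Injecting `opener` and `sleep` keeps the sketch testable offline; a real call would simply be `fetch_with_retries(SEARCH_URL + "?search_query=all:transformer&max_results=1")`.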

Query syntax

The provider preserves arXiv field syntax when present:

| User query | search_query sent |
| --- | --- |
| ti:attention abs:transformer | unchanged |
| transformer attention | all:transformer attention |

Field prefixes: ti: (title), abs: (abstract), au: (author), etc.
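
The rule above (keep field syntax when present, otherwise search all fields) could look roughly like the following. This is a sketch, not the provider's code: `to_search_query` is a hypothetical name, and the prefix list is the set documented in the arXiv API user manual.

```python
import re

# Field prefixes listed in the arXiv API user manual.
FIELD_PREFIXES = ("ti", "au", "abs", "co", "jr", "cat", "rn", "id", "all")

# A prefix counts only when it starts a word and is followed by a colon.
_FIELD_RE = re.compile(r"\b(?:%s):" % "|".join(FIELD_PREFIXES))


def to_search_query(user_query: str) -> str:
    """Return the query unchanged if it uses field syntax, else prefix all:."""
    q = user_query.strip()
    if _FIELD_RE.search(q):
        return q
    return f"all:{q}"
```

With this sketch, `to_search_query("ti:attention abs:transformer")` is returned as-is, while `to_search_query("transformer attention")` becomes `all:transformer attention`, matching the two cases listed above.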

Example request

GET https://export.arxiv.org/api/query?search_query=all:transformer+attention&start=0&max_results=8
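
For reference, a request URL of this shape can be assembled with the standard library. `build_search_url` is a hypothetical helper, and the default of 8 results is taken from the example request rather than from any documented setting.

```python
from urllib.parse import urlencode

BASE_URL = "https://export.arxiv.org/api/query"


def build_search_url(search_query: str, start: int = 0, max_results: int = 8) -> str:
    """Build the GET URL for an arXiv API search."""
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
    }
    # safe=":" leaves the field separator unescaped; spaces are encoded as "+".
    return f"{BASE_URL}?{urlencode(params, safe=':')}"


print(build_search_url("all:transformer attention"))
```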

Normalization

Each Atom entry is parsed into a RetrievedPaper:

| Atom element | RetrievedPaper field |
| --- | --- |
| title | title |
| summary | abstract |
| published (year) | year |
| arxiv:primary_category | venue (category label) |
| PDF link or id | url |
| arxiv:doi (if present) | doi |
| Author list | authors |
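
The mapping above can be sketched with `xml.etree.ElementTree`. The `RetrievedPaper` dataclass here is a hypothetical stand-in with only the fields named in the table (the real class may carry more), and `parse_feed` / `parse_entry` are illustrative names, not the provider's actual functions.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List, Optional

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}


@dataclass
class RetrievedPaper:  # hypothetical shape, fields taken from the table
    title: str
    abstract: str
    year: Optional[int]
    venue: Optional[str]
    url: Optional[str]
    doi: Optional[str]
    authors: List[str] = field(default_factory=list)


def _text(node, path):
    """Return whitespace-collapsed text of a child element, or None."""
    child = node.find(path, NS)
    if child is None or not child.text:
        return None
    return " ".join(child.text.split())


def parse_entry(entry) -> RetrievedPaper:
    published = _text(entry, "atom:published")
    category = entry.find("arxiv:primary_category", NS)
    # arXiv marks the PDF link with title="pdf"; fall back to the entry id.
    pdf = next(
        (link.get("href") for link in entry.findall("atom:link", NS)
         if link.get("title") == "pdf"),
        None,
    )
    names = (_text(a, "atom:name") for a in entry.findall("atom:author", NS))
    return RetrievedPaper(
        title=_text(entry, "atom:title") or "",
        abstract=_text(entry, "atom:summary") or "",
        year=int(published[:4]) if published else None,
        venue=category.get("term") if category is not None else None,
        url=pdf or _text(entry, "atom:id"),
        doi=_text(entry, "arxiv:doi"),  # None when the element is absent
        authors=[n for n in names if n],
    )


def parse_feed(xml_text: str) -> List[RetrievedPaper]:
    root = ET.fromstring(xml_text)
    return [parse_entry(e) for e in root.findall("atom:entry", NS)]
```

Collapsing whitespace in `_text` is a design choice in this sketch: arXiv titles and summaries often contain hard line breaks that are rarely wanted downstream.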

Configuration

retrieval:
  providers:
    arxiv:
      enabled: true

Equivalent environment variable:

RA_RETRIEVAL__PROVIDERS__ARXIV__ENABLED=true
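
The environment variable name mirrors the YAML path. The sketch below shows one way such a name could map onto nested keys, assuming an `RA_` prefix and `__` as the nesting delimiter. Both are inferred from the variable name above, and `env_to_nested` is a hypothetical helper, not the project's settings loader.

```python
def env_to_nested(env, prefix="RA_", delimiter="__"):
    """Map PREFIX_A__B__C=value entries onto a nested dict {a: {b: {c: value}}}."""
    result = {}
    for key, value in env.items():
        if not key.startswith(prefix):
            continue  # unrelated environment variable
        parts = key[len(prefix):].lower().split(delimiter)
        node = result
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        # Values stay strings here; a real loader would likely coerce types.
        node[parts[-1]] = value
    return result


settings = env_to_nested({"RA_RETRIEVAL__PROVIDERS__ARXIV__ENABLED": "true"})
print(settings["retrieval"]["providers"]["arxiv"]["enabled"])
```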

CLI limitation

python -m src "query" does not enable arXiv. Use interactive mode, the API, or run_research() with custom settings.

Health check

The health check issues a search with max_results=1 and a 15s timeout.

Operational notes

  • arXiv's API terms of use ask for a 3-second delay between calls; the pipeline's concurrency limit helps avoid hammering the API
  • Preprint-heavy results may rank differently than peer-reviewed venues from OpenAlex/CrossRef
  • Good complement when researching very recent ML work not yet indexed elsewhere
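
If the 3-second spacing mentioned above needs to be enforced explicitly rather than left to the concurrency limit, a small throttle is one option. This is a sketch only: `MinIntervalThrottle` is a hypothetical class, it is not thread-safe, and nothing here is taken from the pipeline's own code.

```python
import time


class MinIntervalThrottle:
    """Block so that consecutive wait() calls are at least min_interval apart."""

    def __init__(self, min_interval=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable for testing
        self._sleep = sleep    # injectable for testing
        self._last = None      # time of the previous call, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each arXiv request keeps a single worker within the requested spacing; with several workers, the throttle would need a lock or a shared scheduler.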

See also: Provider matrix, Configuration cookbook.