
# Technical Documentation: Stats & Narrative Services

## Overview
This document details the implementation of the core analysis engine (`StatsService`) and the AI narration layer (`NarrativeService`). These services transform raw Spotify listening data into computable metrics and human-readable insights.

## 1. StatsService (`backend/app/services/stats_service.py`)

The `StatsService` is a deterministic calculation engine. It takes a time range (`period_start` to `period_end`) and aggregates `PlayHistory` records.

### Core Architecture
- **Input:** SQLAlchemy Session, Start Datetime, End Datetime.
- **Output:** A structured JSON dictionary containing discrete analysis blocks (Volume, Time, Sessions, Vibe, etc.).
- **Optimization:** Uses `joinedload` to eagerly fetch `Track` and `Artist` relations, preventing N+1 query performance issues during iteration.

### Metric Logic

#### A. Volume & Consumption
- **Top Tracks/Artists:** Aggregated by ID, not name, to handle artist renames or duplicates.
- **Concentration Metrics:**
  - **HHI (Herfindahl–Hirschman Index):** Measures concentration of plays. `SUM(share^2)`. Close to 0 = diverse, close to 1 = repetitive.
  - **Gini Coefficient:** Measures inequality of play distribution. 0 = plays spread evenly across items, approaching 1 = a few items dominate.
  - **Genre Entropy:** `-SUM(p * log(p))` for genre probabilities. Higher = more diverse genre consumption.
- **Artists:** Parsed from the `Track.artists` relationship (Many-to-Many) rather than the flat string, ensuring accurate counts for collaborations (e.g., "Drake, Future" counts for both).
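
The three concentration metrics can be sketched in a few lines of plain Python. This is a minimal illustration, not the service's actual code: the function names are hypothetical, the input is assumed to be a list of play counts per track, artist, or genre, and the natural logarithm is assumed for entropy.

```python
import math


def shares(counts):
    """Convert raw play counts into shares that sum to 1 (zero counts dropped)."""
    total = sum(counts)
    if total == 0:
        return []
    return [c / total for c in counts if c > 0]


def hhi(counts):
    """Herfindahl-Hirschman Index: sum of squared shares."""
    return sum(s * s for s in shares(counts))


def gini(counts):
    """Gini coefficient using the sorted-rank formula."""
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def entropy(counts):
    """Shannon entropy -SUM(p * log(p)), natural log (an assumption)."""
    return -sum(p * math.log(p) for p in shares(counts))


even = [10, 10, 10, 10]   # four items played equally often
skewed = [0, 0, 0, 40]    # every play went to one item
print(hhi(even), hhi(skewed))
print(gini(even), gini(skewed))
```

With evenly spread plays, HHI equals `1/n` and Gini is 0. When one item takes every play, HHI reaches 1 and Gini approaches its maximum of `(n - 1) / n`.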

#### B. Time & Habits
- **Part of Day:** Fixed buckets (start inclusive, end exclusive):
  - Morning: 06:00 - 12:00
  - Afternoon: 12:00 - 18:00
  - Evening: 18:00 - 24:00
  - Night: 00:00 - 06:00
- **Streaks:** Calculates consecutive days with at least one play.
- **Active Days:** Count of unique dates with activity.
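
A rough sketch of the bucketing and streak logic follows. The helper names are hypothetical, timestamps are assumed to already be in the listener's local time, and each bucket is treated as start-inclusive (also an assumption).

```python
from datetime import datetime, timedelta


def part_of_day(ts: datetime) -> str:
    """Map a timestamp to one of the four fixed buckets."""
    hour = ts.hour
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if hour >= 18:
        return "evening"
    return "night"


def active_days(timestamps) -> int:
    """Count unique calendar dates with at least one play."""
    return len({ts.date() for ts in timestamps})


def longest_streak(timestamps) -> int:
    """Length of the longest run of consecutive days with at least one play."""
    best = current = 0
    previous = None
    for day in sorted({ts.date() for ts in timestamps}):
        if previous is not None and day - previous == timedelta(days=1):
            current += 1
        else:
            current = 1
        best = max(best, current)
        previous = day
    return best


plays = [
    datetime(2024, 1, 1, 8, 30),
    datetime(2024, 1, 2, 23, 59),
    datetime(2024, 1, 3, 2, 15),
    datetime(2024, 1, 5, 13, 0),
]
print([part_of_day(p) for p in plays])
print(active_days(plays), longest_streak(plays))
```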

#### C. Session Analytics
- **Session Definition:** A sequence of plays where the gap between any two consecutive tracks is ≤ 20 minutes. A gap > 20 minutes starts a new session.
- **Energy Arcs:** Compares the `energy` feature of the first and last track in a session.
  - Rising: Delta > +0.1
  - Falling: Delta < -0.1
  - Flat: Otherwise
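
The session split and arc classification can be illustrated as below. This is a sketch under stated assumptions: plays are `(played_at, energy)` tuples already sorted by time, the gap is measured between consecutive start timestamps, every play has an `energy` value, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=20)


def split_sessions(plays):
    """Group time-sorted (played_at, energy) tuples into sessions."""
    sessions = []
    for play in plays:
        last_play = sessions[-1][-1] if sessions else None
        if last_play is not None and play[0] - last_play[0] <= SESSION_GAP:
            sessions[-1].append(play)   # gap <= 20 min: same session
        else:
            sessions.append([play])     # gap > 20 min (or first play): new session
    return sessions


def energy_arc(session):
    """Classify a session by the energy delta between its last and first track."""
    delta = session[-1][1] - session[0][1]
    if delta > 0.1:
        return "rising"
    if delta < -0.1:
        return "falling"
    return "flat"


t0 = datetime(2024, 1, 1, 20, 0)
plays = [
    (t0, 0.3),
    (t0 + timedelta(minutes=5), 0.4),
    (t0 + timedelta(minutes=25), 0.5),   # exactly 20 min after the previous play
    (t0 + timedelta(minutes=46), 0.8),   # 21 min gap starts a new session
]
sessions = split_sessions(plays)
print([len(s) for s in sessions], [energy_arc(s) for s in sessions])
```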

#### D. The "Vibe" (Audio Features)
- **Aggregation:** Calculates Mean, Standard Deviation, and Percentiles (P10, P50/Median, P90) for all Spotify audio features (Energy, Valence, Danceability, etc.).
- **Whiplash Score:** Measures the "volatility" of a listening session. Calculated as the average absolute difference in a feature (Tempo, Energy, Valence) between consecutive tracks.
  - High Whiplash (for Tempo, an average jump above roughly 15-20 BPM) = Chaotic playlist shuffling.
  - Low Whiplash = Smooth transitions.
- **Profiles:**
  - **Mood Quadrant:** (Avg Valence, Avg Energy) coordinates.
  - **Texture:** Acousticness vs. Instrumentalness.
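
As an illustration, the Whiplash Score and the summary statistics can be computed with the standard library alone. The function names are hypothetical, and two details are assumptions rather than documented behaviour: population standard deviation is used, and percentiles use the "inclusive" interpolation method.

```python
import statistics


def whiplash(values):
    """Average absolute difference between consecutive feature values."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return sum(diffs) / len(diffs) if diffs else 0.0


def summarize(values):
    """Mean, standard deviation and P10/P50/P90 for one audio feature.

    Needs at least two values on older Python versions.
    """
    # quantiles(n=10) returns 9 cut points: index 0 is P10, index 8 is P90.
    deciles = statistics.quantiles(values, n=10, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.pstdev(values),
        "p10": deciles[0],
        "p50": statistics.median(values),
        "p90": deciles[8],
    }


tempos = [120, 140, 100]
print(whiplash(tempos))  # (20 + 40) / 2 = 30.0
print(summarize([0.2, 0.4, 0.5, 0.7, 0.9]))
```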

#### E. Context & Behavior
- **Context URI:** Parsed to determine source (Playlist vs. Album vs. Artist).
- **Context Switching:** Percentage of track transitions where the `context_uri` changes. High rate = user is jumping between playlists or albums frequently.
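
A possible shape for the context logic is sketched below. It relies on the Spotify URI format `spotify:<type>:<id>`; the function names and the `"unknown"` fallback are hypothetical, and how the real service treats plays without a context is not documented here.

```python
def context_type(uri):
    """Return the source type ("playlist", "album", "artist", ...) from a Spotify URI."""
    if not uri:
        return "unknown"
    parts = uri.split(":")
    if len(parts) < 3 or parts[0] != "spotify":
        return "unknown"
    # The type is the segment just before the ID; this also covers the older
    # "spotify:user:<name>:playlist:<id>" form.
    return parts[-2]


def context_switch_rate(uris):
    """Percentage of consecutive play pairs whose context_uri differs."""
    transitions = list(zip(uris, uris[1:]))
    if not transitions:
        return 0.0
    changes = sum(1 for a, b in transitions if a != b)
    return 100.0 * changes / len(transitions)


history = [
    "spotify:playlist:aaa",
    "spotify:playlist:aaa",
    "spotify:album:bbb",
    "spotify:album:bbb",
    "spotify:artist:ccc",
]
print([context_type(u) for u in history])
print(context_switch_rate(history))  # 2 changes over 4 transitions = 50.0
```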

#### F. Lifecycle & Discovery
- **Discovery:** Tracks played in the current period that were *never* played before `period_start`.
- **Obsession:** Tracks with ≥ 5 plays in the current period.
- **Skip Detection (Boredom Skips):**
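  - Logic: `(next_start - current_start) < (current_duration - 10s)`, i.e. the next track started more than 10 seconds before the current one would have finished.
  - Only counts if the listening time (`next_start - current_start`) was > 30s (to filter accidental clicks).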
  - Proxy for "User got bored and hit next."
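
The lifecycle rules above translate fairly directly into code. The following is a sketch with hypothetical function names; it assumes track durations are available in milliseconds (as in Spotify's `duration_ms` field) and that track IDs from before `period_start` have already been collected.

```python
from collections import Counter
from datetime import datetime, timedelta


def is_boredom_skip(current_start, next_start, duration_ms):
    """True if the track was left early, but only after more than 30s of listening."""
    listened = (next_start - current_start).total_seconds()
    duration = duration_ms / 1000
    return 30 < listened < duration - 10


def discoveries(period_track_ids, earlier_track_ids):
    """Tracks played in the period that never appeared before period_start."""
    return set(period_track_ids) - set(earlier_track_ids)


def obsessions(period_track_ids, min_plays=5):
    """Tracks with at least `min_plays` plays in the period."""
    counts = Counter(period_track_ids)
    return {track for track, n in counts.items() if n >= min_plays}


start = datetime(2024, 1, 1, 9, 0, 0)
# A 200-second track abandoned after 60 seconds counts as a boredom skip.
print(is_boredom_skip(start, start + timedelta(seconds=60), 200_000))
# Abandoned after 20 seconds: treated as an accidental click, not a skip.
print(is_boredom_skip(start, start + timedelta(seconds=20), 200_000))
```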

---

## 2. NarrativeService (`backend/app/services/narrative_service.py`)

The `NarrativeService` acts as an interpreter. It feeds a condensed version of the `StatsService` JSON into Google's Gemini LLM to generate text.

### Payload Shaping
To ensure reliability and manage token costs, the service **does not** send the full, raw stats payload. It pre-processes the stats:
- Truncates top lists to Top 5.
- Removes raw transition arrays.
- Simplifies nested structures.
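
One way the shaping step could look is sketched below. The key names (`top_tracks`, `transitions`) and the `top_` prefix convention are assumptions made for illustration; the real payload keys may differ.

```python
def shape_payload(stats, top_n=5, drop_keys=("transitions",)):
    """Return a trimmed copy of the stats dict suitable for an LLM prompt."""
    shaped = {}
    for key, value in stats.items():
        if key in drop_keys:
            continue                      # remove raw transition arrays
        if isinstance(value, dict):
            value = shape_payload(value, top_n, drop_keys)
        elif isinstance(value, list) and key.startswith("top_"):
            value = value[:top_n]         # keep only the Top N entries
        shaped[key] = value
    return shaped


stats = {
    "volume": {"top_tracks": list(range(10)), "total_plays": 42},
    "vibe": {"transitions": [1, 2, 3], "whiplash": {"tempo": 12.5}},
}
print(shape_payload(stats))
```

Because the function builds a new dictionary instead of mutating its input, the full `StatsService` output stays available for storage.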

### LLM Prompt Engineering
The system uses a strict persona ("Witty Music Critic") and enforces specific constraints:
- **Output:** Strict JSON.
- **Safety:** Explicitly forbidden from making mental health diagnoses (e.g., no "You seem depressed").
- **Content:** Must reference specific numbers from the input stats (e.g., "Your 85% Mainstream Score...").

### Output Schema
The LLM returns a JSON object with:
- `vibe_check`: 2-3 paragraph summary.
- `patterns`: List of specific observations.
- `persona`: A creative 2-3 word label (e.g., "The Genre Chameleon").
- `roast`: A playful critique.
- `era_insight`: Commentary on the user's "Musical Age" (weighted avg release year).
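
Since LLM output is not guaranteed to follow the schema, a defensive parse step is useful. The sketch below is hypothetical: the function name is illustrative, and the expected types (strings for everything except `patterns`) are inferred from the field descriptions above rather than taken from the service.

```python
import json

# Field name -> expected Python type (types are an assumption).
EXPECTED_FIELDS = {
    "vibe_check": str,
    "patterns": list,
    "persona": str,
    "roast": str,
    "era_insight": str,
}


def parse_narrative(raw_text):
    """Parse the LLM response and check it against the expected fields."""
    data = json.loads(raw_text)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"wrong type for field: {field}")
    return data


reply = json.dumps({
    "vibe_check": "A week of late-night synths.",
    "patterns": ["Most plays happened at night."],
    "persona": "The Genre Chameleon",
    "roast": "Five plays of the same song is a cry for a new playlist.",
    "era_insight": "Your weighted average release year sits in the 2010s.",
})
print(parse_narrative(reply)["persona"])
```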

## 3. Data Models (`backend/app/models.py`)

- **Track:** Stores static metadata and audio features. `raw_data` stores the full Spotify JSON for future-proofing.
- **Artist:** Normalized artist entities. Linked to tracks via `track_artists` table.
- **PlayHistory:** The timeseries ledger. Links `Track` to a timestamp and context.
- **AnalysisSnapshot:** Stores the final output of these services.
  - `metrics_payload`: The JSON output of `StatsService`.
  - `narrative_report`: The JSON output of `NarrativeService`.