Refactor Stats and Narrative services to match spec

- StatsService: Fixed N+1 queries, added missing metrics (whiplash, entropy, lifecycle), and improved correctness (boundary checks, null handling).
- NarrativeService: Added payload shaping for token efficiency, improved JSON robustness, and updated prompts to align with persona specs.
- Documentation: Added backend/TECHNICAL_DOCS.md detailing the logic.
2026-02-25 11:46:07 +00:00 · 2025-12-25 18:12:05 +04:00
parent 508d001d7e
commit af0d985253
3 changed files with 410 additions and 202 deletions
--- a/backend/TECHNICAL_DOCS.md
+++ b/backend/TECHNICAL_DOCS.md
@@ -0,0 +1,95 @@
+# Technical Documentation: Stats & Narrative Services
+
+## Overview
+This document details the implementation of the core analysis engine (`StatsService`) and the AI narration layer (`NarrativeService`). These services transform raw Spotify listening data into computable metrics and human-readable insights.
+
+## 1. StatsService (`backend/app/services/stats_service.py`)
+
+The `StatsService` is a deterministic calculation engine. It takes a time range (`period_start` to `period_end`) and aggregates `PlayHistory` records.
+
+### Core Architecture
+- **Input:** SQLAlchemy Session, Start Datetime, End Datetime.
+- **Output:** A structured JSON dictionary containing discrete analysis blocks (Volume, Time, Sessions, Vibe, etc.).
+- **Optimization:** Uses `joinedload` to eagerly fetch `Track` and `Artist` relations, preventing N+1 query performance issues during iteration.
+
+### Metric Logic
+
+#### A. Volume & Consumption
+- **Top Tracks/Artists:** Aggregated by ID, not name, to handle artist renames or duplicates.
+- **Concentration Metrics:**
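+  - **HHI (Herfindahl–Hirschman Index):** Measures concentration. `SUM(share^2)` over each item's share of total plays. Close to 0 = diverse, 1 = every play on a single item.
+  - **Gini Coefficient:** Measures inequality of the play distribution. 0 = plays spread evenly, close to 1 = a few items dominate.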
+  - **Genre Entropy:** `-SUM(p * log(p))` for genre probabilities. Higher = more diverse genre consumption.
+- **Artists:** Parsed from the `Track.artists` relationship (Many-to-Many) rather than the flat string, ensuring accurate counts for collaborations (e.g., "Drake, Future" counts for both).
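The three concentration metrics above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in `stats_service.py`: the function names are hypothetical, each function takes a list of play counts (one per track, artist, or genre), and the entropy uses the natural logarithm, which is an assumption since the document does not state the log base.

```python
import math


def hhi(counts):
    """Herfindahl-Hirschman Index: sum of squared play shares."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum((c / total) ** 2 for c in counts)


def gini(counts):
    """Gini coefficient of a play-count distribution (0 = perfectly even)."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending-sorted counts.
    weighted = sum(rank * v for rank, v in enumerate(values, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n


def shannon_entropy(counts):
    """-SUM(p * log(p)) over non-zero shares (natural log assumed)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)
```

With four equally played artists, `hhi` gives 0.25 (the minimum for four items) and `gini` gives 0; putting every play on one artist pushes `hhi` to 1.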
+
+#### B. Time & Habits
+- **Part of Day:** Fixed buckets:
+  - Morning: 06:00 - 12:00
+  - Afternoon: 12:00 - 18:00
+  - Evening: 18:00 - 24:00
+  - Night: 00:00 - 06:00
+- **Streaks:** Calculates consecutive days with at least one play.
+- **Active Days:** Count of unique dates with activity.
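A rough sketch of the bucketing and streak logic described above. The function names are hypothetical, and two assumptions are made that the document does not confirm: each bucket is treated as half-open (start inclusive, end exclusive), and timestamps are already in the listener's local timezone.

```python
from datetime import datetime, timedelta


def part_of_day(ts):
    """Map a timestamp to one of the four fixed buckets (half-open, assumed)."""
    hour = ts.hour
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if hour >= 18:
        return "evening"
    return "night"


def active_days(timestamps):
    """Count unique calendar dates with at least one play."""
    return len({ts.date() for ts in timestamps})


def longest_streak(timestamps):
    """Longest run of consecutive calendar days with at least one play."""
    days = sorted({ts.date() for ts in timestamps})
    longest = current = 0
    previous = None
    for day in days:
        if previous is not None and day - previous == timedelta(days=1):
            current += 1
        else:
            current = 1
        longest = max(longest, current)
        previous = day
    return longest
```

Working from the set of unique dates means several plays on the same day neither extend nor break a streak.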
+
+#### C. Session Analytics
+- **Session Definition:** A sequence of plays where the gap between any two consecutive tracks is ≤ 20 minutes. A gap > 20 minutes starts a new session.
+- **Energy Arcs:** Compares the `energy` feature of the first and last track in a session.
+  - Rising: Delta > +0.1
+  - Falling: Delta < -0.1
+  - Flat: Otherwise
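The session split and energy-arc rules above might look like the following. This is an illustrative sketch with hypothetical names: plays are modelled as `(played_at, energy)` tuples already sorted by time, and returning `"unknown"` when an energy value is missing is an assumption, not documented behaviour.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=20)
ARC_THRESHOLD = 0.1


def split_sessions(plays):
    """Group time-sorted (played_at, energy) tuples into sessions."""
    sessions = []
    for play in plays:
        # Compare against the last play of the most recent session.
        if sessions and play[0] - sessions[-1][-1][0] <= SESSION_GAP:
            sessions[-1].append(play)
        else:
            sessions.append([play])
    return sessions


def energy_arc(session):
    """Classify a session by the energy delta between first and last track."""
    first, last = session[0][1], session[-1][1]
    if first is None or last is None:
        return "unknown"  # assumption: missing audio features are not classified
    delta = last - first
    if delta > ARC_THRESHOLD:
        return "rising"
    if delta < -ARC_THRESHOLD:
        return "falling"
    return "flat"
```

A gap of exactly 20 minutes keeps the play in the current session, matching the `≤ 20 minutes` rule; a single-track session always classifies as flat.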
+
+#### D. The "Vibe" (Audio Features)
+- **Aggregation:** Calculates Mean, Standard Deviation, and Percentiles (P10, P50/Median, P90) for all Spotify audio features (Energy, Valence, Danceability, etc.).
+- **Whiplash Score:** Measures the "volatility" of a listening session. Calculated as the average absolute difference in a feature (Tempo, Energy, Valence) between consecutive tracks.
+  - High Whiplash (average tempo jump above roughly 15-20 BPM) = Chaotic playlist shuffling.
+  - Low Whiplash = Smooth transitions.
+- **Profiles:**
+  - **Mood Quadrant:** (Avg Valence, Avg Energy) coordinates.
+  - **Texture:** Acousticness vs. Instrumentalness.
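The Whiplash Score described above reduces to a mean of absolute differences between neighbours. A minimal sketch, assuming the input is one feature's values for a single session in play order; the function name is hypothetical, and skipping pairs where either value is missing is an assumption about null handling.

```python
def whiplash(values):
    """Average absolute change in one audio feature between consecutive tracks."""
    pairs = [
        (a, b)
        for a, b in zip(values, values[1:])
        if a is not None and b is not None
    ]
    if not pairs:
        return None  # fewer than two usable tracks: no transitions to measure
    return sum(abs(b - a) for a, b in pairs) / len(pairs)
```

For tempos of 120, 140 and 100 BPM the jumps are 20 and 40, so the score is 30, which would count as high whiplash under the thresholds above.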
+
+#### E. Context & Behavior
+- **Context URI:** Parsed to determine source (Playlist vs. Album vs. Artist).
+- **Context Switching:** Percentage of track transitions where the `context_uri` changes. High rate = user is jumping between playlists or albums frequently.
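One way to picture the context logic above. Both helper names are hypothetical; the sketch assumes URIs of the form `spotify:<type>:<id>` and returns the switch rate as a fraction (multiply by 100 for a percentage).

```python
def context_type(uri):
    """Extract the source type (playlist, album, artist) from a context URI."""
    if not uri:
        return "unknown"
    parts = uri.split(":")
    if len(parts) >= 3 and parts[0] == "spotify":
        return parts[1]
    return "unknown"


def context_switch_rate(uris):
    """Fraction of consecutive track transitions where the context changes."""
    transitions = list(zip(uris, uris[1:]))
    if not transitions:
        return 0.0
    switches = sum(1 for previous, current in transitions if previous != current)
    return switches / len(transitions)
```

A listener who plays two tracks from a playlist, two from an album, then returns to the playlist has four transitions, two of which are switches, giving a rate of 0.5.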
+
+#### F. Lifecycle & Discovery
+- **Discovery:** Tracks played in the current period that were *never* played before `period_start`.
+- **Obsession:** Tracks with ≥ 5 plays in the current period.
+- **Skip Detection (Boredom Skips):**
+  - Logic: `(next_start - current_start) < (current_duration - 10s)`, i.e. the next track started more than 10 seconds before the current one would have finished.
+  - Only counts if the listening time was > 30s (to filter accidental clicks).
+  - Proxy for "User got bored and hit next."
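The lifecycle rules above can be sketched as follows. All names here are hypothetical, and the track duration is assumed to be available in milliseconds (as Spotify reports it); the thresholds are the ones stated in the list.

```python
from collections import Counter
from datetime import datetime, timedelta

END_TOLERANCE = timedelta(seconds=10)
MIN_LISTEN = timedelta(seconds=30)
OBSESSION_MIN_PLAYS = 5


def discoveries(current_track_ids, earlier_track_ids):
    """Tracks played this period that never appear before period_start."""
    return set(current_track_ids) - set(earlier_track_ids)


def obsessions(current_track_ids):
    """Tracks with at least OBSESSION_MIN_PLAYS plays in the period."""
    counts = Counter(current_track_ids)
    return {track for track, n in counts.items() if n >= OBSESSION_MIN_PLAYS}


def is_boredom_skip(current_start, next_start, duration_ms):
    """True when the listener left after >30s but well before the track ended."""
    listened = next_start - current_start
    duration = timedelta(milliseconds=duration_ms)
    return MIN_LISTEN < listened < duration - END_TOLERANCE
```

For a 200-second track, moving on after 60 seconds counts as a boredom skip, while moving on after 10 seconds (accidental click) or after 195 seconds (effectively finished) does not.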
+
+---
+
+## 2. NarrativeService (`backend/app/services/narrative_service.py`)
+
+The `NarrativeService` acts as an interpreter. It feeds a condensed version of the `StatsService` JSON into Google's Gemini LLM to generate text.
+
+### Payload Shaping
+To ensure reliability and manage token costs, the service **does not** send the full raw stats payload. It pre-processes the stats:
+- Truncates top lists to Top 5.
+- Removes raw transition arrays.
+- Simplifies nested structures.
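The shaping steps above could be approximated like this. It is only a sketch: the key names (`top_*` for ranked lists, `transitions` for raw transition arrays) and the function name are hypothetical, since the actual payload layout is not shown in this document.

```python
import copy

TOP_N = 5


def shape_payload(stats):
    """Return a trimmed copy of the stats dict suitable for an LLM prompt."""
    shaped = copy.deepcopy(stats)  # never mutate the stored metrics

    def walk(node):
        if isinstance(node, dict):
            # Drop raw transition arrays (hypothetical key name).
            node.pop("transitions", None)
            for key, value in node.items():
                if isinstance(value, list) and key.startswith("top_"):
                    node[key] = value[:TOP_N]  # keep only the Top 5
                else:
                    walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(shaped)
    return shaped
```

Working on a deep copy keeps the full `StatsService` output intact for storage while only the trimmed version goes to the model.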
+
+### LLM Prompt Engineering
+The system uses a strict persona ("Witty Music Critic") and enforces specific constraints:
+- **Output:** Strict JSON.
+- **Safety:** Explicitly forbidden from making mental health diagnoses (e.g., no "You seem depressed").
+- **Content:** Must reference specific numbers from the input stats (e.g., "Your 85% Mainstream Score...").
+
+### Output Schema
+The LLM returns a JSON object with:
+- `vibe_check`: 2-3 paragraph summary.
+- `patterns`: List of specific observations.
+- `persona`: A creative 2-3 word label (e.g., "The Genre Chameleon").
+- `roast`: A playful critique.
+- `era_insight`: Commentary on the user's "Musical Age" (weighted avg release year).
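A defensive parser for the schema above might look like the sketch below. The function name is hypothetical and the recovery strategy (slicing from the first `{` to the last `}` to tolerate Markdown fences or stray text around the JSON) is an assumption about what "JSON robustness" involves, not a description of the actual service.

```python
import json

REQUIRED_KEYS = {"vibe_check", "patterns", "persona", "roast", "era_insight"}


def parse_narrative(raw):
    """Extract and validate the JSON object from a raw LLM response string."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in response")
    # json.JSONDecodeError is a ValueError subclass, so callers can catch one type.
    data = json.loads(raw[start : end + 1])
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data
```

Raising a single exception type lets the caller decide whether to retry the prompt or store the snapshot without a narrative.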
+
+## 3. Data Models (`backend/app/models.py`)
+
+- **Track:** Stores static metadata and audio features. `raw_data` stores the full Spotify JSON for future-proofing.
+- **Artist:** Normalized artist entities. Linked to tracks via `track_artists` table.
+- **PlayHistory:** The time-series ledger. Links a `Track` to a timestamp and context.
+- **AnalysisSnapshot:** Stores the final output of these services.
+  - `metrics_payload`: The JSON output of `StatsService`.
+  - `narrative_report`: The JSON output of `NarrativeService`.