# PodcastIQ — Technical Overview, UX, and Current Status

*What the system is. How a host uses it. How the pieces fit. Where the build stands today.*

---

## Scope of This Document

This is the product- and architecture-level view of PodcastIQ for readers who want more than the summary but do not need the full internal decision log or the per-phase task plan. Proprietary implementation details — specific provider rankings, per-agent prompt structure, internal benchmarks, and the exact escape-valve logic that gates each layer — live in the private decision log. What follows is the shape of the system, the host-facing UX, and where v0.1 currently sits in the pipeline.

---

## The Problem, Stated Precisely

Long-form interview podcasts and Spaces-style live audio shows have two persistent failure modes:

1. **Claim drift.** A guest makes a specific, checkable claim. The host either takes it at face value, hesitates while trying to recall counter-evidence, or loses the thread entirely while chasing it mentally. The moment passes, and the claim goes on the record unchallenged.
2. **Preparation asymmetry.** A host who prepares deeply for one guest still cannot cover every adjacent thread the conversation will open. A host who prepares lightly gets dragged by the guest's agenda.

Both problems are structural features of live conversation, not skill issues. PodcastIQ is built around the observation that the cognitive bottleneck belongs to the host, and that grounded LLM tooling can run fast enough to relieve it without becoming a distraction.

---

## System Shape

PodcastIQ runs locally on the host's workstation. A single process launches the audio capture loop, the transcription pipeline, the agent harness, the knowledge base, and a local web server that serves the host's display surfaces. No external backend is required. Everything the system produces — transcripts, dossiers, agent outputs, knowledge base entries — is written to local storage by default.

The host supplies their own provider keys for transcription, search, and LLM inference. The reference implementation ships with opinionated defaults at each layer, but every layer is swappable without touching the code that sits above or below it.
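One way to keep a layer swappable is to make the code above it depend only on a small vendor-neutral interface and resolve the concrete adapter from configuration. The sketch below assumes Python `typing.Protocol` for that interface; `LLMProvider`, `EchoProvider`, `register`, `load_provider`, and the config keys `llm_provider` and `llm_api_key` are hypothetical names for illustration, not PodcastIQ's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Protocol


class LLMProvider(Protocol):
    """Vendor-neutral interface the agent layer depends on (hypothetical)."""

    def complete(self, prompt: str) -> str: ...


@dataclass
class EchoProvider:
    """Stand-in adapter; a real one would wrap a vendor SDK using api_key."""

    api_key: str = ""

    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


# Maps a config name to a factory that receives only that provider's key.
_REGISTRY: Dict[str, Callable[[str], LLMProvider]] = {}


def register(name: str, factory: Callable[[str], LLMProvider]) -> None:
    _REGISTRY[name] = factory


def load_provider(config: Dict[str, str]) -> LLMProvider:
    """Resolve the adapter named in config; callers never see the vendor."""
    name = config["llm_provider"]
    if name not in _REGISTRY:
        raise KeyError(f"no adapter registered for {name!r}")
    return _REGISTRY[name](config.get("llm_api_key", ""))


register("echo", lambda key: EchoProvider(api_key=key))
```

Under this arrangement, swapping vendors means registering a new adapter and changing `llm_provider` in the config; agent code that calls `complete` is untouched.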

Five layers, each with a clean interface boundary:

| Layer | Responsibility |
|---|---|
| Capture | Microphone capture, voice activity detection, utterance segmentation |
| Transcription | Streaming speech-to-text with speaker diarization |
| Agents | Four hotkey-triggered research agents with grounded tool access |
| Knowledge | Embedded vector store for dossiers, transcripts, and host notes |
| Surface | Local server driving three coordinated browser-based views |
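The Capture row pairs voice activity detection with utterance segmentation. Below is a minimal sketch of the segmentation half. It assumes per-frame energy values have already been computed (for example, RMS over fixed-length frames) and uses a plain energy threshold in place of whatever VAD PodcastIQ actually uses. The `threshold` and `min_silence_frames` defaults are illustrative, not tuned values.

```python
from typing import List, Optional, Sequence, Tuple


def segment_utterances(
    frame_energy: Sequence[float],
    threshold: float = 0.02,
    min_silence_frames: int = 3,
) -> List[Tuple[int, int]]:
    """Group voiced frames into utterances as (start, end) pairs, end exclusive.

    A frame counts as voiced when its energy exceeds `threshold`. An utterance
    closes after `min_silence_frames` consecutive unvoiced frames, so short
    pauses inside a sentence do not split it.
    """
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None
    silence = 0
    for i, energy in enumerate(frame_energy):
        if energy > threshold:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:
                # End at the first frame of the closing silent run.
                segments.append((start, i - silence + 1))
                start = None
                silence = 0
    if start is not None:
        # Flush an utterance still open at end of input, dropping trailing silence.
        segments.append((start, len(frame_energy) - silence))
    return segments
```

For example, `segment_utterances([0, 0.5, 0.5, 0, 0, 0, 0.4, 0])` returns `[(1, 3), (6, 7)]`: the three-frame gap closes the first utterance, and the second one is flushed at the end of input.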

The system deliberately avoids a monolithic "smart assistant" shape. The agents are specific; the surfaces are specific; the hotkeys are specific. The host does not converse with the system — they trigger it.

---

## The Four Agents

Each agent has a fixed role and a dedicated hotkey. Each receives a structured slice of recent transcript plus retrieval context from the host's knowledge base, issues search or retrieval calls as needed, and streams a constrained response back to the host's live view.

**Fact-Check.** Targets a specific claim in the recent transcript. Returns a concise verdict grounded in retrieved sources. Designed to be scan-readable mid-conversation.

**Counterpoint.** Constructs the strongest reasonable argument against the current line of discussion. Intended to let the host represent an opposing view in real time without having to hold the full opposing case in their head.

**Follow-Up.** Proposes the next high-signal question based on what the guest has just said and what the host has not yet asked. Reduces the interview-planning burden during the show itself.

**Guest Research.** Surfaces context on the current guest — recent public statements, prior interviews, positions, affiliations. This is the live-lookup sibling of the pre-show dossier builder.

Each agent is grounded: responses cite the transcript window and the retrieval context they used, so the host can evaluate the source before acting on the suggestion.
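The grounding rule can be enforced where an agent's draft becomes a response: no citations, no answer. The sketch below uses hypothetical `Citation` and `AgentResponse` types and a `finalize` step to illustrate the structured non-answer described under Architectural Invariants. It is not PodcastIQ's actual response schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Citation:
    source: str   # "transcript" or a knowledge-base document id (hypothetical scheme)
    excerpt: str  # the span the answer relies on


@dataclass
class AgentResponse:
    agent: str
    answer: Optional[str]                 # None marks a structured non-answer
    citations: List[Citation] = field(default_factory=list)
    reason: str = ""                      # filled in only for non-answers


def finalize(agent: str, draft: Optional[str], citations: List[Citation]) -> AgentResponse:
    """Release a draft only if it is grounded; otherwise return a non-answer."""
    if draft and citations:
        return AgentResponse(agent=agent, answer=draft, citations=list(citations))
    return AgentResponse(
        agent=agent,
        answer=None,
        reason="no supporting source in the transcript window or retrieval context",
    )
```

A panel could then render a non-answer distinctly, for instance showing only the reason, so the host never mistakes it for a verdict.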

---

## UX and Host-Facing Flows

### Flow 1: Dashboard and Setup

The dashboard is the host's configuration and launch surface. It handles:

- Provider key management (bring-your-own-key, or BYOK, for transcription, search, and LLM).
- Audio device selection and capture test.
- Session launch — the host names the session, optionally attaches a dossier, and starts the live view.
- Knowledge base browsing — prior episodes, dossiers, notes, and any other artifacts the host has ingested.
- Telemetry panels for recent sessions.

The dashboard is deliberately boring. Setup is a pre-show activity; it is not where the product's value lives.

### Flow 2: The Pre-Show Dossier

Before a show, the host can point the dossier builder at a guest and let it assemble a structured brief: who the guest is, what they have said recently, what recurring claims or positions they hold, what open threads exist from prior appearances. The dossier is stored in the knowledge base and attached to the session so the agents can draw on it live.

The dossier is explicitly an asynchronous product. It runs in a prep window, not during the show. Its role on-air is to be available to the retrieval layer, not to be read line-by-line.
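Because the dossier's on-air job is to feed retrieval, it helps to store it as structured fields that can be split into individually retrievable chunks. The field names below mirror the brief described above (who the guest is, recent statements, recurring positions, open threads). The `Dossier` class and `to_chunks` helper are hypothetical, and the real artifact format is not documented here.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Dossier:
    guest: str
    summary: str                                                   # who the guest is
    recent_statements: List[str] = field(default_factory=list)
    recurring_positions: List[str] = field(default_factory=list)
    open_threads: List[str] = field(default_factory=list)          # from prior appearances


def to_chunks(d: Dossier) -> List[Dict[str, str]]:
    """Split a dossier into one retrievable chunk per item, tagged by guest and field."""
    chunks = [{"guest": d.guest, "field": "summary", "text": d.summary}]
    for name in ("recent_statements", "recurring_positions", "open_threads"):
        for item in getattr(d, name):
            chunks.append({"guest": d.guest, "field": name, "text": item})
    return chunks
```

Chunking per item lets a live agent pull one relevant prior statement without dragging the whole brief into its context.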

### Flow 3: The Live Session

When the host starts a session, the live view opens. The layout is:

- **Transcript pane.** Streaming speaker-diarized transcript, auto-scrolling, with the most recent utterances anchored. Older utterances remain in view for hotkey targeting.
- **Agent panels.** Four surfaces, one per agent. A panel fills when its hotkey fires and the agent streams its response.
- **Status bar.** Capture health, transcription lag, and the current session's dossier attachment.

The host's primary interaction during the show is the keyboard. A single keystroke triggers the corresponding agent against the most recent transcript window. The agent's response streams into its panel within a latency budget tight enough that it is still useful before the conversation moves on. The host reads it, decides whether to use it, and returns attention to the guest.

There is no chat box. There is no prompt field. The system is designed so that the host is not formulating queries during the show.
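The keystroke path can be sketched as a fixed key-to-agent map plus a time-based transcript window. The F1 to F4 bindings, the 30-second default window, and the `on_key` and `recent_window` names are placeholders chosen for illustration, not PodcastIQ's actual defaults.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass
class Utterance:
    speaker: str
    text: str
    t_end: float  # seconds since session start when the utterance finished


# Placeholder bindings; PodcastIQ's actual keys are not specified here.
KEYMAP: Dict[str, str] = {
    "F1": "fact_check",
    "F2": "counterpoint",
    "F3": "follow_up",
    "F4": "guest_research",
}


def recent_window(transcript: Sequence[Utterance], now: float, seconds: float) -> List[Utterance]:
    """Utterances that finished within the last `seconds` seconds."""
    return [u for u in transcript if now - u.t_end <= seconds]


def on_key(key: str, transcript: Sequence[Utterance], now: float,
           seconds: float = 30.0) -> Optional[Dict[str, object]]:
    """Map a keystroke to an agent job; unbound keys are ignored."""
    agent = KEYMAP.get(key)
    if agent is None:
        return None
    return {"agent": agent, "window": recent_window(transcript, now, seconds)}
```

The point of the design is that the host supplies only the keystroke; the window and the agent's role supply everything a typed prompt would otherwise carry.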

### Flow 4: OBS Browser Source

For hosts who stream, a compact browser-source view is available at a separate local URL. It is styled for overlay composition: chroma-friendly backgrounds, tight typography, and restrained motion. It reflects the same agent output as the live view, so a host in streaming mode gets the same research surface, readable inside their broadcast scene.
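Since the overlay mirrors the live view's agent output, one simple arrangement is a publish/subscribe hub that fans each streamed chunk out to every connected view. This is a minimal in-process sketch built on the standard-library `queue` module. `OutputHub` is a hypothetical name, and a real local server would more likely push to browsers over Server-Sent Events or WebSockets.

```python
import queue
import threading
from typing import Dict, List


class OutputHub:
    """Fan each streamed agent chunk out to every subscribed view."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._subscribers: List["queue.Queue[Dict[str, str]]"] = []

    def subscribe(self) -> "queue.Queue[Dict[str, str]]":
        """Register a view (e.g. live session or OBS overlay) and return its queue."""
        q: "queue.Queue[Dict[str, str]]" = queue.Queue()
        with self._lock:
            self._subscribers.append(q)
        return q

    def publish(self, agent: str, chunk: str) -> None:
        """Deliver one chunk to every subscriber, each getting its own copy."""
        with self._lock:
            targets = list(self._subscribers)
        for q in targets:
            q.put({"agent": agent, "chunk": chunk})
```

Because both views read from the same hub, the overlay cannot drift out of sync with what the host sees in the live view.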

### Flow 5: Post-Show Ingest

When a session ends, the transcript, the agent outputs, and any host annotations are ingested into the knowledge base. Future sessions — including future episodes with the same guest — draw on this history through the retrieval layer. The back catalogue compounds.
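Post-show ingest can be sketched as writing each transcript line, agent output, and annotation into the store under a session-scoped ID, so later retrieval can find it. To stay self-contained, this sketch substitutes a toy bag-of-words vector and cosine similarity for a learned embedding model and a real embedded vector store. `KnowledgeBase` and `ingest_session` are hypothetical names.

```python
import math
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words vector standing in for a learned embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


@dataclass
class KnowledgeBase:
    docs: List[Tuple[str, str, Counter]] = field(default_factory=list)  # (doc_id, text, vector)

    def ingest(self, doc_id: str, text: str) -> None:
        self.docs.append((doc_id, text, embed(text)))

    def search(self, query: str, k: int = 3) -> List[Tuple[str, float]]:
        """Return the k most similar (doc_id, score) pairs, best first."""
        q = embed(query)
        scored = [(doc_id, cosine(q, vec)) for doc_id, _, vec in self.docs]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


def ingest_session(kb: KnowledgeBase, session_id: str, transcript: Sequence[str],
                   agent_outputs: Sequence[str], annotations: Sequence[str]) -> None:
    """Write every session artifact under a session-scoped id for future retrieval."""
    for kind, items in (("transcript", transcript), ("agent", agent_outputs), ("note", annotations)):
        for i, text in enumerate(items):
            kb.ingest(f"{session_id}/{kind}/{i}", text)
```

Agent outputs are ingested alongside the transcript, so a later session can retrieve a previous fact-check as readily as the words that prompted it.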

---

## Key Management and Privacy Posture

- Provider keys are stored in the host's local configuration. Each key is sent only to the provider it belongs to, and nowhere else.
- The local server binds to loopback only. External network exposure is off by default and is not a supported configuration in v0.1.
- Transcripts and dossiers are written to the local filesystem in a location the host controls.
- The host is responsible for their own backup posture. The system does not push artifacts to cloud storage automatically.
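Binding to loopback is a one-line decision in most HTTP servers; the useful part is refusing anything else. Here is a sketch using Python's standard-library `http.server`, assuming the bind host is given as a literal IP address (a hostname such as `localhost` would be rejected by `ipaddress.ip_address`). `make_server` and the trivial handler are illustrative only, not PodcastIQ's server code.

```python
import ipaddress
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

LOOPBACK = "127.0.0.1"


class StatusHandler(BaseHTTPRequestHandler):
    """Trivial handler standing in for the real dashboard and view routes."""

    def do_GET(self) -> None:
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(host: str = LOOPBACK, port: int = 0) -> ThreadingHTTPServer:
    """Bind only to a loopback address; port 0 asks the OS for a free port."""
    if not ipaddress.ip_address(host).is_loopback:
        raise ValueError(f"refusing non-loopback bind address {host!r}")
    return ThreadingHTTPServer((host, port), StatusHandler)
```

Checking the address before constructing the server means a misconfigured `0.0.0.0` fails loudly at startup instead of silently exposing the host's keys and transcripts to the network.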

This is an opinionated stance. The product's credibility with hosts who treat their research as sensitive IP depends on the system behaving as a local tool, not a cloud tool with a local UI.

---

## Architectural Invariants

A small number of rules hold across every layer. These are locked in the project's decision log.

- **Provider-agnostic at every layer.** No layer's public interface mentions a specific vendor. Swapping transcription, search, or LLM is an adapter change, not a refactor.
- **Latency budgets are enforced at the system level.** Each agent has a locked first-token and full-response target. Regressions are a release-blocking issue, not a tuning note.
- **Grounding is mandatory.** Agent responses cite their transcript window and retrieval context. An agent that cannot ground its answer returns a structured non-answer rather than a plausible guess.
- **Bi-directional knowledge base.** Everything the host ingests feeds the retrieval layer. Everything the system produces feeds back into the knowledge base. The host owns both ends.
- **v0.1 is Windows-only.** macOS and Linux ports are deferred to post-release. This is a deliberate scope choice, not a technical limitation.
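A latency budget is only enforceable if first-token and full-response times are measured the same way on every run. This sketch times any iterable of streamed chunks with an injectable clock, so it can be tested deterministically. `Budget`, `measure`, and `check` are hypothetical names, and no budget values are shown because the actual targets are deliberately not published.

```python
import math
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class Budget:
    first_token_s: float
    full_response_s: float


@dataclass(frozen=True)
class Measurement:
    first_token_s: float
    full_response_s: float


def measure(stream: Iterable[str], clock: Callable[[], float] = time.perf_counter) -> Measurement:
    """Time a streamed response: delay to the first chunk and to completion."""
    start = clock()
    first: Optional[float] = None
    for _chunk in stream:
        if first is None:
            first = clock() - start
    total = clock() - start
    # An empty stream never produced a first token, so count it as a miss.
    return Measurement(first if first is not None else math.inf, total)


def check(m: Measurement, b: Budget) -> List[str]:
    """Return human-readable budget violations; an empty list means pass."""
    failures = []
    if m.first_token_s > b.first_token_s:
        failures.append(f"first token {m.first_token_s:.3f}s > {b.first_token_s:.3f}s")
    if m.full_response_s > b.full_response_s:
        failures.append(f"full response {m.full_response_s:.3f}s > {b.full_response_s:.3f}s")
    return failures
```

In a CI job, a non-empty result from `check` for any agent would fail the build, which is one way to make a latency regression release-blocking rather than a tuning note.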

---

## What v0.1 Ships With

- The four agents (Fact-Check, Counterpoint, Follow-Up, Guest Research).
- The pre-show dossier builder.
- The knowledge base with bi-directional ingest.
- Dashboard, live session view, and OBS browser source surfaces.
- A demo video as part of the release gate.
- Git-clone-and-run distribution. No installer in v0.1.

---

## What v0.1 Explicitly Does Not Ship

- A browser extension. Deferred to v0.2.
- Packaged installers (.exe, .dmg). Deferred until the release pipeline justifies them.
- macOS or Linux support. Community contributions welcomed after v0.1.
- Multi-host or multi-session orchestration. Single-host, single-session is the v0.1 target.
- A hosted cloud tier. Local-first is a product stance, not a staging point.

---

## Current Status

The project is mid-build against a phased, autonomous task plan. Snapshot of where the work sits:

- **Architecture and decision log:** Locked. Tech stack choices are committed, with recorded escape valves for any layer that fails validation.
- **Agent harness and four agents:** Implemented against the locked interface. Grounded response pipeline wired through to the retrieval layer.
- **Pre-show dossier builder:** Implementation complete. Structured dossier artifacts land in the knowledge base and attach cleanly to sessions.
- **Knowledge base:** Embedded vector store operational. Bi-directional ingest path exercised.
- **Dashboard:** Settings editing and save, dossier CRM form with auto-research, KB file explorer, and session rating loop are live. Agent composition chains and UI polish have landed.
- **Live session view:** Scaffolded and rendering streamed agent output. Hotkey integration in place.
- **OBS browser source:** Route live; overlay styling pass in progress.
- **Observability, tone learning, error-recovery E2E:** Scoped; activation of the CI latency-regression gate lands in this phase.
- **Release:** Pending. A demo video is part of the definition of done.

The remaining work concentrates on final polish for the surfaces, end-to-end enforcement of the latency gate, and assembly of the release artifacts.

---

## Licensing

PodcastIQ ships under **AGPL-3.0 with a DCO sign-off workflow**. This preserves a commercial dual-license pathway while keeping the reference implementation open for the host community. There is no CLA.

---

## What This Document Intentionally Omits

Internal provider rankings, per-agent prompt design, specific latency benchmarks, the exact escape-valve logic per layer, the model-routing policy for sub-agent orchestration during the build, and the detailed phased task plan. Those belong in the project's private decision log and are not part of the public concept surface.