Conversation lifecycle

This page walks one conversation end to end — from "the session is created" to "the outputs are delivered." Read it once after you've integrated, when you want a clear picture of how the pieces fit.

The five phases

1. Session creation     →  the conversation has an id; no one has joined yet
2. Join + prewarm       →  the participant lands on the URL; the agent gets ready
3. Greeting + bridge    →  the first two utterances; the agent is warming the cache
4. The conversation     →  the agent pursues goals; data is extracted; outputs fire
5. Wrap-up + outputs    →  the conversation ends; final outputs deliver

1. Session creation

A session is the runtime instance of a template. Two ways to create one:

  • Embed key from the browser. The embed calls Roxels with an rk_ key + template id. The session is created and the browser gets a session id.
  • API call from your backend. POST /v1/interviews with template_id and persons. Your backend gets a session id and a join URL. You either share the URL or pass the session id to the embed.

At this point, the session has:

  • An id (sess_…).
  • A bound template version (the version live at creation time — editing the template later won't affect this session).
  • The participant identity, if you provided one.
  • The conversation context, if you provided any.

No agent, no LiveKit room, no microphone yet.
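
As a minimal sketch of the backend path above, the request could be assembled like this in Python. Only POST /v1/interviews, template_id, and persons come from this page; the base URL, the bearer-token header, and the fields inside each persons entry are assumptions for illustration.

```python
import json
import urllib.request

# Hypothetical base URL; substitute the real API host.
API_BASE = "https://api.roxels.example"


def build_create_request(api_key: str, template_id: str, persons: list) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/interviews request."""
    body = json.dumps({"template_id": template_id, "persons": persons}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/v1/interviews",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumed auth scheme; check the API reference.
            "Authorization": f"Bearer {api_key}",
        },
    )


req = build_create_request(
    api_key="YOUR_API_KEY",
    template_id="YOUR_TEMPLATE_ID",
    persons=[{"name": "Ada"}],  # field names inside each person are assumed
)

# Sending is left out so the sketch has no side effects:
# with urllib.request.urlopen(req) as resp:
#     created = json.load(resp)  # described above as carrying a session id and a join URL
```

Building the request separately from sending it keeps the payload easy to inspect and log before anything goes over the wire.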

2. Join + prewarm

When the embed mounts or the participant visits the join URL:

  1. The iframe loads (/embed/<sessionId>).
  2. Status: loading — the iframe is hydrating.
  3. Status: ready — the LiveKit room credentials are fetched; we're ready to join.
  4. Status: joining — the LiveKit connection is being established.
  5. Status: connected — the call is live.

During loading and ready, the system prewarms the agent's working brief. The full system prompt (template instructions, archetype, skills, advisor briefs, the global system prompt) is large, and first-token latency is significantly lower when that prompt is already cached. Prewarming gets the cache hot before the user speaks.
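
One way to reason about the join statuses in your own code is as a forward-only sequence. The sketch below uses the four status strings from the list above; the Status enum and both helper functions are hypothetical names, not part of any Roxels SDK.

```python
from enum import Enum
from typing import Optional


class Status(Enum):
    LOADING = "loading"
    READY = "ready"
    JOINING = "joining"
    CONNECTED = "connected"


# Forward-only order of the join statuses listed above.
JOIN_ORDER = [Status.LOADING, Status.READY, Status.JOINING, Status.CONNECTED]


def next_status(current: Status) -> Optional[Status]:
    """Return the status that follows `current`, or None once connected."""
    i = JOIN_ORDER.index(current)
    return JOIN_ORDER[i + 1] if i + 1 < len(JOIN_ORDER) else None


def is_prewarm_window(status: Status) -> bool:
    """Prewarming is described as happening during loading and ready."""
    return status in (Status.LOADING, Status.READY)
```

A UI could use is_prewarm_window to decide how long to show a "getting ready" indicator, for example.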

3. Greeting + bridge

The first two agent utterances are deliberate:

  • Turn 0: the canned greeting. A short, hardcoded phrase ("Hi there!" / "Hello!"). Played the instant the call connects, with zero latency. Whoever's listening hears a friendly voice immediately — not the dead air of an LLM thinking.
  • Turn 1: the bridge. A short, LLM-generated pleasantry that responds to whatever the user said in reaction to the greeting ("Great, thanks for hopping on" / "Of course, take your time"). This runs through a small, fast model, not the full speaker LLM, so it's quick.

By the time turn 2 happens (the first "real" turn driven by the full system prompt), the cache is fully warm. The user has heard two friendly utterances and feels heard.

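This is why Roxels-led conversations don't feel like talking to a robot from the first second: the greeting and the bridge cover the time the full prompt needs to warm up, so the participant never notices the startup latency.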
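
The routing described above can be pictured as a small dispatch on the turn index. This is an illustrative sketch only: the function and the three labels it returns are hypothetical names, and the real system may decide this differently.

```python
def responder_for_turn(turn_index: int) -> str:
    """Pick which component produces the agent's utterance for a given turn."""
    if turn_index < 0:
        raise ValueError("turn_index must be non-negative")
    if turn_index == 0:
        # Hardcoded phrase, played the instant the call connects.
        return "canned_greeting"
    if turn_index == 1:
        # Short pleasantry from a small, fast model.
        return "bridge_model"
    # From turn 2 on, the full speaker LLM with the (now warm) system prompt.
    return "speaker_llm"
```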

4. The conversation

The agent now pursues goals.

Per-turn flow

For each turn:

  1. User speaks. STT (speech-to-text) transcribes in real time.
  2. EOU (end-of-utterance) detection. A small specialized model decides when the user has finished speaking, so the agent neither waits too long nor cuts the user off.
  3. Advisor concerns updated. Observer agents (advisors) read the latest turn and emit any concerns ("the user just said they're a team of 12 but no team_size was extracted").
  4. Speaker thinks. The speaker LLM receives the conversation history, the advisor concerns, and the current goal state. It decides what to say or which tool (skill) to call.
  5. Tools called (if any). If the speaker decided to invoke a skill, it runs; results feed back.
  6. Speaker responds. TTS (text-to-speech) streams audio back.
  7. Extraction runs. A separate process reads the turn and updates the working understanding of any in-progress goals.

These steps run quickly enough that the participant experiences a continuous conversation.
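
The ordering of steps 3 to 7 can be sketched as a single function with pluggable components. Everything here is a simplification under stated assumptions: steps 1 and 2 are taken as already done (the function receives the finalized user text), only one tool round is modeled, and the names run_turn, demo_advisor, demo_speaker, and demo_extract are hypothetical stand-ins for the real components.

```python
def run_turn(user_text, history, goal_state, advisors, speaker, skills, extract):
    """Run one turn after STT and end-of-utterance detection have finished."""
    history = history + [("user", user_text)]

    # Step 3: each advisor reads the latest turn and may emit concerns.
    concerns = [c for advisor in advisors for c in advisor(history)]

    # Step 4: the speaker sees history, advisor concerns, and goal state.
    decision = speaker(history, concerns, goal_state)

    # Step 5: if the speaker asked for a skill, run it and feed the result back.
    if decision.get("skill"):
        result = skills[decision["skill"]](decision.get("args", {}))
        history = history + [("tool", str(result))]
        decision = speaker(history, concerns, goal_state)

    # Step 6: in the real pipeline this reply would be streamed through TTS.
    reply = decision.get("say", "")
    history = history + [("agent", reply)]

    # Step 7: extraction updates the working understanding of in-progress goals.
    goal_state = extract(history, goal_state)
    return reply, history, goal_state, concerns


# Stub components, for illustration only.
def demo_advisor(history):
    last_text = history[-1][1]
    return ["team_size not extracted"] if "team of 12" in last_text else []


def demo_speaker(history, concerns, goal_state):
    return {"say": "Got it, a team of 12." if concerns else "Tell me more."}


def demo_extract(history, goal_state):
    return {**goal_state, "turns_seen": goal_state.get("turns_seen", 0) + 1}


reply, history, state, concerns = run_turn(
    "We're a team of 12", [], {}, [demo_advisor], demo_speaker, {}, demo_extract
)
```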

When a goal commits

The speaker decides a goal is satisfied. On commit:

  1. The goal data is finalized. The extraction layer's working understanding is committed to the conversation's findings.
  2. Outputs fire — all webhooks, frontend callbacks, and chained API calls configured for that goal. See Webhooks overview and Embed events.
  3. goal_transition fires — frontend event and (sometimes) advisor signal.
  4. The agent moves on — to the next goal in the phase, or to wrap-up if this was the last.
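
The four commit steps above, in order, might look like the following sketch. The function signature, the shape of the goal_transition payload, and the "wrap_up" sentinel are assumptions made for illustration; only the ordering and the event name come from this page.

```python
def commit_goal(goal_name, working, findings, outputs, emit, remaining_goals):
    """Commit one goal and return what the agent should pursue next."""
    # 1. Finalize: copy the working understanding into the findings.
    findings[goal_name] = dict(working)

    # 2. Fire every output configured for this goal.
    for output in outputs.get(goal_name, []):
        output(goal_name, findings[goal_name])

    # 3. Emit goal_transition (payload shape is assumed).
    next_goal = remaining_goals[0] if remaining_goals else None
    emit("goal_transition", {"from": goal_name, "to": next_goal})

    # 4. Move on to the next goal, or to wrap-up if this was the last.
    return next_goal if next_goal is not None else "wrap_up"


delivered, events, findings = [], [], {}
next_step = commit_goal(
    "team_size",
    {"value": 12},
    findings,
    outputs={"team_size": [lambda goal, data: delivered.append((goal, data))]},
    emit=lambda name, payload: events.append((name, payload)),
    remaining_goals=["budget"],
)
```

Note the ordering in this sketch: outputs are fired only after the data is finalized, so a webhook receiver never sees a half-extracted goal.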

Screen-share (if enabled)

If the template allows screen-share and the user starts sharing:

  1. The user's screen track flows to the iframe.
  2. The iframe captures frames at intervals and feeds them to the agent's vision input.
  3. The agent sees what the user sees and can reference it directly ("I can see the form on the right — try clicking the dropdown").

Screen-share is the path for walkthroughs (guided installs, software demos, form-fills). The agent narrates and accompanies; the user doesn't have to describe pixels to it.
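
Capturing frames "at intervals" amounts to throttling a stream of frames by time. The sketch below shows one simple way to do that; the function name and the 2-second interval in the usage line are hypothetical, and this page does not state the actual capture interval.

```python
def sample_frames(timestamps, interval_s):
    """Keep only frames that arrive at least `interval_s` after the last kept one."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    kept = []
    last_kept = None
    for t in timestamps:
        if last_kept is None or t - last_kept >= interval_s:
            kept.append(t)
            last_kept = t
    return kept


# Frame arrival times in seconds; the interval value is illustrative.
kept = sample_frames([0.0, 0.5, 1.0, 1.9, 2.0, 4.1], interval_s=2.0)
# kept == [0.0, 2.0, 4.1]
```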

5. Wrap-up + outputs

The conversation ends in one of two ways:

  • The agent decides it's done. Last phase complete, all critical goals committed; the agent closes with a goodbye.
  • The user (or your code) ends it. Roxels.close() or call.hangUp().

On end:

  1. Status: disconnecting, then ended. Frontend events fire.
  2. Final outputs run. Template-level outputs configured for the end-of-conversation lifecycle.
  3. The summary is generated. An LLM call produces a narrative summary of the conversation.
  4. The complete event fires with { summary, findings, duration_seconds, external_id, auto_close } — both as a frontend event (Events) and via the conversation-completion webhook (if configured).
  5. The iframe cleans up. Microphone released, LiveKit room left, audio stopped.
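
If you receive the complete event through the conversation-completion webhook, a small parser can validate it before your code relies on it. This sketch assumes the payload is a flat JSON object with exactly the five fields named above; the field types, the CompleteEvent class, and the sample values are assumptions.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, Optional

REQUIRED_FIELDS = ("summary", "findings", "duration_seconds", "external_id", "auto_close")


@dataclass
class CompleteEvent:
    summary: str
    findings: Dict[str, Any]   # assumed to be a JSON object
    duration_seconds: float    # assumed numeric
    external_id: Optional[str]  # assumed nullable
    auto_close: bool           # assumed boolean


def parse_complete(raw: str) -> CompleteEvent:
    """Parse a complete-event payload, failing loudly if a field is missing."""
    data = json.loads(raw)
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"complete payload missing fields: {missing}")
    return CompleteEvent(**{name: data[name] for name in REQUIRED_FIELDS})


# Illustrative payload, not real output.
sample = json.dumps({
    "summary": "Discussed onboarding for a team of 12.",
    "findings": {"team_size": 12},
    "duration_seconds": 431,
    "external_id": None,
    "auto_close": True,
})
event = parse_complete(sample)
```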

The conversation now lives in your dashboard with full transcript, all findings, all outputs and their delivery status, and any frames captured during screen-share.

What you can access afterward

For any past conversation:

  • In the dashboard — full transcript, findings, summary, output delivery log, frame thumbnails.
  • Via the REST API: GET /v1/interviews/iv_….
  • Via the MCP package — ask the assistant to retrieve and explain it.
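
For the REST route, a retrieval request could be built as follows. The base URL and the bearer-token header are assumptions, and "iv_EXAMPLE" is a placeholder rather than a real id format; only the GET /v1/interviews/ path prefix comes from this page.

```python
import urllib.parse
import urllib.request

# Hypothetical base URL; substitute the real API host.
API_BASE = "https://api.roxels.example"


def build_fetch_request(api_key: str, interview_id: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for one past conversation."""
    # Percent-encode the id so it is always a single path segment.
    path_id = urllib.parse.quote(interview_id, safe="")
    return urllib.request.Request(
        url=f"{API_BASE}/v1/interviews/{path_id}",
        method="GET",
        headers={"Authorization": f"Bearer {api_key}"},  # assumed auth scheme
    )


fetch_req = build_fetch_request("YOUR_API_KEY", "iv_EXAMPLE")
```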

Where the cost goes

Roughly:

  • The largest costs are the LLM calls — the speaker, the extraction layer, the advisors. These compound across turns.
  • Smaller costs are STT, TTS, and the one-off LLM calls that run outside the per-turn loop, such as summary generation at wrap-up.
  • Negligible costs at this scale are the LiveKit transport, the iframe, and frame capture.

Templates with many advisors, deep follow-up, and long max-duration cost more per conversation. Templates with focused goals and short conversations cost less.
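
To see why per-turn LLM calls dominate, it can help to count them. The sketch below is a back-of-the-envelope model, not Roxels' billing logic: it assumes one speaker call, one extraction call, and one call per advisor on every turn, plus a fixed number of one-off calls (such as the summary), and it ignores extra speaker calls caused by tool use.

```python
def llm_calls_per_conversation(turns: int, advisors: int, one_off_calls: int = 1) -> int:
    """Estimate total LLM calls under the simplifying assumptions above."""
    if turns < 0 or advisors < 0 or one_off_calls < 0:
        raise ValueError("all counts must be non-negative")
    per_turn = 1 + 1 + advisors  # speaker + extraction + one call per advisor
    return turns * per_turn + one_off_calls


focused = llm_calls_per_conversation(turns=20, advisors=0)  # 41
heavy = llm_calls_per_conversation(turns=20, advisors=3)    # 101
```

Under these assumptions, adding three advisors to a 20-turn conversation raises the call count from 41 to 101, which is the compounding effect described above.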

How "interview" and "session" relate

The terms interview, session, and conversation describe slightly different things in the code, but in practice they're aliases:

  • Conversation is the user-facing term.
  • Interview is the older identifier in some API paths and DB fields (e.g. iv_…, /v1/interviews).
  • Session is the runtime instance — same conversation, but emphasizes the live LiveKit room.

For most purposes, treat them as the same thing. New external code should say "conversation."