Augmi Team

Docker rollout audit + managed LLM billing hardening kickoff

Audited Docker/OpenClaw rollout blockers and started hardening managed LLM credit billing with a full flow audit and an atomic, idempotent DB migration.

Dev Log - March 2, 2026

Docker image state and expected package version

  • Confirmed infrastructure/docker/Dockerfile is pinned to openclaw@2026.2.26 (RUN npm install -g openclaw@2026.2.26), so the intended container baseline is already set in source.
  • Confirmed recent infra commits touching Docker startup/runtime behavior were already present before this run:
    • 1f5f1c82 (infrastructure/docker/Dockerfile, infrastructure/docker/start.sh)
    • 2173e061 (infrastructure/docker/start.sh)
  • Verified this incident is not caused by a missing version pin in the Dockerfile.
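The pin check above could be automated along these lines. This is a minimal sketch, not code from the repo: pinnedNpmVersion is a hypothetical helper, and it assumes the Dockerfile installs the package with a single "npm install -g <pkg>@<version>" command as quoted above.

```typescript
// Hypothetical helper (not from the repo): extract the pinned version of a
// globally installed npm package from Dockerfile text.
// Returns null when the package is absent or installed without a version pin.
function pinnedNpmVersion(dockerfile: string, pkg: string): string | null {
  // Escape regex metacharacters in the package name before interpolating it.
  const escaped = pkg.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const pattern = new RegExp(
    `npm\\s+install\\s+(?:-g|--global)\\s+${escaped}@([0-9][\\w.\\-]*)`
  );
  const match = pattern.exec(dockerfile);
  return match?.[1] ?? null;
}

// The RUN line quoted in this log.
const dockerfileLine = "RUN npm install -g openclaw@2026.2.26";
console.log(pinnedNpmVersion(dockerfileLine, "openclaw"));
```

A check like this only confirms what the source says. It says nothing about which image a machine is actually running, which is the gap the rest of this log is about.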

Build and deploy attempts executed

  • Ran ./infrastructure/fly/build-and-deploy.sh:
    • OpenClaw config contract tests passed.
    • Local Docker build completed.
    • Registry push repeatedly stalled in a retry loop on a single layer, so the rollout never completed cleanly.
  • Ran flyctl deploy --remote-only (from Docker context):
    • Remote build progressed and compiled openclaw@2026.2.26.
    • Export/publish failed with “unauthenticated: Invalid token” on the Fly/Depot path.
  • Attempted explicit machine updates to registry.fly.io/hexly-sandboxes:latest; the update calls succeeded, but the latest tag still resolved to an older runtime for at least one target machine.

Machine-level rollout results (requested IDs)

  • 890490a6d27338:
    • Verified openclaw --version = 2026.2.26 (updated and healthy).
  • 78460ddb445428:
    • Repeated install attempts were unstable: the global npm install kept stalling or being interrupted.
    • After restarts and rechecks, final verified openclaw --version remained 2026.2.15.
    • Observed partial package states during retries (missing openclaw.mjs during interrupted install windows), which explains transient “cannot run commands” behavior.
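The per-machine check above amounts to comparing each reported version against the pinned one. A minimal sketch of that comparison follows. The MachineVersion shape and findDrift name are assumptions made for illustration, and the versions are the ones recorded in this log, entered by hand rather than fetched from Fly.

```typescript
// Hypothetical shape: one entry per machine, holding whatever
// "openclaw --version" reported. null stands for "the CLI could not run",
// e.g. during a partially installed state.
interface MachineVersion {
  machineId: string;
  reportedVersion: string | null;
}

// Return every machine whose reported version differs from the expected pin.
function findDrift(expected: string, machines: MachineVersion[]): MachineVersion[] {
  return machines.filter((m) => m.reportedVersion !== expected);
}

// Values recorded in this log.
const observed: MachineVersion[] = [
  { machineId: "890490a6d27338", reportedVersion: "2026.2.26" },
  { machineId: "78460ddb445428", reportedVersion: "2026.2.15" },
];

for (const m of findDrift("2026.2.26", observed)) {
  console.log(`${m.machineId} is on ${m.reportedVersion ?? "an unrunnable install"}`);
}
```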

Operational diagnosis recorded

  • Root blocker is release-path reliability, not source config:
    • Local push path: layer retry loops while publishing to Fly registry.
    • Remote path: Depot/Fly auth token issue during export.
  • Because new image publication is unreliable, machine updates from :latest can still pull an older working image, producing mixed OpenClaw versions across instances.
  • Tool availability symptoms (“cannot fetch websites”, “cannot run commands”, inconsistent env usage) can appear when machines run mixed runtime versions or partially installed CLI binaries.

Managed LLM billing audit and hardening kickoff

  • Performed an end-to-end audit of managed LLM call flow and credit deduction timing across:
    • app/api/llm/v1/chat/completions/route.ts
    • lib/flyio-machine-manager.ts
    • infrastructure/docker/start.sh
    • supabase/migrations/*managed_billing*, *credit_system_overhaul*, *zero_overdraft*
  • Documented sequence diagrams and correctness findings in docs/diagrams/llm-credit-billing-audit-2026-02-25.md.
  • Confirmed current behavior:
    • Cost is measured from provider usage after response (or stream completion).
    • Deduction happens post-response via proxy route.
    • Current DB policy is zero-overdraft (deductions that would go negative are rejected).
  • Identified gaps for managed proxy billing:
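    • No idempotency constraint on rows with source IN ('proxy','proxy_stream'), so a retried request could be recorded and billed more than once.
    • The usage insertion and deduction path is non-atomic, so a failure between the two steps can leave the ledger and the balance out of sync.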
  • Implemented DB hardening migration supabase/migrations/20260302_proxy_billing_atomic_idempotency.sql:
    • Deduplicates historical proxy rows by (request_id, source).
    • Adds partial unique index for proxy idempotency.
    • Adds record_proxy_usage_and_deduct_credits(...) RPC to atomically deduct balance and insert usage row in one transaction.
    • Handles unique-violation races as idempotent duplicates.
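The intended semantics of the hardened path (one deduction per (request_id, source), zero overdraft, balance and usage row changing together) can be modeled in memory. The sketch below illustrates those rules only. It is not the SQL in the migration, and every name in it (CreditLedger, recordUsageAndDeduct, the result statuses) is hypothetical, since the real RPC signature is not shown in this log.

```typescript
type ProxySource = "proxy" | "proxy_stream";

interface UsageRow {
  requestId: string;
  source: ProxySource;
  costCredits: number; // whole credits, to keep the arithmetic exact
}

type DeductResult =
  | { status: "applied"; balance: number }
  | { status: "duplicate"; balance: number }
  | { status: "rejected_insufficient"; balance: number };

// In-memory stand-in for the balance plus the usage ledger.
class CreditLedger {
  private balance: number;
  // Keyed by (requestId, source): plays the role of the partial unique index.
  private readonly usage = new Map<string, UsageRow>();

  constructor(initialBalance: number) {
    this.balance = initialBalance;
  }

  recordUsageAndDeduct(row: UsageRow): DeductResult {
    const key = `${row.requestId}:${row.source}`;
    // A repeated (requestId, source) is treated as an idempotent duplicate.
    if (this.usage.has(key)) {
      return { status: "duplicate", balance: this.balance };
    }
    // Zero-overdraft policy: reject anything that would go negative.
    if (row.costCredits > this.balance) {
      return { status: "rejected_insufficient", balance: this.balance };
    }
    // Both changes happen together or not at all. In the database this is
    // what running them in a single transaction provides.
    this.balance -= row.costCredits;
    this.usage.set(key, row);
    return { status: "applied", balance: this.balance };
  }

  usageCount(): number {
    return this.usage.size;
  }
}

const ledger = new CreditLedger(100);
const first = ledger.recordUsageAndDeduct({ requestId: "req-1", source: "proxy", costCredits: 30 });
const retry = ledger.recordUsageAndDeduct({ requestId: "req-1", source: "proxy", costCredits: 30 });
console.log(first.status, retry.status, retry.balance);
```

In this model a retried request leaves the balance untouched, and a rejected deduction writes no usage row, so the ledger and the balance cannot diverge.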

Decision Points

  • Temporary BYOK-only UX policy: hide managed LLM controls in the Settings UI while backend hardening is in progress.
  • Implemented as a UI gate in app/settings/page.tsx (MANAGED_LLM_UI_ENABLED = false) so the code path remains in repo but is not accessible in Settings.
  • Scope intentionally limited to UI visibility (no backend deletion) to allow controlled re-enable after billing/idempotency verification is complete.
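A UI-only gate of this kind might look like the following. Only the MANAGED_LLM_UI_ENABLED constant comes from this log. The SettingsSection shape, the section ids, and visibleSections are hypothetical and not taken from app/settings/page.tsx.

```typescript
// Flag named in this log; flipping it to true re-enables the managed controls.
const MANAGED_LLM_UI_ENABLED = false;

// Hypothetical description of a Settings section.
interface SettingsSection {
  id: string;
  requiresManagedLlm: boolean;
}

// Hide managed-LLM sections while the flag is off; everything else stays visible.
// Nothing is deleted: the gated sections remain defined and are only filtered out.
function visibleSections(
  sections: SettingsSection[],
  managedEnabled: boolean = MANAGED_LLM_UI_ENABLED
): SettingsSection[] {
  return sections.filter((s) => managedEnabled || !s.requiresManagedLlm);
}

const allSections: SettingsSection[] = [
  { id: "byok-keys", requiresManagedLlm: false },
  { id: "managed-llm-credits", requiresManagedLlm: true },
];

console.log(visibleSections(allSections).map((s) => s.id));
```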

Stats

  • 2 commits in infra Docker/Fly scope since 2026-03-01 (pre-existing in repo).
  • ~101 insertions / ~16 deletions across 3 infra files from those commits.
  • Key areas: Docker/OpenClaw pin verification, Fly build/deploy reliability, per-machine version validation.