Our vision for the future of Tropic: agent observability, expanded integrations, chaos testing, security hardening, and standards interoperability.
Tropic launched with a focused mission: make it safe and simple to run AI agents in production. Isolated VMs, encrypted credentials, Sondera security policies, WhatsApp integration. That foundation is solid and shipping today.
But we're just getting started. This roadmap lays out where we're headed, from security research that doesn't exist anywhere else, to practical integrations that our users are asking for, to experimental concepts that could change how we think about agent reliability.
Everything here is subject to change. Priorities shift as we learn from users and as the ecosystem evolves. Items marked Research are explorations with no committed timeline. Items marked In Progress are actively being built.
Our most critical near-term deliverable. We're writing a plain-language whitepaper that explains how Tropic secures OpenClaw agents, without requiring the reader to understand cryptography, networking, or infrastructure.
The whitepaper covers: how agent VMs are isolated, how credentials are encrypted at rest and in transit, what Sondera policies enforce, how WhatsApp message routing works, and what happens when an agent goes rogue. The goal is to give compliance teams, founders, and non-technical stakeholders a clear picture of the security posture without drowning them in jargon.
This document will live on our docs site and be available as a downloadable PDF.
Cedar policy enforcement is live. Tropic now compiles agent guardrails to Cedar, Amazon's open-source policy language, and uses it as the underlying policy engine. Cedar policies are formally verifiable, support hierarchical resource models, and can express complex authorization logic that plain-text rules cannot.
On top of Cedar, we ship pre-built policy packs mapped to established security frameworks:
MITRE ATT&CK for Enterprise. Policies that detect and block agent behaviours matching known adversary techniques, including command injection via shell tools, credential access through file reads, lateral movement via network calls, and data exfiltration through outbound HTTP. Each policy references the specific MITRE technique ID (e.g. T1059 Command and Scripting Interpreter).
NIST SP 800-53 Controls. Mappings to relevant NIST controls for access control (AC), audit (AU), system protection (SC), and incident response (IR), giving enterprises a compliance-ready posture out of the box.
OWASP Top 10 for LLM Applications. Policies addressing prompt injection (LLM01), insecure output handling (LLM02), training data poisoning (LLM03), excessive agency (LLM08), and overreliance (LLM09). We also track the latest OWASP AI Security and Privacy Guide research for additional coverage.
Users can browse, toggle, and customize these policy packs from the Tropic dashboard. No need to write Cedar by hand unless you want to.
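To make the pack model concrete, here is a minimal Python sketch of how toggled packs might reduce to a single decision for a tool call. The pack names, the `Rule` type, and the "most restrictive wins" ordering are assumptions for illustration; the real engine evaluates Cedar policies, not Python. Falling back to ALLOW when nothing matches is a simplification of this sketch (Cedar itself denies unless a permit policy matches).

```python
from dataclasses import dataclass
from enum import IntEnum


class Decision(IntEnum):
    # Ordered so that max() picks the most restrictive outcome.
    ALLOW = 0
    REQUIRE_CONFIRM = 1
    DENY = 2


@dataclass(frozen=True)
class Rule:
    pack: str        # hypothetical pack name, e.g. "mitre-attack"
    reference: str   # framework ID the rule maps to, e.g. "T1059"
    tool: str        # hypothetical tool name the rule applies to
    decision: Decision


# Illustrative rules only; not the contents of any shipped pack.
RULES = [
    Rule("mitre-attack", "T1059", "shell", Decision.DENY),
    Rule("owasp-llm", "LLM08", "http_post", Decision.REQUIRE_CONFIRM),
]


def evaluate(tool: str, enabled_packs: set[str]) -> Decision:
    """Return the most restrictive decision among matching rules in enabled packs."""
    matches = [
        r.decision for r in RULES if r.tool == tool and r.pack in enabled_packs
    ]
    return max(matches, default=Decision.ALLOW)


print(evaluate("shell", {"mitre-attack"}).name)
```

Toggling a pack off simply removes its rules from consideration, which is why the same `shell` call is allowed in this sketch when only the hypothetical `owasp-llm` pack is enabled.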
OpenClaw and the skills it loads pick up secrets from a .env file on the instance because that is what the runtime expects. That means tokens, API keys, and webhook URLs sit in plain text alongside the agent code, and rotating any one of them is a manual file edit on every VM.
OneCLI Vault is a Tropic-managed secret store. Secrets are written once in the dashboard, stored encrypted, and injected into the OpenClaw process at start time without ever landing on disk. The .env file on the VM becomes a thin pointer to the vault, scrubbed of real values.
Rotation, revocation, and per-skill scoping become dashboard operations, with every read recorded in the audit log. The same primitive will back skill-level secrets in the marketplace so paid skills can ship with their own keys without leaking them to the agent process.
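The "injected at start time without landing on disk" idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the OneCLI Vault implementation: `fetch_secrets` is a hypothetical stand-in for a vault read, and the `mailer` skill and `SMTP_TOKEN` name are made up.

```python
import os
import subprocess
import sys


def fetch_secrets(skill: str) -> dict[str, str]:
    """Hypothetical stand-in for a vault read; a real client would call the vault API."""
    vault = {"mailer": {"SMTP_TOKEN": "s3cr3t"}}
    return dict(vault.get(skill, {}))


def launch_with_secrets(cmd: list[str], skill: str) -> subprocess.CompletedProcess:
    # Secrets are merged into an in-memory copy of the environment and handed
    # to the child process directly; nothing is written to a .env file.
    env = {**os.environ, **fetch_secrets(skill)}
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)


# The child process sees the secret in its environment; here it only reports the length.
result = launch_with_secrets(
    [sys.executable, "-c", "import os; print(len(os.environ['SMTP_TOKEN']))"],
    skill="mailer",
)
print(result.stdout.strip())
```

Because the secret only exists in the child's environment, rotating it means restarting the process with a fresh vault read rather than editing a file on every VM.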
A fundamental problem with LLM-based security: are your controls actually deterministic? If you tell an agent "never execute rm -rf /", does it obey 100% of the time? 99.9%? 98%? The answer depends on input length, model choice, prompt position, and a dozen other variables.
We're building a testing framework that systematically evaluates security control adherence across:
Input length variation. Do long prompts cause the agent to "forget" its security rules? We test with inputs from 100 tokens to 128k tokens, measuring policy compliance at each threshold.
Multi-model evaluation. The same security policy may hold perfectly on Claude Sonnet but fail on Haiku. We evaluate across every supported model to surface model-specific failure modes.
Adversarial prompting. Standard jailbreak taxonomies (role-playing, encoding tricks, multi-turn escalation) applied specifically to Tropic's security policies, not generic safety benchmarks.
The output is a determinism score per policy per model: a number that tells you "this security rule holds 99.7% of the time on Sonnet 4.6 with inputs under 32k tokens." This gives operators real data to make deployment decisions, rather than trusting vibes.
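As a rough sketch of how such a score could be computed, the Python below aggregates trial outcomes into a compliance rate per policy and model, restricted to inputs up to a token threshold. The `Trial` record, the policy and model names, and the sample data are all hypothetical; the actual framework's scoring may differ.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    policy: str
    model: str
    input_tokens: int
    complied: bool   # True if the agent obeyed the policy in this trial


def determinism_scores(trials, max_tokens):
    """Fraction of compliant trials per (policy, model) for inputs up to max_tokens."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for t in trials:
        if t.input_tokens > max_tokens:
            continue
        key = (t.policy, t.model)
        total[key] += 1
        passed[key] += t.complied
    return {key: passed[key] / total[key] for key in total}


# Made-up trials: compliance degrades as the input grows.
trials = [
    Trial("no-rm-rf", "model-a", 100, True),
    Trial("no-rm-rf", "model-a", 1_000, True),
    Trial("no-rm-rf", "model-a", 20_000, False),
    Trial("no-rm-rf", "model-a", 100_000, False),
]
scores = determinism_scores(trials, max_tokens=32_000)
print(scores)
```

Running the same function at several thresholds would show where compliance starts to fall off for a given model.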
Every Tropic VM now ships with a telemetry plugin that captures agent activity in real time. The plugin hooks into OpenClaw's event system and records tool calls, message content (with preview snippets), token usage, response latency, and source channel (web, WhatsApp, API).
The Tropic dashboard includes a dedicated Logs page with filterable telemetry events per instance. You can filter by source channel, search by content, and see conversation flow with request/response previews. Events are stored in Supabase with indexed fields for fast querying.
OpenClaw's built-in Control UI is also exposed through our proxy layer, giving you a per-instance real-time view of active sessions, tool calls, and message history without any additional setup.
Next up: cross-instance aggregation, anomaly detection, and hallucination measurement research.
Audit logging is live. Every Tropic VM now runs Fluentd alongside auditd, capturing two log streams: OpenClaw gateway activity (tool calls, messages, sessions) and system-level audit events (process execution, network connections, file access, permission changes).
Logs are compressed and shipped to S3 with object lock (compliance mode, 90-day default retention). An index sync service registers each chunk in Supabase with metadata (source, time range, entry count, size), enabling fast filtered queries without scanning S3 directly.
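The value of the index is that a query only touches the chunks whose metadata matches. A minimal Python sketch of that lookup follows; the `ChunkIndex` fields mirror the metadata listed above, but the field names, object keys, and source labels are assumptions, and a real query would run against the database rather than an in-memory list.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ChunkIndex:
    s3_key: str        # hypothetical object key
    source: str        # "gateway" or "auditd" in this sketch
    start: datetime    # first entry in the chunk
    end: datetime      # last entry in the chunk
    entry_count: int
    size_bytes: int


def chunks_to_fetch(index, source, query_start, query_end):
    """Return keys of chunks from `source` whose time range overlaps the query window."""
    return [
        c.s3_key
        for c in index
        if c.source == source and c.start < query_end and c.end > query_start
    ]


def utc(hour, minute=0):
    return datetime(2026, 1, 1, hour, minute, tzinfo=timezone.utc)


index = [
    ChunkIndex("chunk-001.gz", "gateway", utc(0), utc(1), 120, 4096),
    ChunkIndex("chunk-002.gz", "auditd", utc(0), utc(1), 300, 8192),
    ChunkIndex("chunk-003.gz", "gateway", utc(1), utc(2), 95, 3072),
]
keys = chunks_to_fetch(index, "gateway", utc(0, 30), utc(0, 45))
print(keys)
```

Only the matching chunk would then be downloaded and decompressed, which is what keeps filtered queries fast.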
The dashboard includes an Audit Logs section per instance with source filtering, date range selection, expandable log entries, and live tailing via Server-Sent Events. Live tail streams directly from Fluentd on the VM through our nginx proxy layer, with auto-scroll, pause/resume, and connection status indicators.
Retention is configurable per account: 90 days for the SME tier, up to 7 years for Enterprise. Audit logs are gated to SME and Enterprise tiers; live tailing is available to all tiers.
Credit monitoring is live. The credit deduction cron tracks per-instance compute costs and per-skill hourly costs from the marketplace. VMs are automatically stopped when credits hit zero, with proper state management.
The Metrics page in the dashboard now shows unified API and compute spend with breakdowns by model, time series, and per-session attribution. The benchmark system reports token usage and cost per scenario, giving visibility into per-conversation cost across different models and workloads.
Further refinements such as spend threshold alerts and exportable usage reports for accounting will continue to land incrementally.
OpenClaw version management is operational. Users can select their preferred OpenClaw version from a dropdown when provisioning a new VM, with versions automatically discovered from the npm registry. The Packer pipeline builds AMIs with the selected version pre-installed, configured, and hardened with SecureClaw.
AMI builds include compatibility checks: each new OpenClaw release is reviewed for breaking changes (e.g. the plugins.allow requirement in 2026.3.12), and provisioning code is updated before AMIs are rebuilt. The build pipeline handles needrestart suppression, Fluentd/auditd installation, and security baseline hardening automatically.
A benchmark system provides regression detection: 10 scenarios across 3 iterations per run, with automated scoring and confidence ratings. This catches performance regressions and tool-call failures before new versions reach production.
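One simple way to turn per-iteration scores into a regression signal is to compare scenario means against a baseline run. The Python sketch below does that; the tolerance, the spread-based confidence label, and the scenario names are assumptions for illustration, not the benchmark system's actual scoring rules.

```python
from statistics import mean, pstdev


def find_regressions(baseline, candidate, tolerance=0.05):
    """Flag scenarios whose mean score dropped by more than `tolerance`.

    Both arguments map scenario name -> list of per-iteration scores in [0, 1].
    """
    regressions = {}
    for scenario, base_scores in baseline.items():
        drop = mean(base_scores) - mean(candidate[scenario])
        if drop > tolerance:
            # A small spread across iterations suggests the drop is not just noise.
            spread = pstdev(candidate[scenario])
            confidence = "high" if spread < tolerance else "low"
            regressions[scenario] = (round(drop, 3), confidence)
    return regressions


# Made-up scores: three iterations per scenario.
baseline = {"tool_call": [1.0, 1.0, 1.0], "summarize": [0.9, 0.9, 0.9]}
candidate = {"tool_call": [0.5, 0.5, 0.5], "summarize": [0.9, 0.9, 0.9]}
report = find_regressions(baseline, candidate)
print(report)
```

An empty report would let a version proceed; any flagged scenario is a reason to hold it back and look closer.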
Still in flight: a fuller automated integration test suite. We are extending coverage so every release runs end-to-end checks of agent deployment, credential sync, WhatsApp pairing, scheduling, and security policy enforcement before promotion. We need significantly more automated tests before calling this complete.
Tropic VMs run a pinned OpenClaw version chosen at provisioning time. Moving to a newer release today is a manual reprovision or a re-run of setup against the live VM. Most users never upgrade, which means they miss security fixes and new runtime features.
Auto Upgrade promotes a newly published OpenClaw build to existing instances on a rolling schedule, gated by the benchmark regression suite and opt-out controls per workspace. The rollout pauses automatically if benchmark scores drop or audit-log error rates spike.
Users keep full control: pin to a specific version, opt into nightly, or stay on the latest stable. Skill compatibility is verified against the new build before it lands on a customer VM.
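The rollout logic described above can be sketched as a batch loop with a health gate. This is a simplified illustration: the instance records, the `pinned` flag, and the `healthy` callback (standing in for benchmark and audit-log checks) are hypothetical.

```python
def rolling_upgrade(instances, batch_size, healthy):
    """Upgrade unpinned instances in batches; pause at the first unhealthy batch.

    `healthy` is a callable that receives the batch just upgraded and returns bool.
    Returns (upgraded_ids, remaining_ids).
    """
    upgraded = []
    pending = [i for i in instances if not i.get("pinned")]
    while pending:
        batch, pending = pending[:batch_size], pending[batch_size:]
        upgraded.extend(i["id"] for i in batch)
        if not healthy(batch):
            break  # pause the rollout; the rest stay on the old version
    return upgraded, [i["id"] for i in pending]


instances = [{"id": "a"}, {"id": "b", "pinned": True}, {"id": "c"}, {"id": "d"}]


def gate(batch):
    # Pretend instance "c" fails its post-upgrade checks.
    return all(i["id"] != "c" for i in batch)


upgraded, remaining = rolling_upgrade(instances, batch_size=1, healthy=gate)
print(upgraded, remaining)
```

Pinned instances never enter the queue, and a failed gate leaves the remaining instances untouched until someone investigates.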
Today, deploying an agent on Tropic requires you to write your own agent configuration: policy, tools, system prompt. That's powerful for advanced users but a barrier for everyone else.
We're building opinionated, ready-to-deploy agent templates for the most common use cases:
Marketing Agents. Content generation, social media scheduling, SEO analysis, competitor monitoring. Pre-configured with web browsing tools, content-safe policies, and brand-voice system prompts.
Research Agents. Literature review, data collection, report synthesis. Equipped with web search, PDF reading, structured output tools, and citation-tracking policies that prevent hallucinated references.
Coding Agents. Code review, bug triage, documentation generation, test writing. Pre-loaded with code execution tools, git integration, and security policies that prevent agents from pushing to production branches or accessing secrets.
Each template comes with a tuned Sondera security policy, recommended tools, and a getting-started guide. Deploy in one click, customize from there.
NanoClaw is a lightweight OpenClaw variant designed for resource-constrained environments and ultra-fast cold starts. It strips the heavier orchestration layers down to the essentials needed to run a guarded agent loop with policy enforcement and channel pairing.
Tropic's NanoClaw support is in beta. Provisioning, policy, telemetry, and channel pairing all work, but we are still hardening it across the supported model providers and tuning resource defaults. Expect bugs at the edges; we want feedback from early users while we iterate.
OpenClaw's native CRON is unsafe: it can schedule runs of any executable on the host, which is incompatible with Tropic's policy model. We are replacing it with a calendar-driven Tropic Scheduling layer.
Users create scheduled events in a calendar UI. Each event targets a specific agent and channel, with the trigger payload and policy enforced through the same Sondera/Cedar guardrails as live messages. Past runs surface in a popover with the agent's reply text and outcome.
The scheduling layer is in active production testing. We are validating delivery semantics, idempotency on retries, and the edge cases around timezone handling and DST transitions before promoting it out of beta.
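Two of those edge cases are easy to show in Python: a wall-clock time that maps to different UTC instants either side of a DST change, and a retry that must not deliver twice. The function names and the key format are assumptions for this sketch, and `zoneinfo` needs a time zone database on the host (or the `tzdata` package).

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def fire_time_utc(local_date, hour, tz_name):
    """Resolve a wall-clock hour on a given date to UTC, honouring DST."""
    year, month, day = local_date
    local = datetime(year, month, day, hour, tzinfo=ZoneInfo(tz_name))
    return local.astimezone(timezone.utc)


def idempotency_key(event_id, fire_at):
    # The same event and UTC fire time always produce the same key,
    # so a retried delivery can be recognised and skipped.
    return f"{event_id}:{fire_at.isoformat()}"


delivered = set()


def deliver_once(event_id, fire_at):
    """Return True on first delivery, False if this run was already delivered."""
    key = idempotency_key(event_id, fire_at)
    if key in delivered:
        return False
    delivered.add(key)
    return True


# UK clocks move forward on 29 March 2026, so 09:00 local shifts by an hour in UTC.
before = fire_time_utc((2026, 3, 28), 9, "Europe/London")
after = fire_time_utc((2026, 3, 29), 9, "Europe/London")
print(before.isoformat(), after.isoformat())
```

The sketch does not handle local times that are skipped or repeated during a transition; those need an explicit rule for which instant to pick.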
OpenClaw's built-in web search tool is great for read-only research but cannot handle authenticated browsing flows. Many real workflows require an agent to log in to a SaaS app, complete a multi-step form, or make a payment, none of which are safe to run through a raw browser controlled by the model.
Browserless Proxy is Tropic's replacement: a hardened, policy-aware browser sandbox that the agent drives through a constrained interface. The proxy mediates every action through Cedar policies, so behaviours like "log in to Gmail" or "complete checkout on Stripe" can be allowed, require-confirmed, or denied at the proxy layer instead of trusting the model.
The goal: full web interactivity (logins, multi-step forms, payments) without giving the agent unsupervised browser access. Planned for Q2 2026.
Hermes is an emerging open agent runtime focused on lightweight inter-agent messaging and delegation. We plan to add first-class Hermes support alongside OpenClaw and NanoClaw so Tropic can host whichever runtime best fits the workload.
The same Tropic primitives apply regardless of the underlying runtime: isolated VMs, encrypted credentials, Sondera/Cedar policies, channel pairing, telemetry, and audit logging. Planned for Q3 2026.
Today every skill in the Tropic marketplace is free. That keeps adoption frictionless but it leaves authors on the hook to maintain their work for nothing, and it caps the depth of what gets shipped.
Marketable Skills turns the marketplace into a two-sided economy. Authors set a per-install monthly price, Tropic handles Stripe billing and revenue share, and end users see paid skills alongside free ones with transparent pricing and ratings. Refund window, install caps, and revocation are all first-class.
Every paid upload goes through security validation (static analysis, manifest review, signed builds), so a paid listing carries higher provenance guarantees than a casual upload. Authors can also offer a free tier and gate advanced features behind paid plans.
Today, getting two Tropic agents to coordinate means piggybacking on a shared Telegram group or a Slack channel. The messages get squeezed into a human-readable format, structure is lost, and the agents end up parsing each other's natural-language replies.
Multi-Agent Communication is a native, typed message bus between agents in the same Tropic workspace. Agents publish and subscribe to topics, hand off tasks with structured payloads, and share memory at the workspace level so a long-running job can move between specialists without a context reset.
Cedar policies apply at the bus boundary: an agent's permission to talk to another agent (or read another agent's output) is governed by the same ALLOW / REQUIRE CONFIRM / DENY model as external tool calls. No more piggybacking on Telegram to glue agents together.
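A toy version of a typed bus with a policy check at the boundary might look like the Python below. The `TaskHandoff` payload, the `Bus` class, and the allow-list of sender/receiver pairs are assumptions for illustration; in Tropic the permission check would be a Cedar policy evaluation rather than a set lookup.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskHandoff:
    task_id: str
    objective: str
    payload: dict


class Bus:
    def __init__(self, allowed_pairs):
        # allowed_pairs: set of (sender, receiver) pairs the policy permits.
        self.allowed = allowed_pairs
        self.subscribers = defaultdict(list)   # topic -> [(agent, handler)]

    def subscribe(self, topic, agent, handler):
        self.subscribers[topic].append((agent, handler))

    def publish(self, topic, sender, message):
        """Deliver to permitted subscribers; return the agents that received it."""
        delivered = []
        for agent, handler in self.subscribers[topic]:
            if (sender, agent) in self.allowed:
                handler(message)
                delivered.append(agent)
        return delivered


inbox = []
bus = Bus(allowed_pairs={("research", "coder")})
bus.subscribe("tasks", "coder", inbox.append)
bus.subscribe("tasks", "marketing", lambda m: inbox.append("leak"))
received_by = bus.publish(
    "tasks", "research", TaskHandoff("t1", "fix failing test", {"repo": "demo"})
)
print(received_by)
```

The receiving agent gets a structured object with named fields instead of a chat message it has to parse, and the unpermitted subscriber never sees it.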
Inspired by Netflix's Chaos Monkey, ChaosLobster is our concept for systematically stress-testing AI agents before they hit production. The premise: if your agent can't handle degraded conditions gracefully, you'll find out in testing, not from an angry customer.
ChaosLobster injects controlled failures across multiple dimensions:
OS-level chaos. Disk pressure, memory limits, CPU throttling, file permission changes. Does your agent handle "disk full" errors, or does it silently corrupt data?
Network chaos. Latency injection, packet loss, DNS failures, connection resets. How does your agent behave when an API call takes 30 seconds instead of 300ms?
Infrastructure chaos. Instance reboots, credential rotation mid-session, service discovery failures. Can your agent recover from a restart, or does it lose all context?
API chaos. Rate limiting, 500 errors, malformed responses, schema changes. Does your agent retry intelligently, or does it spiral into an error loop?
Database chaos. Connection pool exhaustion, slow queries, read replica lag, constraint violations. How does your agent handle a database that's technically up but practically unusable?
Each chaos scenario produces a resilience score and a detailed failure report. The vision: run ChaosLobster against your agent, get a production-readiness certificate, and know exactly where your weak points are.
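To illustrate the idea at the smallest possible scale, the Python below injects random connection failures into a tool call and measures how often the call still completes. Everything here is a hypothetical sketch: the wrapper names, the failure model, and the definition of the score as a plain success fraction are ours, since ChaosLobster is still a concept.

```python
import random


def flaky(call, failure_rate, rng):
    """Wrap `call` so it raises ConnectionError with probability `failure_rate`."""
    def wrapped(*args):
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")
        return call(*args)
    return wrapped


def with_retries(call, attempts):
    """Retry `call` on ConnectionError, re-raising after the final attempt."""
    def wrapped(*args):
        for attempt in range(attempts):
            try:
                return call(*args)
            except ConnectionError:
                if attempt == attempts - 1:
                    raise
    return wrapped


def resilience_score(call, runs):
    """Fraction of runs that complete without raising."""
    ok = 0
    for i in range(runs):
        try:
            call(i)
            ok += 1
        except ConnectionError:
            pass
    return ok / runs


def tool(x):
    return x * 2


rng = random.Random(0)
naive = resilience_score(flaky(tool, 0.5, rng), runs=200)
hardened = resilience_score(with_retries(flaky(tool, 0.5, rng), attempts=3), runs=200)
print(naive, hardened)
```

Comparing the two scores shows what a retry policy buys under a given fault rate; a fuller harness would add latency, malformed responses, and the other dimensions listed above.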
The current way to launch an agent on Tropic is the traditional platform flow: pick a template, configure policy, attach skills, hook up a channel, deploy. That works for Enterprise users with operations teams, but it is too much friction for everyone else.
Chat-based spawning flips it around. You type your objective in plain language ("monitor my email for invoices and forward anything over $500 to my accountant on WhatsApp"), and Tropic picks the right agents, skills, and channel bindings, then asks one or two clarifying questions before it goes live.
The platform UI stays available for power users who want fine-grained control. The chat surface is the default path for anyone who just wants the outcome.
Today, attaching skills or moving agents between Claw instances is a navigation exercise: dashboards, drawers, dropdowns. It works, but it requires familiarity with the platform layout, and that's a barrier we want to remove.
We are designing a drag-and-drop canvas where agents and skills are visual tiles you can drop onto a Claw instance to attach them, drag between instances to move them, and rearrange to express composition. The result: the entire mental model of "what is this agent made of, and where does it live" becomes visible without training.
The traditional platform UI continues to ship for Enterprise users who prefer it. Drag-and-drop is the new default for individual builders and small teams.
The "pause and ask a human" step in an agent workflow is currently hacked together: agents DM a Slack channel, watch for a thumbs-up emoji, then resume. It works, but the approval state lives in Slack, not in Tropic, and there is no audit trail tying the decision back to the run.
Native Approvals brings the human-in-the-loop step into Tropic itself. Agents call an approval primitive with structured context (what they want to do, why, and what happens on approve/deny). The request shows up in the Tropic dashboard, in mobile push, and optionally mirrored to the user's preferred channel. The approver hits approve or deny with an optional comment, the audit log records the full decision, and the agent resumes with the result.
The same primitive will back the REQUIRE CONFIRM tier of Sondera/Cedar policies, so policy-driven confirmations and skill-driven confirmations share one interface and one audit trail.
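The shape of such a primitive can be sketched in Python as a request object plus a queue that records every state change. The class names, fields, and status values below are assumptions for illustration, not the Tropic API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ApprovalRequest:
    run_id: str
    action: str              # what the agent wants to do
    reason: str              # why it wants to do it
    status: str = "pending"  # pending -> approved | denied


@dataclass
class ApprovalQueue:
    audit_log: list = field(default_factory=list)

    def request(self, run_id, action, reason):
        req = ApprovalRequest(run_id, action, reason)
        self._record("requested", req, approver=None, comment="")
        return req

    def decide(self, req, approver, approve, comment=""):
        if req.status != "pending":
            raise ValueError("request already decided")
        req.status = "approved" if approve else "denied"
        self._record(req.status, req, approver, comment)
        return approve

    def _record(self, event, req, approver, comment):
        # Every transition is tied to the run that triggered it.
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "event": event,
            "run_id": req.run_id,
            "action": req.action,
            "approver": approver,
            "comment": comment,
        })


queue = ApprovalQueue()
req = queue.request("run-42", "send_payment", "invoice exceeds threshold")
approved = queue.decide(req, approver="alice", approve=True, comment="looks right")
print(req.status, len(queue.audit_log))
```

Keeping the request and the decision in one log is what gives the audit trail that the Slack-emoji workaround lacks.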
WhatsApp was our first messaging channel because it's where most of our Singapore-based users already are. But agents need to meet users where they work.
Slack OAuth integration. Connect your Tropic agent to a Slack workspace. The agent appears as a bot user, responds in channels or DMs, and respects Slack's threading model. Full OAuth flow so workspace admins can approve the integration.
Telegram bot support. Register your agent as a Telegram bot. Supports both private chats and group mentions, with Telegram's native inline keyboard for confirmations.
Discord bot support. Deploy your agent to a Discord server. Channel-scoped permissions, slash command support, and thread-based conversations.
All channels will go through Tropic's security layer. Sondera policies apply regardless of whether the message came from WhatsApp, Slack, or Discord.
Multi-provider model support is live. Tropic supports every model that OpenClaw supports, including Claude, OpenAI, OpenRouter, Gemini, xAI, Groq, Mistral, Together, Moonshot, and Z.ai. Bring your own provider key in Settings, or connect ChatGPT via OAuth without copying an API key.
Model selection is a per-instance setting. You can run one agent on Claude Opus for complex reasoning and another on GPT-5.4 for high-volume, low-cost tasks. The launch modal only surfaces models for providers you have keys configured for.
Tested in production: OpenAI API, OpenAI OAuth (ChatGPT Plus/Pro), Anthropic API key, and OpenRouter API key. Other providers are wired up through OpenClaw but have not been individually verified by us yet. If you hit issues with a specific model or provider, please raise them at michael@tropic.bot and we will investigate.
Local machine support is live. You can now connect your own hardware to the Tropic control plane through SSM hybrid activation. The dashboard provides an interactive setup wizard: click "Add Local Machine", get a setup token, run a single command, and your machine appears in the dashboard within seconds.
Local instances support the full feature set: agent deployment, skill installation, credential sync, settings management, and security policy enforcement. The same Sondera and SecureClaw protections that run on cloud VMs apply to local machines.
Connectivity is resilient. The SSM agent handles reconnection automatically when your machine goes offline and comes back. Status indicators in the dashboard show real-time online/offline state.
The agent infrastructure landscape is evolving fast. Multiple standards and frameworks are emerging for agent observability, communication, and orchestration. We're actively researching how these fit into Tropic's architecture.
OpenTelemetry (OTEL) for Agents. OTEL is the de facto standard for distributed tracing and metrics in microservices. Agent workloads have similar observability needs: trace a request from user message through tool calls through LLM inference and back. We're evaluating OTEL's semantic conventions for LLM/GenAI workloads and whether to adopt them as Tropic's native telemetry format.
Agent-to-Agent Protocol (A2A). Google's A2A protocol defines how agents discover, communicate, and delegate to each other. As multi-agent workflows become more common, having a standard wire protocol matters. We're evaluating A2A for inter-agent coordination on the Tropic platform. Can a research agent on one VM delegate a coding task to a coding agent on another?
AWS Strands and AgentCore. Amazon's Strands SDK provides agent building primitives (tool use, memory, planning), while AgentCore is their managed runtime for deploying agents at scale. Given that Tropic already runs on AWS infrastructure, there's a natural integration path. We're evaluating whether AgentCore's runtime model (identity, observability, memory management) can complement or replace parts of our current VM-based isolation.
Pi Framework comparison. Pi takes a different approach: a full-stack agent framework with built-in tool management, conversation memory, and deployment primitives. We're comparing Pi's opinionated approach against the composable OTEL + A2A + Strands stack to determine which philosophy better serves Tropic's security-first architecture.
No decisions yet. This is active research. Our guiding principle: adopt standards that make agents more observable and interoperable, but never at the cost of security isolation. If a standard requires relaxing our sandbox boundaries, we won't adopt it.
We intentionally exclude things we're not confident about. If it's not listed here, it doesn't mean we won't build it. It just means we haven't committed to it yet. We'd rather under-promise and over-deliver.
Have something you want to see? We're a small team and we build based on what our users need. Reach out at michael@tropic.bot or find us on the channels above.