Our vision for the future of Tropic: agent observability, expanded integrations, chaos testing, security hardening, and standards interoperability.
Tropic launched with a focused mission: make it safe and simple to run AI agents in production. Isolated VMs, encrypted credentials, Sondera security policies, WhatsApp integration. That foundation is solid and shipping today.
But we're just getting started. This roadmap lays out where we're headed, from security research that doesn't exist anywhere else, to practical integrations that our users are asking for, to experimental concepts that could change how we think about agent reliability.
Everything here is subject to change. Priorities shift as we learn from users and as the ecosystem evolves. Items marked Research are explorations with no committed timeline. Items marked In Progress are actively being built.
Our most critical near-term deliverable. We're writing a plain-language whitepaper that explains how Tropic secures OpenClaw agents, without requiring the reader to understand cryptography, networking, or infrastructure.
The whitepaper covers: how agent VMs are isolated, how credentials are encrypted at rest and in transit, what Sondera policies enforce, how WhatsApp message routing works, and what happens when an agent goes rogue. The goal is to give compliance teams, founders, and non-technical stakeholders a clear picture of the security posture without drowning them in jargon.
This document will live on our docs site and be available as a downloadable PDF.
Today, Tropic enforces agent behaviour through Sondera security policies: plaintext ALLOW / REQUIRE CONFIRM / DENY rules injected into the agent's system prompt. This works, but it's not auditable, composable, or machine-readable.
We're moving to Cedar (Amazon's open-source policy language) as the underlying policy engine. Cedar policies are formally verifiable, support hierarchical resource models, and can express complex authorization logic that plaintext rules cannot.
On top of Cedar, we're building pre-built policy packs mapped to established security frameworks:
MITRE ATT&CK for Enterprise (Top 10): Policies that detect and block agent behaviours matching known adversary techniques, including command injection via shell tools, credential access through file reads, lateral movement via network calls, data exfiltration through outbound HTTP, and more. Each policy will reference the specific MITRE technique ID (e.g. T1059 Command and Scripting Interpreter).
NIST SP 800-53 Controls: Mappings to relevant NIST controls for access control (AC), audit (AU), system protection (SC), and incident response (IR). This gives enterprises a compliance-ready posture out of the box.
OWASP Top 10 for LLM Applications: Policies addressing prompt injection (LLM01), insecure output handling (LLM02), training data poisoning (LLM03), excessive agency (LLM08), and overreliance (LLM09). We're also tracking the latest research from OWASP's AI Security and Privacy Guide for additional coverage.
Users will be able to browse, toggle, and customize these policy packs from the Tropic dashboard. No need to write Cedar by hand unless you want to.
A fundamental problem with LLM-based security: are your controls actually deterministic? If you tell an agent "never execute rm -rf /", does it obey 100% of the time? 99.9%? 98%? The answer depends on input length, model choice, prompt position, and a dozen other variables.
We're building a testing framework that systematically evaluates security control adherence across:
Input length variation. Do long prompts cause the agent to "forget" its security rules? We test with inputs from 100 tokens to 128k tokens, measuring policy compliance at each length tier.
Multi-model evaluation. The same security policy may hold perfectly on Claude Sonnet but fail on Haiku. We evaluate across every supported model to surface model-specific failure modes.
Adversarial prompting. Standard jailbreak taxonomies (role-playing, encoding tricks, multi-turn escalation) applied specifically to Tropic's security policies, not generic safety benchmarks.
The output is a determinism score per policy per model: a number that tells you "this security rule holds 99.7% of the time on Sonnet 4.6 with inputs under 32k tokens." This gives operators real data to make deployment decisions, rather than trusting vibes.
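The score described above is, at its core, a compliance rate bucketed by model and input length. Here's a small sketch of that computation, assuming trial results are recorded as (model, input-token-count, complied) tuples; the tier boundaries and data shape are illustrative assumptions, not the framework's real design.

```python
from collections import defaultdict

def determinism_scores(trials):
    """trials: iterable of (model, input_tokens, complied: bool).
    Returns {(model, tier_cap): compliance rate}, bucketing input
    length into token tiers (caps are illustrative)."""
    def bucket(tokens):
        for cap in (1_000, 32_000, 128_000):
            if tokens <= cap:
                return cap
        return None  # beyond the tested range

    # (model, tier) -> [complied_count, total_count]
    counts = defaultdict(lambda: [0, 0])
    for model, tokens, ok in trials:
        key = (model, bucket(tokens))
        counts[key][0] += int(ok)
        counts[key][1] += 1
    return {k: c / n for k, (c, n) in counts.items()}
```

A per-policy, per-model table of these rates is exactly the kind of number that lets an operator say "this rule holds 99.7% of the time under 32k tokens" with data behind it.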
Every Tropic VM now ships with a telemetry plugin that captures agent activity in real time. The plugin hooks into OpenClaw's event system and records tool calls, message content (with preview snippets), token usage, response latency, and source channel (web, WhatsApp, API).
The Tropic dashboard includes a dedicated Logs page with filterable telemetry events per instance. You can filter by source channel, search by content, and see conversation flow with request/response previews. Events are stored in Supabase with indexed fields for fast querying.
OpenClaw's built-in Control UI is also exposed through our proxy layer, giving you a per-instance real-time view of active sessions, tool calls, and message history without any additional setup.
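As a rough illustration of what the Logs page filters do, here's a toy version of channel filtering and content search over telemetry events. The event fields below are assumptions for the example, not OpenClaw's actual event schema.

```python
# Illustrative telemetry events; field names are assumptions.
events = [
    {"channel": "whatsapp", "type": "message",   "preview": "book a table for two"},
    {"channel": "web",      "type": "tool_call", "preview": "search_web('openings')"},
    {"channel": "whatsapp", "type": "tool_call", "preview": "send_confirmation(...)"},
]

def filter_events(events, channel=None, query=None):
    """Mimic the Logs page filters: source channel + content search."""
    out = events
    if channel:
        out = [e for e in out if e["channel"] == channel]
    if query:
        out = [e for e in out if query.lower() in e["preview"].lower()]
    return out
```

In production these filters run as indexed Supabase queries rather than in-memory scans, which is what keeps them fast at scale.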
Next up: cross-instance aggregation, anomaly detection, and hallucination measurement research.
Audit logging is live. Every Tropic VM now runs Fluentd alongside auditd, capturing two log streams: OpenClaw gateway activity (tool calls, messages, sessions) and system-level audit events (process execution, network connections, file access, permission changes).
Logs are compressed and shipped to S3 with object lock (compliance mode, 90-day default retention). An index sync service registers each chunk in Supabase with metadata (source, time range, entry count, size), enabling fast filtered queries without scanning S3 directly.
The dashboard includes an Audit Logs section per instance with source filtering, date range selection, expandable log entries, and live tailing via Server-Sent Events. Live tail streams directly from Fluentd on the VM through our nginx proxy layer, with auto-scroll, pause/resume, and connection status indicators.
Retention is configurable per account: 90 days for SME tier, up to 7 years for Enterprise. Audit logs are gated to SME and Enterprise tiers; live tailing is available to all tiers.
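The index-then-query pattern above boils down to a time-range overlap check over chunk metadata. Here's a minimal sketch, assuming each indexed chunk records its source and first/last timestamps; field names are illustrative, not the actual Supabase schema.

```python
from datetime import datetime

def chunks_for_window(index, start, end, source=None):
    """index: list of chunk-metadata dicts with 'source', 'first_ts',
    'last_ts' (names illustrative). Returns chunks overlapping
    [start, end] -- two ranges overlap iff each starts before the
    other ends -- so queries never need to scan S3 directly."""
    return [
        c for c in index
        if (source is None or c["source"] == source)
        and c["first_ts"] <= end
        and c["last_ts"] >= start
    ]
```

Only the chunks this function returns would then be fetched and decompressed from S3, which is why the metadata index makes filtered queries cheap.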
Credit monitoring has improved significantly. The credit deduction cron now tracks per-instance costs with model-aware pricing (Sonnet 50 cr/hr, Opus 100 cr/hr, Haiku 20 cr/hr) and per-skill hourly costs from the marketplace. VMs are automatically stopped when credits hit zero, with proper state management.
The benchmark system now reports token usage and cost per scenario, giving visibility into per-conversation cost attribution across different models and workloads.
Still planned: hourly/daily/weekly usage charts, spend threshold alerts, exportable usage reports for accounting, and a dedicated billing dashboard.
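The deduction arithmetic is simple enough to show directly. This sketch uses the per-model rates quoted above; the skill costs and the stop-at-zero behaviour are modeled loosely, not as the actual cron implementation.

```python
# Model rates from the post (credits per hour). Skill rates come from
# the marketplace; the values passed in below are illustrative.
MODEL_RATES = {"sonnet": 50, "opus": 100, "haiku": 20}

def hourly_cost(model: str, skill_rates: list[int]) -> int:
    """Instance cost per hour: base model rate plus installed skills."""
    return MODEL_RATES[model] + sum(skill_rates)

def deduct(balance: int, model: str, skill_rates: list[int]) -> tuple[int, bool]:
    """One cron tick: deduct an hour's cost; signal a VM stop at zero."""
    new_balance = max(0, balance - hourly_cost(model, skill_rates))
    return new_balance, new_balance == 0
```

So an Opus instance with one 10 cr/hr skill burns 110 credits per hour, and a balance of 120 survives one tick before the stop signal fires on the next.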
OpenClaw version management is now fully operational. Users can select their preferred OpenClaw version from a dropdown when provisioning a new VM, with versions automatically discovered from the npm registry. The Packer pipeline builds AMIs with the selected version pre-installed, configured, and hardened with SecureClaw.
AMI builds include automated compatibility checks: each new OpenClaw release is reviewed for breaking changes (e.g. the plugins.allow requirement in 2026.3.12), and provisioning code is updated before AMIs are rebuilt. The build pipeline handles needrestart suppression, Fluentd/auditd installation, and security baseline hardening automatically.
A benchmark system provides regression detection: 10 scenarios across 3 iterations per run, with automated scoring and confidence ratings. This catches performance regressions and tool-call failures before new versions reach production.
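Regression detection of this kind usually compares mean scenario scores against a baseline run. Here's a hedged sketch: the data shape (per-scenario score lists across iterations) mirrors the 10-scenario, 3-iteration setup above, but the 10% drop threshold is an assumption for illustration, not the benchmark's real cutoff.

```python
from statistics import mean

def detect_regressions(baseline, current, threshold=0.10):
    """baseline/current: {scenario: [score per iteration]}.
    Flags scenarios whose mean score drops by more than `threshold`
    (as a fraction of the baseline mean)."""
    flagged = []
    for scenario, scores in current.items():
        base = mean(baseline[scenario])
        cur = mean(scores)
        if base > 0 and (base - cur) / base > threshold:
            flagged.append(scenario)
    return flagged
```

Running this against every candidate AMI build is what stops a quietly degraded OpenClaw release from reaching production.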
Still planned: a fully automated integration test suite that spins up a fresh instance and validates end-to-end flows (agent deployment, credential sync, WhatsApp pairing, security policy enforcement).
Today, deploying an agent on Tropic requires you to write your own agent configuration: policy, tools, system prompt. That's powerful for advanced users but a barrier for everyone else.
We're building opinionated, ready-to-deploy agent templates for the most common use cases:
Marketing Agents. Content generation, social media scheduling, SEO analysis, competitor monitoring. Pre-configured with web browsing tools, content-safe policies, and brand-voice system prompts.
Research Agents. Literature review, data collection, report synthesis. Equipped with web search, PDF reading, structured output tools, and citation-tracking policies that prevent hallucinated references.
Coding Agents. Code review, bug triage, documentation generation, test writing. Pre-loaded with code execution tools, git integration, and security policies that prevent agents from pushing to production branches or accessing secrets.
Each template comes with a tuned Sondera security policy, recommended tools, and a getting-started guide. Deploy in one click, customize from there.
Inspired by Netflix's Chaos Monkey, ChaosLobster is our concept for systematically stress-testing AI agents before they hit production. The premise: if your agent can't handle degraded conditions gracefully, you'll find out in testing, not from an angry customer.
ChaosLobster injects controlled failures across multiple dimensions:
OS-level chaos. Disk pressure, memory limits, CPU throttling, file permission changes. Does your agent handle "disk full" errors, or does it silently corrupt data?
Network chaos. Latency injection, packet loss, DNS failures, connection resets. How does your agent behave when an API call takes 30 seconds instead of 300ms?
Infrastructure chaos. Instance reboots, credential rotation mid-session, service discovery failures. Can your agent recover from a restart, or does it lose all context?
API chaos. Rate limiting, 500 errors, malformed responses, schema changes. Does your agent retry intelligently, or does it spiral into an error loop?
Database chaos. Connection pool exhaustion, slow queries, read replica lag, constraint violations. How does your agent handle a database that's technically up but practically unusable?
Each chaos scenario produces a resilience score and a detailed failure report. The vision: run ChaosLobster against your agent, get a production-readiness certificate, and know exactly where your weak points are.
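Since ChaosLobster is still a concept, here's only a toy sketch of the core idea: wrap an agent's dependency in a fault injector, then measure how often the agent survives. The failure mode (a simulated 500), the injection rate, and the scoring are all illustrative assumptions.

```python
import random

def chaotic(fn, failure_rate=0.3, rng=None):
    """Wrap a call so it fails randomly -- a toy stand-in for the
    API-chaos dimension (failure mode and rate are illustrative)."""
    rng = rng or random.Random()
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault: simulated 500")
        return fn(*args, **kwargs)
    return wrapped

def resilience_score(agent_step, trials=100, rng_seed=0):
    """Fraction of trials the agent completes without raising,
    while its dependencies are misbehaving."""
    rng = random.Random(rng_seed)
    ok = 0
    for _ in range(trials):
        try:
            agent_step(rng)
            ok += 1
        except Exception:
            pass
    return ok / trials
```

An agent that retries its calls a few times will score markedly higher under this wrapper than one that gives up on the first error, which is exactly the "retry intelligently vs. spiral into an error loop" distinction the API-chaos dimension is meant to surface.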
WhatsApp was our first messaging channel because it's where most of our Singapore-based users already are. But agents need to meet users where they work.
Slack OAuth integration. Connect your Tropic agent to a Slack workspace. The agent appears as a bot user, responds in channels or DMs, and respects Slack's threading model. Full OAuth flow so workspace admins can approve the integration.
Telegram bot support. Register your agent as a Telegram bot. Supports both private chats and group mentions, with Telegram's native inline keyboard for confirmations.
Discord bot support. Deploy your agent to a Discord server. Channel-scoped permissions, slash command support, and thread-based conversations.
All channels will go through Tropic's security layer. Sondera policies apply regardless of whether the message came from WhatsApp, Slack, or Discord.
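One way to keep policies channel-agnostic is to normalize every platform's payload into a single message shape before the policy check runs. This is a hypothetical sketch: the adapter field names and the toy phrase-based check are assumptions, not Tropic's routing code or a real Sondera policy.

```python
# Sketch: each channel adapter maps its native payload into one shape,
# so policy logic never branches on channel. Field names are assumptions.
def normalize(channel: str, raw: dict) -> dict:
    if channel == "whatsapp":
        return {"channel": channel, "sender": raw["from"], "text": raw["body"]}
    if channel == "slack":
        return {"channel": channel, "sender": raw["user"], "text": raw["text"]}
    if channel == "telegram":
        return {"channel": channel, "sender": str(raw["chat_id"]), "text": raw["message"]}
    raise ValueError(f"unknown channel: {channel}")

def guard(message: dict, blocked_phrases=("ignore previous instructions",)) -> bool:
    """Toy policy check, applied uniformly after normalization."""
    return not any(p in message["text"].lower() for p in blocked_phrases)
```

Because enforcement happens after normalization, adding Slack, Telegram, or Discord means writing one adapter each, while the security layer stays untouched.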
Tropic currently supports Claude models (Sonnet, Opus, Haiku) as the underlying LLM. We're expanding model support to give users more choice:
ChatGPT (OpenAI). Use GPT-4o, GPT-4.1, or o3 as your agent's brain. Bring your own OpenAI API key, or use Tropic's managed key with per-model credit pricing.
OpenRouter. Access 100+ models through a single integration. OpenRouter handles routing, fallback, and rate limiting across providers. Ideal for users who want to experiment with different models or need access to open-source options like Llama, Mistral, or Gemma.
Model selection will remain a per-instance setting. You can run one agent on Claude Opus for complex reasoning and another on GPT-4o-mini for high-volume, low-cost tasks.
Local machine support is live. You can now connect your own hardware to the Tropic control plane through SSM hybrid activation. The dashboard provides an interactive setup wizard: click "Add Local Machine", get a setup token, run a single command, and your machine appears in the dashboard within seconds.
Local instances support the full feature set: agent deployment, skill installation, credential sync, settings management, and security policy enforcement. The same Sondera and SecureClaw protections that run on cloud VMs apply to local machines.
Connectivity is resilient. The SSM agent handles reconnection automatically when your machine goes offline and comes back. Status indicators in the dashboard show real-time online/offline state.
The agent infrastructure landscape is evolving fast. Multiple standards and frameworks are emerging for agent observability, communication, and orchestration. We're actively researching how these fit into Tropic's architecture.
OpenTelemetry (OTEL) for Agents. OTEL is the de facto standard for distributed tracing and metrics in microservices. Agent workloads have similar observability needs: trace a request from user message through tool calls through LLM inference and back. We're evaluating OTEL's semantic conventions for LLM/GenAI workloads and whether to adopt them as Tropic's native telemetry format.
Agent-to-Agent Protocol (A2A). Google's A2A protocol defines how agents discover, communicate, and delegate to each other. As multi-agent workflows become more common, having a standard wire protocol matters. We're evaluating A2A for inter-agent coordination on the Tropic platform. Can a research agent on one VM delegate a coding task to a coding agent on another?
AWS Strands and AgentCore. Amazon's Strands SDK provides agent building primitives (tool use, memory, planning), while AgentCore is their managed runtime for deploying agents at scale. Given that Tropic already runs on AWS infrastructure, there's a natural integration path. We're evaluating whether AgentCore's runtime model (identity, observability, memory management) can complement or replace parts of our current VM-based isolation.
Pi Framework comparison. Pi takes a different approach: a full-stack agent framework with built-in tool management, conversation memory, and deployment primitives. We're comparing Pi's opinionated approach against the composable OTEL + A2A + Strands stack to determine which philosophy better serves Tropic's security-first architecture.
No decisions yet. This is active research. Our guiding principle: adopt standards that make agents more observable and interoperable, but never at the cost of security isolation. If a standard requires relaxing our sandbox boundaries, we won't adopt it.
This roadmap intentionally excludes things we're not confident about. If something isn't listed here, that doesn't mean we won't build it; it just means we haven't committed to it yet. We'd rather under-promise and over-deliver.
Have something you want to see? We're a small team and we build based on what our users need. Reach out at michael@tropic.bot or find us on the channels above.