Tasmanian Cloud Architecture Planning 2024

This document outlines the current and planned architecture for Tasmanian Cloud infrastructure, reflecting recent decisions and technologies.

Overview

Tasmanian Cloud is evolving from a simple Proxmox-based VPS provider to a comprehensive multi-tenant cloud platform with Kubernetes, multi-cloud support, and modern developer tooling.

Current Architecture (As-Is)

Infrastructure Layer

  • Proxmox VE 9.1.4 cluster (twnhost1-4)
  • LXC containers for system services
  • KVM VMs for customer workloads
  • Local-lvm storage per node
  • UniFi networking with multiple VLANs (3, 60-64, 545, 645)
  • NetBird VPN for secure access

Network Configuration

| Host     | Network                 | Purpose                    |
|----------|-------------------------|----------------------------|
| twnhost1 | 25GbE Mellanox CX5      | High-performance workloads |
| twnhost2 | 2.5GbE                  | General workloads          |
| twnhost3 | 10GbE SFP+ + 10GbE RJ45 | Storage + network services |
| twnhost4 | 2.5GbE                  | General workloads          |

Current Services

  • GlobalSO (Wazuh, Pangolin, NetBird proxy)
  • Paymenter (billing)
  • Various monitoring tools

Planned Architecture (To-Be)

Platform Vision: Unified Management Experience

Currently, VPS (KVM/LXC), Kubernetes, and Templates are managed through separate interfaces. We are actively developing a unified platform that brings these together:

| Feature            | Current State         | Planned                       | Timeline |
|--------------------|-----------------------|-------------------------------|----------|
| KVM/LXC Management | O2S Portal            | Unified API + CLI             | Q2 2024  |
| Kubernetes         | Separate provisioning | Unified control plane         | Q2 2024  |
| Docker PaaS        | Not available         | Coolify integration           | Q3 2024  |
| Template System    | Basic                 | Docker Compose + AI assistant | Q3 2024  |
| Unified API        | Separate endpoints    | Single REST/GraphQL API       | Q2 2024  |

Customer Impact: Customers will be able to manage VMs, Kubernetes clusters, and containerized applications from a single interface with consistent authentication, billing, and networking.


1. Proxmox Infrastructure Updates

LXC for System Services

Continue using LXC for:

  • VPN gateways (NetBird/Headscale)
  • Monitoring stack (Prometheus, Grafana)
  • Databases (PostgreSQL primary)
  • Reverse proxies (Pangolin → future Traefik)
  • Management tools (Salt master, Temporal)

KVM for Customer Workloads

Standard VM sizes:

| Size   | vCPU | RAM  | Disk  | Use Case         |
|--------|------|------|-------|------------------|
| Small  | 1    | 2GB  | 20GB  | Development      |
| Medium | 2    | 4GB  | 40GB  | Production apps  |
| Large  | 4    | 8GB  | 80GB  | Databases        |
| XLarge | 8    | 16GB | 160GB | High performance |
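
A minimal sketch of how these presets could be encoded in the Rust management API so provisioning code cannot drift from the published matrix; the `VmSize` and `VmSpec` names are illustrative, not taken from the actual codebase:

```rust
// Illustrative encoding of the standard VM sizes from the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VmSize {
    Small,
    Medium,
    Large,
    XLarge,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct VmSpec {
    pub vcpu: u32,
    pub ram_gb: u32,
    pub disk_gb: u32,
}

impl VmSize {
    /// Resolve a size preset to its concrete resource allocation.
    pub fn spec(self) -> VmSpec {
        match self {
            VmSize::Small => VmSpec { vcpu: 1, ram_gb: 2, disk_gb: 20 },
            VmSize::Medium => VmSpec { vcpu: 2, ram_gb: 4, disk_gb: 40 },
            VmSize::Large => VmSpec { vcpu: 4, ram_gb: 8, disk_gb: 80 },
            VmSize::XLarge => VmSpec { vcpu: 8, ram_gb: 16, disk_gb: 160 },
        }
    }
}
```

Keeping the matrix in one typed place means the API, CLI, and billing all resolve "medium" to the same numbers.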

2. Talos Linux for Kubernetes

Replace Ubuntu with Talos Linux for all Kubernetes nodes:

Why Talos:

  • Immutable OS (read-only root filesystem)
  • API-driven management (no SSH)
  • Automatic updates with rollback
  • Minimal attack surface (~80MB)

Implementation:

Stage 1: Proxmox (OpenTofu)
├── 3x Control Plane VMs (Talos)
├── 3x Worker VMs (Talos)
└── Cilium CNI (L2 LB, Hubble)

Stage 2: Kubernetes
├── Cilium installation
├── Gateway API
└── Storage classes

Stage 3: ArgoCD
├── GitOps deployment
└── App of apps pattern

Reference: proxmox-talos-opentofu

3. vCluster for Multi-Tenancy

Option 1: vCluster per Tenant

Tenant A → vcluster-tenant-a → Full cluster access
Tenant B → vcluster-tenant-b → Full cluster access

Option 2: Namespaces for Simple Containers

Talos Cluster
├── namespace: tenant-a (NetworkPolicy isolation)
├── namespace: tenant-b (NetworkPolicy isolation)
└── Coolify for container deployments

Decision:

  • vCluster for teams needing full K8s control
  • Namespaces + Coolify for simple container hosting

4. Crossplane for Infrastructure as Code

Provider Stack:

  • provider-proxmox-bpg - Proxmox VM/LXC management
  • provider-kubernetes - In-cluster resources
  • provider-helm - Application deployment
  • provider-aws/gcp - Multi-cloud failover (future)

GitOps Integration:

Git Repo → Flux → Crossplane → Proxmox API

Customer-Facing API:

apiVersion: tascloud.io/v1alpha1
kind: VirtualMachine
spec:
  size: medium
  image: ubuntu-22.04
  region: sydney
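
Behind that claim, a Crossplane Composition would map `spec.size` onto a provider-managed resource. The sketch below uses standard Composition scaffolding, but the `apiVersion`/`kind` and field paths of the base resource are hypothetical; consult provider-proxmox-bpg's actual CRDs for the real schema:

```yaml
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: virtualmachine-proxmox
spec:
  compositeTypeRef:
    apiVersion: tascloud.io/v1alpha1
    kind: VirtualMachine
  resources:
    - name: proxmox-vm
      base:
        # Illustrative resource kind; not the provider's real schema.
        apiVersion: proxmox.example.org/v1alpha1
        kind: VM
      patches:
        - fromFieldPath: spec.size
          toFieldPath: spec.forProvider.sizePreset
        - fromFieldPath: spec.image
          toFieldPath: spec.forProvider.template
```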

5. Storage Architecture

Phase 1: Ceph Migration (Current)

Problem: VM disks sit on per-node local-lvm storage, so migrations must copy disks over the network and are slow

Solution:

Ceph Cluster (3+ nodes)
├── vm-fast pool (NVMe SSD)
│   └── VM root disks
├── vm-data pool (SATA SSD)
│   └── VM data volumes
└── backups pool (compressed)
    └── Automated backups

Migration Strategy:

  • Side-by-side deployment (keep existing network)
  • Add vmbr50 (25GbE) for Ceph only
  • Gradual VM migration with zero downtime

Phase 2: Tiered Storage

| Tier | Media         | Use Case                |
|------|---------------|-------------------------|
| Hot  | NVMe          | VM OS, active databases |
| Warm | SATA SSD      | Logs, backups           |
| Cold | Cloudflare R2 | Archives, compliance    |
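
One way to make the tiering table operational is a simple placement policy keyed on recency of access. The thresholds and the `StorageTier` type below are assumptions for illustration, not production values:

```rust
// Illustrative tier-selection policy matching the tiered-storage table.
#[derive(Debug, PartialEq, Eq)]
pub enum StorageTier {
    Hot,  // NVMe: VM OS disks, active databases
    Warm, // SATA SSD: logs, backups
    Cold, // Cloudflare R2: archives, compliance
}

/// Pick a tier from the number of days since the volume was last accessed.
/// Thresholds (7 and 90 days) are placeholder assumptions.
pub fn tier_for_last_access(days: u64) -> StorageTier {
    match days {
        0..=7 => StorageTier::Hot,
        8..=90 => StorageTier::Warm,
        _ => StorageTier::Cold,
    }
}
```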

6. Networking Updates

High-Speed Storage Network

vmbr50 (25GbE)
├── twnhost1 ↔ twnhost3
└── Ceph replication traffic only

SDN Zones per Tenant

Zone: tenant-acme (VXLAN 10000)
├── VNet: vms (10.100.1.0/24)
├── VNet: k8s (10.100.2.0/24)
└── VNet: services (10.100.3.0/24)
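
The per-tenant addressing scheme generalizes mechanically. The sketch below extrapolates an allocation rule (base VNI 10000, three consecutive /24s per tenant out of 10.100.0.0/16) from the tenant-acme example; the rule itself is an assumption, not a confirmed spec:

```rust
// Sketch of deterministic per-tenant network allocation.
pub struct TenantNetwork {
    pub vni: u32,      // VXLAN network identifier for the SDN zone
    pub vms: String,      // VNet for VMs
    pub k8s: String,      // VNet for Kubernetes
    pub services: String, // VNet for shared services
}

/// Derive a tenant's zone and VNets from its allocation index.
/// Tenant 0 reproduces the tenant-acme example above.
pub fn tenant_network(tenant_index: u32) -> TenantNetwork {
    let base = tenant_index * 3; // three /24 VNets per tenant
    TenantNetwork {
        vni: 10_000 + tenant_index,
        vms: format!("10.100.{}.0/24", base + 1),
        k8s: format!("10.100.{}.0/24", base + 2),
        services: format!("10.100.{}.0/24", base + 3),
    }
}
```

Deterministic derivation avoids storing a separate IPAM record per VNet and makes zone definitions reproducible from the tenant index alone.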

VPN Integration

  • Tailscale - Primary VPN mesh
  • NetBird - Alternative/backup
  • Headscale - Self-hosted coordination

7. Application Platform

Coolify for Container Hosting

Best for: Simple container deployments

Features:

  • Git-based deployment
  • Automatic SSL
  • One-click databases
  • Preview environments

Integration:

  • O2S frontend
  • Logto authentication
  • Paymenter billing

Custom Template System

Features:

  • 50+ pre-built templates
  • Docker Compose converter
  • AI assistant for deployment
  • Visual composer (drag-and-drop)

8. Developer Portal (Backstage)

Purpose: Internal development, not customer-facing deployment

Integrations:

  • Service catalog (GitLab, Kubernetes)
  • Proxmox management (custom plugin)
  • Scalar API documentation
  • Grafana dashboards (embedded)
  • GitOps cluster view (Flux)

NOT for:

  • Customer VM provisioning (use O2S)
  • Production deployments (use Temporal)

9. Management API (Rust)

Core Services:

tascloud-platform/
├── crates/
│   ├── tascloud-core      # Domain models
│   ├── tascloud-api       # Axum REST API
│   ├── tascloud-cli       # Binary Lane-style CLI
│   ├── tascloud-worker    # Temporal workflows
│   └── tascloud-temporal  # Orchestration
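
As an illustration of what tascloud-core's domain models might look like, here is a hypothetical VM lifecycle state machine with checked transitions; all names and events are invented for this sketch:

```rust
// Hypothetical tascloud-core domain model: VM lifecycle as a state machine.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VmState {
    Provisioning,
    Running,
    Stopped,
    Deleted,
}

impl VmState {
    /// Return the next state for an event, or None if the transition
    /// is invalid from the current state.
    pub fn transition(self, event: &str) -> Option<VmState> {
        use VmState::*;
        match (self, event) {
            (Provisioning, "provisioned") => Some(Running),
            (Running, "stop") => Some(Stopped),
            (Stopped, "start") => Some(Running),
            (Running, "delete") | (Stopped, "delete") => Some(Deleted),
            _ => None,
        }
    }
}
```

Encoding transitions in the core crate lets the API, CLI, and workers reject invalid operations (e.g. starting a deleted VM) before touching Proxmox.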

Technology Stack:

  • Language: Rust
  • Web Framework: Axum
  • Database: PostgreSQL + Valkey (cache)
  • Object Storage: RustFS (S3-compatible)
  • Backups: Cloudflare R2
  • Workflows: Temporal
  • Search: Meilisearch

10. Central SDK

Architecture:

Rust Core (tascloud-sdk)
    ↓ FFI
├── Python bindings
├── Go bindings
├── Node.js bindings
└── Java bindings
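
At the FFI boundary, the Rust core exposes C-ABI functions that the Python/Go/Node.js/Java bindings wrap. The function below is a minimal sketch of that shape, not part of any real tascloud-sdk surface; a real export would also carry `#[no_mangle]` (or `#[unsafe(no_mangle)]` on the 2024 edition) so bindings can link against a stable symbol name:

```rust
// Sketch of a C-ABI entry point the language bindings could wrap.
// Returns 1 if `size` names a known VM size preset, 0 otherwise.
pub extern "C" fn tascloud_is_valid_size(size: *const std::os::raw::c_char) -> i32 {
    if size.is_null() {
        return 0;
    }
    // SAFETY: the caller must pass a valid, NUL-terminated C string.
    let s = unsafe { std::ffi::CStr::from_ptr(size) };
    match s.to_str() {
        Ok("small") | Ok("medium") | Ok("large") | Ok("xlarge") => 1,
        _ => 0,
    }
}
```

Keeping the exported surface to plain C types (pointers, integers) is what makes a single Rust core consumable from all four binding languages.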

Composite Design:

  • Binary Lane - VPS management style
  • Digital Ocean - App platform style
  • Microsoft 365 - Organization/tenant style

Integration Points

Logto (Centralized Auth)

All Services → Logto OIDC
├── O2S (customer portal)
├── Paymenter (billing)
├── Backstage (internal)
└── Management API

Paymenter (Billing)

Customer → Paymenter → Proxmox
         → Lago (usage metering)

Temporal (Orchestration)

O2S → Temporal Workflow
    ├── Reserve quota
    ├── Paymenter.create_service
    ├── Proxmox.create_vm
    ├── Salt.apply_states
    ├── NetBox.register
    └── NetBird.join
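
The workflow above is a saga: steps run in order, and on failure the completed steps must be compensated in reverse. The self-contained sketch below records compensations rather than invoking real ones, and uses plain functions instead of the Temporal SDK; step names mirror the diagram, everything else is invented for illustration:

```rust
// Simplified saga runner illustrating the provisioning workflow's
// failure handling. A real implementation would be a Temporal workflow
// with activities and compensating actions.
pub type Step = (&'static str, fn() -> Result<(), String>);

pub fn run_provisioning(steps: &[Step], log: &mut Vec<String>) -> Result<(), String> {
    let mut done = Vec::new();
    for (name, action) in steps {
        match action() {
            Ok(()) => {
                log.push(format!("ok: {name}"));
                done.push(*name);
            }
            Err(e) => {
                // On failure, record compensations for completed steps in
                // reverse order (a real saga would invoke them).
                for undone in done.iter().rev() {
                    log.push(format!("rollback: {undone}"));
                }
                return Err(format!("{name} failed: {e}"));
            }
        }
    }
    Ok(())
}
```

This is why Temporal is in the stack: if `Proxmox.create_vm` fails, the quota reservation and billing record created earlier are rolled back rather than leaked.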

Multi-Cloud Strategy (Future)

Phase 1: Proxmox Primary

  • 100% self-hosted
  • Binary Lane as backup

Phase 2: Failover Capabilities

  • Proxmox (primary)
  • Binary Lane (secondary)
  • OVHcloud (tertiary)
  • Hetzner (quaternary)

Phase 3: Full Multi-Cloud

  • Smart scheduling across providers
  • Cost-based routing
  • Geographic distribution

Migration Timeline

Phase 1 (Months 1-2): Foundation

  • Deploy Logto, O2S, Paymenter
  • Basic tenant isolation
  • Ceph storage setup

Phase 2 (Months 3-4): Kubernetes

  • Talos Linux deployment
  • vCluster setup
  • Coolify integration

Phase 3 (Months 5-6): Platform

  • Crossplane integration
  • Template system
  • AI assistant

Phase 4 (Months 7-8): Polish

  • Backstage portal
  • Advanced monitoring
  • Documentation

Phase 5 (Months 9-12): Scale

  • Multi-cloud failover
  • Advanced networking
  • Enterprise features

Key Decisions

Yes

  • ✅ Talos Linux for K8s
  • ✅ Ceph for shared storage
  • ✅ vCluster for multi-tenancy
  • ✅ Coolify for simple containers
  • ✅ Crossplane for IaC
  • ✅ Rust for core services
  • ✅ Temporal for workflows

No

  • ❌ Backstage for customer deployments
  • ❌ Pulumi (use OpenTofu)
  • ❌ Direct Proxmox for apps (use abstraction)
  • ❌ BGP for MVP

Maybe (Future)

  • 🤔 Coolify long-term (vs custom)
  • 🤔 Full multi-cloud (Phase 5)
  • 🤔 Advanced SDN automation

Documentation Structure

content/docs/
├── index.mdx
├── proxmox.mdx        # Proxmox/KVM/LXC
├── talos.mdx          # Talos Linux
├── vcluster.mdx       # Virtual K8s clusters
├── crossplane.mdx     # Infrastructure as code
├── vps.mdx            # VM management
├── templates.mdx      # Template system
├── kubernetes.mdx     # K8s management
├── storage.mdx        # Ceph/storage
├── cli.mdx            # CLI reference
├── api.mdx            # API documentation
├── security.mdx       # Security practices
└── twnstack.mdx       # TWN-specific

Success Metrics

  • Customer provisions VM in < 5 minutes
  • 99% provisioning success rate
  • 10 concurrent customers (MVP)
  • < 1 hour Ceph migration time
  • Zero-downtime Kubernetes updates

Contact

For questions or clarifications on this architecture, contact the Tasmanian Cloud engineering team.