Designed and implemented a fully instrumented, cloud-native observability and telemetry framework for hosted, fine-tuned, and proxied LLMs operating in enterprise-grade production environments. Delivered end-to-end visibility into AI/ML training pipelines, inference workloads, and model serving infrastructure. Developed unified tracing, logging, and metrics pipelines leveraging distributed-systems observability standards to surface granular insights across token-level usage, GPU/CPU resource saturation, container/node performance, latency distributions (P50–P99), error propagation, and emergent model-behavior "flares." Integrated cross-cloud observability stacks to support proactive SRE practices, automated incident triage, drift detection, and optimization of high-throughput AI workloads.
Responsibilities:
Designed and implemented a routing layer that dynamically selected LLM backends based on latency, token cost, throughput, and task-specific quality, including fallback and failover strategies across regions/providers to meet strict SLAs.
Built GenAI services using Azure OpenAI and Azure Cognitive Services, tuning concurrency, batching, and rate-limit handling to improve domain relevance, safety compliance, and multilingual support in enterprise applications.
Automated deployment and lifecycle management of GenAI workloads using Argo Workflows, ArgoCD, Jenkins, and GitOps-based configuration management, enabling repeatable infrastructure changes, reducing configuration drift, and achieving ~40% faster release cycles.
Delivered developer tooling, including Python and Go SDKs, internal CLI/automation scripts, and out-of-the-box observability integrations (Prometheus/Grafana), that reduced integration effort, standardized access patterns to LLM services, and accelerated AI adoption across product teams.
Ensured performance and cost efficiency by monitoring token utilization, CPU/GPU consumption (where applicable), autoscaling behavior, and model-specific SLAs; implemented per-tenant quotas, rate limiting, and governance controls to provide fair resource allocation, security, and full auditability in a multi-tenant environment.
Collaborated with platform and infrastructure teams to tune GPU requests/limits, pod placement, and batching strategies, maximizing GPU utilization while meeting p95/p99 latency and uptime objectives.
Deployed NVIDIA GPU Operator in production Kubernetes clusters to deliver reliable, automated GPU driver/tooling management and expose robust Prometheus-ready GPU telemetry.
Designed and delivered a production-grade agentic AI platform supporting full-stack development, orchestration, deployment, and observability of LLM-driven autonomous agents across multiple enterprise business units. Implemented standardized blueprints for agent topologies, tool invocation layers, RAG/Indexing pipelines, and LLMOps workflows, ensuring horizontal scalability, fault tolerance, auditability, and regulatory alignment within a highly controlled financial-services environment. Built comprehensive evaluation, governance, and safety frameworks that accelerated organizational adoption of AI copilots and significantly reduced time-to-market for new intelligent-automation workloads.
Responsibilities:
2014 - 2017
Distributed Software Systems
TU Darmstadt (Germany)
Degree: Master of Science
AI/ML Platform Engineering & LLMOps
Deep expertise in building production-grade GenAI platforms and agentic AI systems, with comprehensive experience in LLM deployment, fine-tuning, RAG pipelines, and model orchestration. Specialized in architecting multi-tenant AI infrastructure that balances performance, cost optimization, and enterprise security requirements.
Cloud-Native Observability and SRE
Expert in designing end-to-end observability solutions for distributed systems and AI/ML workloads using OpenTelemetry, Prometheus, and Grafana. Proven ability to instrument complex environments from token-level metrics to infrastructure telemetry, enabling proactive incident management, anomaly detection, and data-driven optimization of high-throughput systems.
Distributed Systems and Platform Engineering
Strong foundation in building scalable, cloud-native platforms with expertise in Kubernetes ecosystem, control-plane architecture, and microservices orchestration. Skilled in implementing GitOps workflows, CI/CD automation, and infrastructure-as-code practices to deliver reliable, self-service platforms for enterprise-scale deployments.
Cloud Platforms & Services
Container Orchestration & Infrastructure
Observability & Monitoring
Programming Languages
Data Storage & Caching
DevOps & Automation
Networking & Communication
Development Practices & Patterns
Security & Compliance
Data & ML Tools