Software engineer with expertise in backend and cloud-native development, Kubernetes, and LLM application development.
Updated on 26.11.2025
Profile
Freelancer / Self-employed
Remote work
Available from: 01.12.2025
Availability: 100%
of which on-site: 100%
Software
DevOps
Cloud
Node.js
Cloud Foundry
Kubernetes
Golang
AWS
Prometheus
System Architecture
Container
OpenAI
Chatbot
LangChain
Software Development
Cloud Engineering
English
Fluent
German
Proficient

Locations

Heidelberg (+500km) Zürich (+50km)
Deutschland, Schweiz

possible

Projects

1 year 11 months
2024-01 - 2025-11

Observability Engineering (AI/ML)

Observability Engineer (AI/ML)

Designed and implemented a fully instrumented, cloud-native observability and telemetry framework for hosted, fine-tuned, and proxied LLMs operating in enterprise-grade production environments. Delivered end-to-end visibility into AI/ML training pipelines, inference workloads, and model-serving infrastructure. Developed unified tracing, logging, and metrics pipelines leveraging distributed-systems observability standards to surface granular insights across token-level usage, GPU/CPU resource saturation, container/node performance, latency distributions (P50–P99), error propagation, and emergent model-behavior "flares." Integrated cross-cloud observability stacks to support proactive SRE practices, automated incident triage, drift detection, and optimization of high-throughput AI workloads.

  • Architected end-to-end observability pipelines for LLM APIs and model-serving runtimes using OpenTelemetry SDKs/collectors, Prometheus exporters, and Kubernetes operators, instrumenting the full lifecycle of model training, inference, and system-level resource utilization.
  • Instrumented model endpoints, batch/stream training jobs, and inference gateways to capture high-resolution metrics such as tail latency, throughput (RPS/QPS), token-per-second performance, GPU memory fragmentation, multi-node utilization, error budgets, and anomaly detection signals via statistical and ML-based detectors.
  • Implemented automated alerting and SLO/SLA monitoring using Prometheus Alertmanager and custom anomaly-detection pipelines to identify inference latency regressions, GPU/CPU saturation events, memory leaks, container restarts, and failed model-training runs, reducing mean time to detect (MTTD) and mean time to resolve (MTTR).
  • Conducted architectural reviews, telemetry schema standardization, and iterative refinements to reduce observability overhead, optimize scrape intervals, improve sampling strategies, and enhance distributed tracing efficiency across multi-cloud LLM deployments.
  • Collaborated with ML Ops, SRE, and platform engineering teams to integrate telemetry into CI/CD pipelines, automate environment drift detection, and enable data-driven scaling policies for training and inference clusters.
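As an illustration of the latency-distribution work above, tail percentiles (P50/P95/P99) can be derived from raw request latencies with the standard library alone. This is a minimal sketch, not the production pipeline; the sample data is made up for demonstration.

```python
from statistics import quantiles

def tail_latencies(samples_ms):
    """Return P50/P95/P99 from raw request latencies in milliseconds.

    quantiles(..., n=100) yields 99 cut points; index k-1 is the
    k-th percentile.
    """
    cuts = quantiles(samples_ms, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Illustrative workload: a mostly-fast endpoint with a slow tail.
samples = [20.0] * 90 + [200.0] * 9 + [2000.0]
pcts = tail_latencies(samples)
```

In practice such percentiles come from Prometheus histogram buckets rather than raw samples, but the interpretation (P99 dominated by the slow tail) is the same.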

Prometheus Grafana Kubernetes Helm Python OpenTelemetry Promitor Dynatrace Go Loki Jaeger Tempo KEDA ArgoCD MLflow
SAP
1 year 11 months
2024-01 - 2025-11

AI/ML Platform Engineering

AI/ML Platform Engineering
Built a scalable GenAI platform for enterprise workloads, integrating advanced LLM-as-a-Backend capabilities with support for RAG pipelines, model routing/orchestration, fine-tuning, and content-moderation workflows. Delivered cross-language SDKs, internal CLI tools, and fully automated CI/CD pipelines to streamline AI adoption for product teams while optimizing for cost efficiency, security, and operational reliability.


Responsibilities:

  • Designed and implemented a model-routing layer that dynamically selected LLM backends based on latency, token cost, throughput, and task-specific quality, including fallback and failover strategies across regions/providers to meet strict SLAs.
  • Built RAG and fine-tuning workflows using Azure OpenAI and Azure Cognitive Services, tuning concurrency, batching, and rate-limit handling to improve domain relevance, safety compliance, and multilingual support in enterprise applications.
  • Automated deployment of GenAI workloads using Argo Workflows, ArgoCD, Jenkins, and GitOps-based configuration management, enabling repeatable infrastructure changes, reducing configuration drift, and achieving ~40% faster release cycles.
  • Delivered developer tooling, including Python and Go SDKs, internal CLI/automation scripts, and out-of-the-box observability integrations (Prometheus/Grafana), that reduced integration effort, standardized access patterns to LLM services, and accelerated AI adoption across product teams.
  • Ensured cost efficiency and operational reliability by monitoring token utilization, CPU/GPU consumption (where applicable), autoscaling behavior, and model-specific SLAs; implemented per-tenant quotas, rate limiting, and governance controls to provide fair resource allocation, security, and full auditability in a multi-tenant environment.
  • Collaborated with platform and infrastructure teams to tune GPU requests/limits, pod placement, and batching strategies to maximize GPU utilization while meeting p95/p99 latency and uptime objectives.
  • Deployed NVIDIA GPU Operator in production Kubernetes clusters to deliver reliable, automated GPU driver/tooling management and expose robust Prometheus-ready GPU telemetry.
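A minimal sketch of the routing idea in the first bullet: score healthy backends on latency, cost, and quality, and return a ranked fallback chain. All backend names, weights, and numbers here are hypothetical, not the production configuration.

```python
def route(backends, weights=(1.0, 1.0, 1.0)):
    """Rank healthy LLM backends by a weighted score of p95 latency (ms),
    cost per 1K tokens, and a quality penalty; lower score is better.
    Returns the ordered fallback chain, best candidate first."""
    wl, wc, wq = weights
    healthy = [b for b in backends if b["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy backend available")
    return sorted(
        healthy,
        key=lambda b: (
            wl * b["p95_ms"]
            + wc * b["cost_per_1k"] * 1000   # scale cost into the same range
            + wq * (1 - b["quality"]) * 1000  # penalize lower-quality models
        ),
    )

# Illustrative backend registry (names and numbers are made up).
backends = [
    {"name": "gpt-large-eu", "p95_ms": 900,  "cost_per_1k": 0.06, "quality": 0.95, "healthy": True},
    {"name": "gpt-small-eu", "p95_ms": 250,  "cost_per_1k": 0.01, "quality": 0.85, "healthy": True},
    {"name": "gpt-large-us", "p95_ms": 1200, "cost_per_1k": 0.06, "quality": 0.95, "healthy": False},
]
chain = route(backends)
```

A caller would try `chain[0]` and fall through the remainder on errors or rate limits, which is what gives the failover behavior across regions/providers.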

LLM as Backend System Architecture Back-End Python Prompt-Engineering Azure OpenAI RAG ArgoCD Jenkins GitOps Go KServe Knative Kubernetes Inference Azure DevOps
SAP
10 months
2023-03 - 2023-12

Agentic AI Platform & RAG Excellence Framework

GenAI Engineer

Designed and delivered a production-grade agentic AI platform supporting full-stack development, orchestration, deployment, and observability of LLM-driven autonomous agents across multiple enterprise business units. Implemented standardized blueprints for agent topologies, tool invocation layers, RAG/Indexing pipelines, and LLMOps workflows, ensuring horizontal scalability, fault tolerance, auditability, and regulatory alignment within a highly controlled financial-services environment. Built comprehensive evaluation, governance, and safety frameworks that accelerated organizational adoption of AI copilots and significantly reduced time-to-market for new intelligent-automation workloads.

Responsibilities:
  • Owned full-lifecycle delivery of agentic AI and ML initiatives, from problem scoping, feature engineering, and dataset curation to model development, quantitative/qualitative evaluation, deployment, and post-production monitoring.
  • Architected advanced agent systems (e.g., planner-executor, hierarchical/multi-agent, tool-augmented agents) leveraging memory modules, reflection loops, action-scoring policies, and controllability/safety constraints.
  • Designed and optimized RAG pipelines, including document ingestion, chunking heuristics, embedding generation, vector store configuration, hybrid retrieval, reranking, caching, and evaluation frameworks for precision/recall, hallucination rate, and latency SLAs.
  • Implemented MCP for standardized, permissioned tool integrations and orchestrated heterogeneous LLM workloads using LangChain, LlamaIndex, and custom microservices for tool execution.
  • Established enterprise-grade LLMOps practices including experiment tracking (MLflow/Weights & Biases), dataset/prompt versioning (DVC/Git), CI/CD pipelines (GitHub Actions/Azure DevOps), model registries, workload autoscaling, telemetry, drift detection, and incident-response runbooks.
  • Enforced reliability, safety, and compliance controls through prompt-injection defenses, schema validation, content-moderation pipelines, differential access controls, policy enforcement layers, adversarial/red-teaming evaluations, and pre-production quality gates.
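The chunking heuristics in the RAG bullet can be illustrated with a fixed-size sliding window with overlap. This is a deliberately simplified sketch; real pipelines typically split on token or sentence boundaries, and the sizes here are arbitrary.

```python
def chunk(text, size=200, overlap=50):
    """Split text into overlapping windows so retrieval does not lose
    context at chunk boundaries. Window step = size - overlap."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

# Illustrative document: 500 characters of cycling letters.
doc = "".join(chr(97 + i % 26) for i in range(500))
pieces = chunk(doc, size=200, overlap=50)
```

Each chunk shares its first 50 characters with the tail of the previous one, so a sentence straddling a boundary is still retrievable as a unit from at least one chunk.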

Python OpenAI LangChain Kubernetes RAG Chatbot Go (Golang) Firebase Firestore LlamaIndex Semantic Kernel Vector Databases Redis MCP AWS FastAPI Asynchronous Programming
Remote
11 months
2022-05 - 2023-03

Development of a distributed orchestration system

Senior software engineer
Built a custom container-orchestration and control-plane platform inspired by Kubernetes, enabling automated cloud resource provisioning for hundreds of products across thousands of tenants. The platform was engineered for horizontal scalability, high availability, and low-latency rollout pipelines.


Responsibilities:

  • Designed and implemented the core control-plane architecture, including an API Server with RBAC, Controller Manager, Scheduler-like reconciliation loops, Namespace isolation, and CRD-style resource definitions, to support secure, multi-tenant operations at scale.
  • Developed client SDKs and code-generation tools to streamline custom controller development for internal engineering teams, ensuring consistency with the platform's declarative model.
  • Integrated WebSockets for real-time event streaming between the control plane and worker nodes, and utilized Redis for distributed caching and message brokering.
  • Employed LocalStack for local AWS service emulation in CI/CD pipelines.
  • Integrated Prometheus, Alertmanager, and Grafana for end-to-end metrics instrumentation, distributed tracing, and proactive anomaly detection. Implemented service liveness/readiness probes, exporters, and dashboards to improve reliability and observability.
  • Implemented Kubernetes-native patterns: Deployments, StatefulSets, DaemonSets, custom operators.
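At its core, a reconciliation loop of the kind listed above is a desired-vs-actual diff that emits converging actions. The platform itself was written in Go; this is a language-agnostic sketch in Python with illustrative resource names.

```python
def reconcile(desired, actual):
    """One reconciliation pass: compare the desired and actual resource
    sets and return the create/delete actions that converge the actual
    state onto the declared (desired) state. Idempotent by construction:
    running it again after the actions apply yields no further actions."""
    to_create = sorted(set(desired) - set(actual))
    to_delete = sorted(set(actual) - set(desired))
    return [("create", r) for r in to_create] + [("delete", r) for r in to_delete]

# Illustrative state: what the declarative spec asks for vs. what runs.
desired = {"api-server", "billing", "metering"}
actual = {"api-server", "legacy-worker"}
actions = reconcile(desired, actual)
```

A controller runs this pass in a loop (triggered by watch events or a resync timer), which is what makes the model self-healing: drift in `actual` is corrected on the next pass.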

Go(lang) WebSocket OpenSearch LocalStack Redis Prometheus Grafana AWS S3 Kubernetes RBAC Distributed Systems Webhook
SAP
Walldorf
1 year 8 months
2020-09 - 2022-04

Enablement of SAP Analytics Cloud's SaaS offering

Senior cloud engineer
Developed core SaaS capabilities for SAP Analytics Cloud, including service broker, billing, and metering microservices, enabling pay-as-you-go access.
  • Built service broker for Cloud Foundry using Node.js.
  • Developed billing and metering microservices to track and charge resource usage.
  • Implemented Prometheus/Grafana monitoring to ensure system reliability.
  • Designed APIs for seamless microservice integration and external communication.
  • Ensured service scalability, supporting hundreds of concurrent tenants reliably.
Cloud Foundry Node.js Java Prometheus API gateway Redis Postgres Grafana
Walldorf
1 year 5 months
2019-04 - 2020-08

Developed cloud infrastructure for SAP HANA-as-a-Service

DevOps engineer
Built robust cloud infrastructure on AWS for HANA-as-a-Service, automating deployments, upgrades, and multi-region provisioning.
  • Developed Terraform and Ansible playbooks for installation and automated upgrades.
  • Built a lightweight agent to respond to Consul changes, triggering relevant playbooks.
  • Implemented APIs for customer HANA system orders.
  • Designed secure VPC architecture and automated backup/recovery workflows using AWS S3 and Glacier.
  • Reduced deployment time by 40% while maintaining compliance and security standards.
HashiCorp (Terraform/ Vault/ Consul) Ansible AWS (VPC/ EC2/ S3/ Glacier/ CloudWatch/ API Gateway) Cloud Foundry Python Go (Golang) Bash
SAP
10 months
2018-07 - 2019-04

Development of an elastic caching microservice

Software Engineer

Developed an elastic caching microservice in Go to accelerate analytical query performance for a multi-tenant analytics platform, implementing context-aware caching with user permissions, roles, and cube dimension metadata.

Implementation:
  • Built cloud-native microservice following 12-factor app principles with stateless design, externalized configuration, and graceful shutdown
  • Implemented multi-layer caching strategy using Redis (L1 in-memory cache with TTL/LRU eviction) and MongoDB (L2 persistent cache for complex query metadata)
  • Designed cache key generation algorithm incorporating RBAC permissions, tenant isolation, and OLAP cube context (dimensions, measures, filters)
  • Developed cache invalidation strategies with pub/sub patterns for real-time data updates
  • Integrated Prometheus with custom metrics (cache hit/miss ratios, query latency percentiles, eviction rates)
  • Built Grafana dashboards for real-time performance monitoring and capacity planning
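The cache-key algorithm above boils down to canonicalize-then-hash: sort every unordered component so logically identical queries map to the same key, and include tenant and roles so results never leak across tenants or permission sets. The service was written in Go; this is a sketch of the idea in Python with illustrative field names.

```python
import hashlib
import json

def cache_key(tenant, roles, cube, dimensions, measures, filters):
    """Deterministic cache key incorporating RBAC roles, tenant isolation,
    and OLAP cube context (dimensions, measures, filters)."""
    canonical = json.dumps(
        {
            "tenant": tenant,                   # hard tenant isolation
            "roles": sorted(roles),             # order-insensitive RBAC context
            "cube": cube,
            "dims": sorted(dimensions),
            "measures": sorted(measures),
            "filters": sorted(filters.items()),
        },
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode()).hexdigest()

# Same query, role list in a different order -> same key; other tenant -> different key.
k1 = cache_key("t1", ["viewer", "analyst"], "sales", ["region"], ["revenue"], {"year": 2023})
k2 = cache_key("t1", ["analyst", "viewer"], "sales", ["region"], ["revenue"], {"year": 2023})
k3 = cache_key("t2", ["analyst", "viewer"], "sales", ["region"], ["revenue"], {"year": 2023})
```

Hashing the canonical form (rather than concatenating raw values) keeps keys fixed-length for Redis and avoids delimiter-injection collisions between fields.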
Go(lang) Redis MongoDB Kubernetes Prometheus 12-factor app
SAP

Education and Training


2014 - 2017
Distributed Software Systems
TU Darmstadt (Germany)
Degree: Master of Science

Position


Software and DevOps engineer with focus on cloud-native development and LLM application development.

Competencies


Top Skills

Software DevOps Cloud Node.js Cloud Foundry Kubernetes Golang AWS Prometheus System Architecture Container OpenAI Chatbot LangChain Software Development Cloud Engineering

Focus Areas

AI/ML Platform Engineering & LLMOps
Advanced
Cloud-Native Observability & SRE
Expert
Distributed Systems & Platform Engineering
Advanced

AI/ML Platform Engineering & LLMOps

Deep expertise in building production-grade GenAI platforms and agentic AI systems, with comprehensive experience in LLM deployment, fine-tuning, RAG pipelines, and model orchestration. Specialized in architecting multi-tenant AI infrastructure that balances performance, cost optimization, and enterprise security requirements.


Cloud-Native Observability and SRE

Expert in designing end-to-end observability solutions for distributed systems and AI/ML workloads using OpenTelemetry, Prometheus, and Grafana. Proven ability to instrument complex environments from token-level metrics to infrastructure telemetry, enabling proactive incident management, anomaly detection, and data-driven optimization of high-throughput systems.


Distributed Systems and Platform Engineering

Strong foundation in building scalable, cloud-native platforms with expertise in Kubernetes ecosystem, control-plane architecture, and microservices orchestration. Skilled in implementing GitOps workflows, CI/CD automation, and infrastructure-as-code practices to deliver reliable, self-service platforms for enterprise-scale deployments.

Areas of Responsibility

System Architecture
Advanced
Software Engineering
Expert
DevOps
Advanced
  • Architecture and implementation of enterprise GenAI platforms supporting RAG, fine-tuning, model routing, and content moderation workflows
  • Design and deployment of observability frameworks for AI/ML systems, including distributed tracing, metrics pipelines, and SLO/SLA monitoring
  • Development of autonomous agent systems with tool integration, memory modules, and safety/governance controls
  • Building cloud-native microservices and APIs for multi-tenant SaaS offerings with focus on scalability and reliability
  • Infrastructure automation using GitOps, CI/CD pipelines, and infrastructure-as-code across AWS and Azure environments
  • Implementation of control-plane architectures for container orchestration and resource provisioning at scale
  • Establishment of LLMOps practices including experiment tracking, model versioning, drift detection, and compliance enforcement
  • Performance optimization through caching strategies, autoscaling policies, and resource utilization monitoring
  • Cross-functional collaboration with ML Ops, SRE, and platform engineering teams to accelerate AI adoption
  • Security implementation including RBAC, multi-tenant isolation, and content moderation pipelines

Products / Standards / Experience / Methods

DevOps
Advanced
Software
Expert
AWS
Advanced
OpenAI
Expert
Kubernetes
Advanced
Observability
Advanced
GenAI
Expert
Development
Advanced
Profile
As a freelance software engineer, I deliver tailored, scalable software solutions for enterprise systems. My focus spans software architecture and development, DevOps and LLM-powered AI applications, with strong expertise in building reliable, observable and cost-efficient distributed systems on AWS and Azure.

AI/ML & GenAI
OpenAI, Azure OpenAI, LangChain, LlamaIndex, Semantic Kernel, RAG (Retrieval-Augmented Generation), Prompt Engineering, LLM Fine-tuning, Model Inference, Vector Databases, MLflow, KServe, Knative, MCP (Model Context Protocol), Chatbot Development, Agentic AI Systems


Cloud Platforms & Services

AWS (VPC, EC2, S3, Glacier, CloudWatch, API Gateway), Azure DevOps, Azure Cognitive Services, Cloud Foundry, Multi-cloud Architecture


Container Orchestration & Infrastructure

Kubernetes, Helm, ArgoCD, Argo Workflows, Docker, StatefulSets, DaemonSets, Custom Operators, Control Plane Architecture


Observability & Monitoring

OpenTelemetry, Prometheus, Grafana, Dynatrace, Promitor, Loki, Jaeger, Tempo, Alertmanager, Distributed Tracing, Metrics Engineering, SLO/SLA Monitoring


Programming Languages

Python, Go (Golang), Node.js, Java, Bash


Data Storage & Caching

Redis, MongoDB, PostgreSQL, Firebase Firestore, Vector Databases, OpenSearch, AWS S3


DevOps & Automation

GitOps, Jenkins, Terraform, Ansible, HashiCorp (Vault, Consul, Terraform), LocalStack, CI/CD Pipelines, GitHub Actions, Infrastructure-as-Code (IaC)


Networking & Communication

REST APIs, WebSocket, API Gateway, RBAC, Service Mesh


Development Practices & Patterns

Microservices Architecture, 12-Factor App Principles, SRE Practices, LLMOps, MLOps, Multi-tenant Design, Distributed Systems, Event-Driven Architecture, KEDA (Kubernetes Event-Driven Autoscaling)


Security & Compliance

RBAC (Role-Based Access Control), Content Moderation, Prompt Injection Defense, Multi-tenant Isolation, Policy Enforcement, Adversarial Testing


Data & ML Tools

DVC (Data Version Control), Weights & Biases, Model Registries, Experiment Tracking, Dataset Versioning

Operating Systems

Linux

Programming Languages

Go (Golang)
Python
Java
Node.js

Databases

PostgreSQL
MongoDB
Redis
Firestore
Elasticsearch

Einsatzorte

Einsatzorte

Heidelberg (+500km) Zürich (+50km)
Deutschland, Schweiz

möglich

Projekte

Projekte

1 year 11 months
2024-01 - 2025-11

Observability Engineering (AI/ML)

Observability Engineer (AI/ML) Prometheus Grafana Kubernetes ...
Observability Engineer (AI/ML)

Designed and implemented a fully instrumented, cloud-native observability and telemetry framework for hosted, fine-tuned, and proxied LLMs operating in enterprise-grade production environments. Delivered end-to-end visibility into AI/ML training pipelines, inference workloads, and model serving infrastructure. Developed unified tracing, logging, and metrics pipelines leveraging distributed-systems observability standards to surface granular insights across token-level usage, GPU/CPU resource saturation, container/node performance, latency distributions (P50?P99), error propagation, and emergent model-behavior ?flares.? Integrated cross-cloud observability stacks to support proactive SRE practices, automated incident triage, drift detection, and optimization of high-throughput AI workloads.

  • Architected end-to-end observability pipelines for LLM APIs and model-serving runtimes using OpenTelemetry SDKs/collectors, Prometheus exporters, and Kubernetes operators, instrumenting the full lifecycle of model training, inference, and system-level resource utilization.
  • Instrumented model endpoints, batch/stream training jobs, and inference gateways to capture high-resolution metrics such as tail latency, throughput (RPS/QPS), token-per-second performance, GPU memory fragmentation, multi-node utilization, error budgets, and anomaly detection signals via statistical and ML-based detectors.
  • Implemented automated alerting and SLO/SLA monitoring using Prometheus Alertmanager, and custom anomaly-detection pipelines to identify inference latency regressions, GPU/CPU saturation events, memory leaks, container restarts, or failed model-training runs?reducing mean time to detect (MTTD) and mean time to resolve (MTTR).
  • Conducted architectural reviews, telemetry schema standardization, and iterative refinements to reduce observability overhead, optimize scrape intervals, improve sampling strategies, and enhance distributed tracing efficiency across multi-cloud LLM deployments.
  • Collaborated with ML Ops, SRE, and platform engineering teams to integrate telemetry into CI/CD pipelines, automate environment drift detection, and enable data-driven scaling policies for training and inference clusters.

Prometheus Grafana Kubernetes Helm Python OpenTelemetry Promitor Dynatrace Go Loki Jaeger Tempo KEDA HelmArgoCD MLflow
SAP
1 year 11 months
2024-01 - 2025-11

AI/ML Platform Engineering

AI/ML Platform Engineering LLM as Backend SystemArchitektur Back-End ...
AI/ML Platform Engineering
Built a scalable GenAI platform for enterprise workloads, integrating advanced LLM-as-a-Backend capabilities with support for RAG pipelines, model routing/orchestration, fine-tuning, and content-moderation workflows. Delivered cross-language SDKs, internal CLI tools, and fully automated CI/CD pipelines to streamline AI adoption for product teams while optimizing for cost efficiency, security, and operational reliability.


Responsibilities:

  • Designed and implemented a that dynamically selected LLM backends based on latency, token cost, throughput, and task-specific quality, including fallback and failover strategies across regions/providers to meet strict SLAs.

  • Built using Azure OpenAI and Azure Cognitive Services, tuning concurrency, batching, and rate-limit handling to improve domain relevance, safety compliance, and multilingual support in enterprise applications.

  • Automated of GenAI workloads using Argo Workflows, ArgoCD, Jenkins, and GitOps-based configuration management, enabling repeatable infrastructure changes, reducing configuration drift, and achieving ~40% faster release cycles.

  • Delivered ?including Python and Go SDKs, internal CLI/automation scripts, and out-of-the-box observability integrations (Prometheus/Grafana)?that reduced integration effort, standardized access patterns to LLM services, and accelerated AI adoption across product teams.

  • Ensured by monitoring token utilization, CPU/GPU consumption (where applicable), autoscaling behavior, and model-specific SLAs; implemented per-tenant quotas, rate limiting, and governance controls to provide fair resource allocation, security, and full auditability in a multi-tenant environment.

  • Collaborated with platform and infrastructure teams to tuning GPU requests/limits, pod placement, and batching strategies to maximize GPU utilization while meeting p95/p99 latency and uptime objectives.

  • Deployed NVIDIA GPU Operator in production Kubernetes clusters to deliver reliable, automated GPU driver/tooling management and expose robust Prometheus-ready GPU telemetry.

LLM as Backend SystemArchitektur Back-End Python Prompt-Engineering Azure OpenAI RAG ArgoCD Jenkins GitOps Go Kserve Knative Kuberntes Inference Azure Devops
SAP
10 months
2023-03 - 2023-12

Agentic AI Platform & RAG Excellence Framework

GenAI Engineer Python OpenAI LangChain ...
GenAI Engineer

Designed and delivered a production-grade agentic AI platform supporting full-stack development, orchestration, deployment, and observability of LLM-driven autonomous agents across multiple enterprise business units. Implemented standardized blueprints for agent topologies, tool invocation layers, RAG/Indexing pipelines, and LLMOps workflows, ensuring horizontal scalability, fault tolerance, auditability, and regulatory alignment within a highly controlled financial-services environment. Built comprehensive evaluation, governance, and safety frameworks that accelerated organizational adoption of AI copilots and significantly reduced time-to-market for new intelligent-automation workloads.

Responsibilities:
  • Owned full-lifecycle delivery of agentic AI and ML initiatives?from problem scoping, feature engineering, and dataset curation to model development, quantitative/qualitative evaluation, deployment, and post-production monitoring.
  • Architected advanced agent systems (e.g., planner?executor, hierarchical/multi-agent, tool-augmented agents) leveraging memory modules, reflection loops, action-scoring policies, and controllability/safety constraints.
  • Designed and optimized RAG pipelines, including document ingestion, chunking heuristics, embedding generation, vector store configuration, hybrid retrieval, reranking, caching, and evaluation frameworks for precision/recall, hallucination rate, and latency SLAs.
  • ? Implemented MCP for standardized, permissioned tool integrations and orchestrated heterogeneous LLM workloads using LangChain, LlamaIndex and custom microservices for tool execution.
  • Established enterprise-grade LLMOps practices including experiment tracking (MLflow/Weights & Biases), dataset/prompt versioning (DVC/Git), CI/CD pipelines (GitHub Actions/Azure DevOps), model registries, workload autoscaling, telemetry, drift detection, and incident-response runbooks.
  • Enforced reliability, safety, and compliance controls through prompt-injection defenses, schema validation, content-moderation pipelines, differential access controls, policy enforcement layers, adversarial/red-teaming evaluations, and pre-production quality gates.

Python OpenAI LangChain Kubernetes RAG Chatbot Go (Golang) Firebase Firestore LlamaIndex Semantic Kernel Vector Databases Redis MCP AWS FastAPI async asynchronous programming
Remote
11 months
2022-05 - 2023-03

Development of a distributed orchestration system

Senior software engineer Go(lang) WebSocket OpenSearch ...
Senior software engineer
Built a custom container-orchestration and control-plane platform inspired by Kubernetes, enabling automated cloud resource provisioning for hundreds of products across thousands of tenants. The platform was engineered for horizontal scalability, high availability, and low-latency rollout pipelines.


Responsibilities:

  • Designed and implemented the core control-plane architecture?including an API Server with RBAC, Controller Manager, Scheduler-like reconciliation loops, Namespace isolation, and CRD-style resource definitions?to support secure, multi-tenant operations at scale.
  • Developed client SDKs and code-generation tools to streamline custom controller development for internal engineering teams, ensuring consistency with the platform?s declarative model.
  • Integrated WebSocket for real-time event streaming between control plane and worker nodes and utilized Redis  for distributed caching and message brokering.
  • Employed LocalStack for local AWS service emulation in CI/CD pipelines.
  • Integrated Prometheus, Alertmanager, and Grafana for end-to-end metrics instrumentation, distributed tracing, and proactive anomaly detection. Implemented service liveness/readiness probes, exporters, and dashboards to improve reliability and observability.
  • Implemented Kubernetes-native patterns: Deployments, StatefulSets, DaemonSets, custom operators.

Go(lang) WebSocket OpenSearch LocalStack Redis Prometheus Grafana AWS S3 Kubernetes RBAC Distributed Systems Webhook
SAP
Walldorf
1 year 8 months
2020-09 - 2022-04

Enablement of SAP Analytics Cloud?s SaaS offering

Senior cloud engineer Cloud Foundry Node.js Java ...
Senior cloud engineer
Developed core SaaS capabilities for SAP Analytics Cloud, including service broker, billing, and metering microservices, enabling pay-as-you-go access.
  • Built service broker for Cloud Foundry using Node.js.
  • Developed billing and metering microservices to track and charge resource usage.
  • Implemented Prometheus/Grafana monitoring to ensure system reliability.
  • Designed APIs for seamless microservice integration and external communication.
  • Ensured service scalability, supporting hundreds of concurrent tenants reliably.
Cloud Foundry Node.js Java Prometheus API gateway Redis Postgres Grafana
Walldorf
1 year 5 months
2019-04 - 2020-08

Developed cloud infra. for the SAP HANA-as-a-Service

DevOps engineer HashiCorp (Terraform/ Vault/ Consul) Ansible AWS (VPC/ EC2/ S3/ Glacier/ Cloud Watch/ API Gateway) ...
DevOps engineer
Built robust cloud infrastructure on AWS for HANA-as-a-Service, automating deployments, upgrades, and multi-region provisioning.
  • Developed Terraform and Ansible playbooks for installation and automated upgrades.
  • Built a lightweight agent to respond to Consul changes, triggering relevant playbooks.
  • Implemented APIs for customer HANA system orders.
  • Designed secure VPC architecture and automated backup/recovery workflows using AWS S3 and Glacier.
  • Reduced deployment time by 40% while maintaining compliance and security standards.
HashiCorp (Terraform/ Vault/ Consul) Ansible AWS (VPC/ EC2/ S3/ Glacier/ Cloud Watch/ API Gateway) Cloud Foundry Python Go (Golang) Bash
SAP
10 months
2018-07 - 2019-04

Development of an elastic caching microservice

Software Engineer Go(lang) Redis MongoDB ...
Software Engineer

Developed an elastic caching microservice in Go to accelerate analytical query performance for a multi-tenant analytics platform, implementing context-aware caching with user permissions, roles, and cube dimension metadata.

Implementation:
  • Built cloud-native microservice following 12-factor app principles with stateless design, externalized configuration, and graceful shutdown
  • Implemented multi-layer caching strategy using Redis (L1 in-memory cache with TTL/LRU eviction) and MongoDB (L2 persistent cache for complex query metadata)
  • Designed cache key generation algorithm incorporating RBAC permissions, tenant isolation, and OLAP cube context (dimensions, measures, filters)
  • Developed cache invalidation strategies with pub/sub patterns for real-time data updates
  • Integrated Prometheus with custom metrics (cache hit/miss ratios, query latency percentiles, eviction rates)
  • Built Grafana dashboards for real-time performance monitoring and capacity planning
Go(lang) Redis MongoDB Kubernetes Prometheus 12-factor app
SAP

Aus- und Weiterbildung

Aus- und Weiterbildung

2014 - 2017
Distributed Software Systems
TU Darmstadt (Germany)
Degree: Master of Science

Position

Position

Software and DevOps engineer with focus on cloud-native development and LLM application development.

Kompetenzen

Kompetenzen

Top-Skills

Software DevOps Cloud Node.js Cloud Foundry Kubernetes Golang AWS prometheus SystemArchitektur Container OpenAI Chatbot LangChain Software-Entwicklung cloud engineer

Schwerpunkte

AI/ML Platform Engineering & LLMOps
Fortgeschritten
Cloud-Native Observability & SRE
Experte
Distributed Systems & Platform Engineering
Fortgeschritten

AI/ML Platform Engineering & LLMOps

Deep expertise in building production-grade GenAI platforms and agentic AI systems, with comprehensive experience in LLM deployment, fine-tuning, RAG pipelines, and model orchestration. Specialized in architecting multi-tenant AI infrastructure that balances performance, cost optimization, and enterprise security requirements.


Cloud-Native Observability and SRE

Expert in designing end-to-end observability solutions for distributed systems and AI/ML workloads using OpenTelemetry, Prometheus, and Grafana. Proven ability to instrument complex environments from token-level metrics to infrastructure telemetry, enabling proactive incident management, anomaly detection, and data-driven optimization of high-throughput systems.


Distributed Systems and Platform Engineering

Strong foundation in building scalable, cloud-native platforms with expertise in Kubernetes ecosystem, control-plane architecture, and microservices orchestration. Skilled in implementing GitOps workflows, CI/CD automation, and infrastructure-as-code practices to deliver reliable, self-service platforms for enterprise-scale deployments.

Areas of Responsibility

System Architecture
Advanced
Software Engineering
Expert
DevOps
Advanced
  • Architecture and implementation of enterprise GenAI platforms supporting RAG, fine-tuning, model routing, and content moderation workflows
  • Design and deployment of observability frameworks for AI/ML systems, including distributed tracing, metrics pipelines, and SLO/SLA monitoring
  • Development of autonomous agent systems with tool integration, memory modules, and safety/governance controls
  • Building cloud-native microservices and APIs for multi-tenant SaaS offerings with focus on scalability and reliability
  • Infrastructure automation using GitOps, CI/CD pipelines, and infrastructure-as-code across AWS and Azure environments
  • Implementation of control-plane architectures for container orchestration and resource provisioning at scale
  • Establishment of LLMOps practices including experiment tracking, model versioning, drift detection, and compliance enforcement
  • Performance optimization through caching strategies, autoscaling policies, and resource utilization monitoring
  • Cross-functional collaboration with ML Ops, SRE, and platform engineering teams to accelerate AI adoption
  • Security implementation including RBAC, multi-tenant isolation, and content moderation pipelines

Products / Standards / Experience / Methods

DevOps
Advanced
Software
Expert
AWS
Advanced
OpenAI
Expert
Kubernetes
Advanced
Observability
Advanced
GenAI
Expert
Development
Advanced
Profile
As a freelance software engineer, I deliver tailored, scalable software solutions for enterprise systems. My focus spans software architecture and development, DevOps, and LLM-powered AI applications, with strong expertise in building reliable, observable, and cost-efficient distributed systems on AWS and Azure.

AI/ML & GenAI

OpenAI, Azure OpenAI, LangChain, LlamaIndex, Semantic Kernel, RAG (Retrieval-Augmented Generation), Prompt Engineering, LLM Fine-tuning, Model Inference, Vector Databases, MLflow, KServe, Knative, MCP (Model Context Protocol), Chatbot Development, Agentic AI Systems


Cloud Platforms & Services

AWS (VPC, EC2, S3, Glacier, CloudWatch, API Gateway), Azure DevOps, Azure Cognitive Services, Cloud Foundry, Multi-cloud Architecture


Container Orchestration & Infrastructure

Kubernetes, Helm, ArgoCD, Argo Workflows, Docker, StatefulSets, DaemonSets, Custom Operators, Control Plane Architecture


Observability & Monitoring

OpenTelemetry, Prometheus, Grafana, Dynatrace, Promitor, Loki, Jaeger, Tempo, Alertmanager, Distributed Tracing, Metrics Engineering, SLO/SLA Monitoring


Programming Languages

Python, Go (Golang), Node.js, Java, Bash


Data Storage & Caching

Redis, MongoDB, PostgreSQL, Firebase Firestore, Vector Databases, OpenSearch, AWS S3


DevOps & Automation

GitOps, Jenkins, Terraform, Ansible, HashiCorp (Vault, Consul, Terraform), LocalStack, CI/CD Pipelines, GitHub Actions, Infrastructure-as-Code (IaC)


Networking & Communication

REST APIs, WebSocket, API Gateway, RBAC, Service Mesh


Development Practices & Patterns

Microservices Architecture, 12-Factor App Principles, SRE Practices, LLMOps, MLOps, Multi-tenant Design, Distributed Systems, Event-Driven Architecture, KEDA (Kubernetes Event-Driven Autoscaling)


Security & Compliance

RBAC (Role-Based Access Control), Content Moderation, Prompt Injection Defense, Multi-tenant Isolation, Policy Enforcement, Adversarial Testing


Data & ML Tools

DVC (Data Version Control), Weights & Biases, Model Registries, Experiment Tracking, Dataset Versioning

Operating Systems

Linux

Programming Languages

Go (Golang)
Python
Java
Node.js
Bash

Databases

PostgreSQL
MongoDB
Redis
Firestore
Elasticsearch
