AI Systems Engineer | Platform Reliability | Production AI, Observability & Automation
Updated on 07.04.2026
Profile
Freelancer / self-employed
Remote work
Available from: 07.04.2026
Availability: 100%
of which on-site: 100%
Python
DevOps
Kubernetes
Docker
Linux
Monitoring systems
Automation/control
FastAPI
REST
Grafana
Zabbix
Cloud Architect
Incident Management
Applied AI
Site Reliability Engineering
GCP
CI/CD
Platform Engineering
AWS
Shell-Script
SQL
PostgreSQL

Work locations

Germany, Switzerland, Austria
possible

Projects

2 years
2024-04 - now

Independent AI Systems Research and Engineering

  • Designed and built backend-first Python systems with operational concerns such as health checks, readiness, metrics, workflow control, and structured service behavior.
  • Developed Writing Agent, a configurable service with health, readiness, and metrics endpoints, creating a compact platform for deployment, observability, and service troubleshooting workflows.
  • Built Communication Agent, a stateful FastAPI and SQLAlchemy service for secure workflow handling, approvals, and reporting, reinforcing experience with backend service behavior, state management, and operational safeguards.
  • Developed NeuroTrace, a deterministic Python and NetworkX pipeline that transforms complex interaction data into structured outputs and reproducible metrics, demonstrating disciplined engineering, traceability, and debugging-oriented design.
  • Continued strengthening hands-on deployment, observability, and reliability-oriented engineering practices while transitioning toward platform and reliability-focused roles.
PlainchaosLab
Berlin
3 years 5 months
2020-12 - 2024-04

Python-based alerting and analysis pipeline

Systems Resilience Architect and Operational Analyst
  • Served as Lead Incident Commander for critical production outages in a high-availability global environment, coordinating investigation, response, and follow-up across teams.
  • Reduced root cause analysis time by 30% through more structured incident handling, clearer ownership, and improved information flow during response.
  • Created standardized incident communication procedures and response playbooks, reducing repetitive alerts by 30% and achieving a 15-minute mean time to engage for the full response team.
  • Built a custom Python-based alerting and analysis pipeline using Zabbix, Slack, and Grafana APIs, leading to 60% faster anomaly detection.
  • Prototyped a GPT-assisted post-incident analyzer that transformed discussion data into structured postmortem drafts, reducing manual documentation effort by 80%.
Adjust GmbH
Berlin
2 years
2018-03 - 2020-02

Infrastructure strategy for mission-critical national payment systems

Lead Platform Engineer
  • Led infrastructure strategy for mission-critical national payment systems and improved Linux platform availability from 84.6% to 99.5% through high-availability redesign and recovery automation.
  • Automated server hardening with a Bash framework aligned with NIST SP 800-123, reducing deployment time by 80% and supporting audit readiness.
  • Audited more than 120 Linux servers and enabled infrastructure consolidation, reclaiming capacity for new R&D usage.
  • Hired and mentored technical staff, enabling a junior engineer to take full ownership of payment infrastructure management within 12 months.
PardakhtNovin Arian Co. (PNA)
Tehran, Iran
5 years 2 months
2013-01 - 2018-02

POS infrastructure

Infrastructure Engineer
  • Supported national-scale POS infrastructure processing approximately 2 million transactions per day in a high-stakes production environment.
  • Reverse-engineered proprietary vendor systems and built custom monitoring solutions, saving $18K annually in license costs.
  • Improved system uptime from 85% to 99.5% through custom monitoring and alerting workflows.
  • Designed an incident response pipeline that reduced troubleshooting information-gathering time to 15 minutes and reduced unstructured reporting by 80%.
Pasargad Electronic Payments (PEP)
Tehran, Iran

Education and training

Software Engineering

B.Sc.

Payame Noor University (PNU) | Iran


Certification

AI Project Expert

Skills

Top skills

Python, DevOps, Kubernetes, Docker, Linux, Monitoring systems, Automation/control, FastAPI, REST, Grafana, Zabbix, Cloud Architect, Incident Management, Applied AI, Site Reliability Engineering, GCP, CI/CD, Platform Engineering, AWS, Shell-Script, SQL, PostgreSQL

Products / Standards / Experience / Methods

Profile

  • Systems reliability engineer with 15+ years across high-availability infrastructure, incident command, monitoring, automation, and platform operations. Strong record in stabilizing production systems, improving reliability, reducing detection and resolution time, and turning recurring operational pain into durable standards, tooling, and recovery workflows.
  • Hands-on in Linux operations, Python automation, observability, RCA, and operational leadership under pressure. Strong fit for roles centered on service reliability ownership, production debugging, incident response, automation, and practical platform improvement.


Core Competencies

  • Site reliability, service operations, outage coordination, RCA, postmortem improvement
  • High-availability design, recovery planning, monitoring, alerting, observability
  • Linux, Python, Bash, SQL, Docker, Kubernetes, CI/CD, Grafana, Zabbix, Ansible
  • Production debugging, failure investigation, incident leadership, runbooks, process improvement
