Freelancer: AI Systems Engineer | Platform Reliability | Production AI, Observability & Automation

Freelancer / Self-employed

Remote work

Available from: 07.04.2026

Availability: 100%

of which on-site: 100%

Top Skills

Python

DevOps

Kubernetes

Docker

Linux

Monitoring systems

Automation/Control

FastAPI

REST

Grafana

Zabbix

Cloud Architect

Incident Management

Applied AI

Site Reliability Engineering

GCP

CI/CD

Platform Engineering

AWS

Shell-Script

SQL

PostgreSQL

Work locations

Countries

Germany, Switzerland, Austria

Remote work

possible

Projects

Independent AI Systems Research and Engineering

Project details

Designed and built backend-first Python systems with operational concerns such as health checks, readiness, metrics, workflow control, and structured service behavior.
Developed Writing Agent, a configurable service with health, readiness, and metrics endpoints, creating a compact platform for deployment, observability, and service troubleshooting workflows.
Built Communication Agent, a stateful FastAPI and SQLAlchemy service for secure workflow handling, approvals, and reporting, reinforcing experience with backend service behavior, state management, and operational safeguards.
Developed NeuroTrace, a deterministic Python and NetworkX pipeline that transforms complex interaction data into structured outputs and reproducible metrics, demonstrating disciplined engineering, traceability, and debugging-oriented design.
Continued strengthening hands-on deployment, observability, and reliability-oriented engineering practices while transitioning toward platform and reliability-focused roles.

Client

PlainchaosLab

Location

Berlin

3 years 5 months

2020-12 - 2024-04

Python-based alerting and analysis pipeline

Role

Systems Resilience Architect and Operational Analyst

Project details

Served as Lead Incident Commander for critical production outages in a high-availability global environment, coordinating investigation, response, and follow-up across teams.
Reduced root cause analysis time by 30% through more structured incident handling, clearer ownership, and improved information flow during response.
Created standardized incident communication procedures and response playbooks, reducing repetitive alerts by 30% and achieving a 15-minute mean time to engage for the full response team.
Built a custom Python-based alerting and analysis pipeline using Zabbix, Slack, and Grafana APIs, leading to 60% faster anomaly detection.
Prototyped a GPT-assisted post-incident analyzer that transformed discussion data into structured postmortem drafts, reducing manual documentation effort by 80%.

Client

Adjust GmbH

Location

Berlin

2 years

2018-03 - 2020-02

Infrastructure strategy for mission-critical national payment systems

Role

Lead Platform Engineer

Project details

Led infrastructure strategy for mission-critical national payment systems and improved Linux platform availability from 84.6% to 99.5% through high-availability redesign and recovery automation.
Automated server hardening with a Bash framework aligned with NIST SP 800-123, reducing deployment time by 80% and supporting audit readiness.
Audited more than 120 Linux servers and enabled infrastructure consolidation, reclaiming capacity for new R&D usage.
Hired and mentored technical staff, enabling a junior engineer to take full ownership of payment infrastructure management within 12 months.

Client

PardakhtNovin Arian Co. (PNA)

Location

Tehran, Iran

5 years 2 months

2013-01 - 2018-02

POS infrastructure

Role

Infrastructure Engineer

Project details

Supported national-scale POS infrastructure processing approximately 2 million transactions per day in a high-stakes production environment.
Reverse-engineered proprietary vendor systems and built custom monitoring solutions, saving $18K annually in license costs.
Improved system uptime from 85% to 99.5% through custom monitoring and alerting workflows.
Designed an incident response pipeline that reduced troubleshooting information-gathering time to 15 minutes and reduced unstructured reporting by 80%.

Client

Pasargad Electronic Payments (PEP)

Location

Tehran, Iran

Education and training

Software Engineering

B.Sc.

Payame Noor University (PNU) | Iran

Certification

AI Project Expert

Skills

Products / Standards / Experience / Methods

Profile

Systems reliability engineer with 15+ years across high-availability infrastructure, incident command, monitoring, automation, and platform operations. Strong record in stabilizing production systems, improving reliability, reducing detection and resolution time, and turning recurring operational pain into durable standards, tooling, and recovery workflows.
Hands-on in Linux operations, Python automation, observability, RCA, and operational leadership under pressure. Strong fit for roles centered on service reliability ownership, production debugging, incident response, automation, and practical platform improvement.

Core Competencies

Site reliability, service operations, outage coordination, RCA, postmortem improvement
High-availability design, recovery planning, monitoring, alerting, observability
Linux, Python, Bash, SQL, Docker, Kubernetes, CI/CD, Grafana, Zabbix, Ansible
Production debugging, failure investigation, incident leadership, runbooks, process improvement
