Designed and built backend-first Python systems with operational concerns such as health checks, readiness, metrics, workflow control, and structured service behavior.
Developed Writing Agent, a configurable service with health, readiness, and metrics endpoints, creating a compact platform for deployment, observability, and service troubleshooting workflows.
Built Communication Agent, a stateful FastAPI and SQLAlchemy service for secure workflow handling, approvals, and reporting, reinforcing experience with backend service behavior, state management, and operational safeguards.
Developed NeuroTrace, a deterministic Python and NetworkX pipeline that transforms complex interaction data into structured outputs and reproducible metrics, demonstrating disciplined engineering, traceability, and debugging-oriented design.
Continued strengthening hands-on deployment, observability, and reliability-oriented engineering practices while transitioning toward platform and reliability-focused roles.
PlainchaosLab
Berlin
3 years 5 months
2020-12 - 2024-04
Python-based alerting and analysis pipeline
Systems Resilience Architect and Operational Analyst
Served as Lead Incident Commander for critical production outages in a high-availability global environment, coordinating investigation, response, and follow-up across teams.
Reduced root cause analysis time by 30% through more structured incident handling, clearer ownership, and improved information flow during response.
Created standardized incident communication procedures and response playbooks, reducing repetitive alerts by 30% and achieving a 15-minute mean time to engage for the full response team.
Built a custom Python-based alerting and analysis pipeline using Zabbix, Slack, and Grafana APIs, leading to 60% faster anomaly detection.
Prototyped a GPT-assisted post-incident analyzer that transformed discussion data into structured postmortem drafts, reducing manual documentation effort by 80%.
Adjust GmbH
Berlin
2 years
2018-03 - 2020-02
Infrastructure strategy for mission-critical national payment systems
Lead Platform Engineer
Led infrastructure strategy for mission-critical national payment systems and improved Linux platform availability from 84.6% to 99.5% through high-availability redesign and recovery automation.
Automated server hardening with a Bash framework aligned with NIST SP 800-123, reducing deployment time by 80% and supporting audit readiness.
Audited more than 120 Linux servers and enabled infrastructure consolidation, reclaiming capacity for new R&D usage.
Hired and mentored technical staff, enabling a junior engineer to take full ownership of payment infrastructure management within 12 months.
PardakhtNovin Arian Co. (PNA)
Tehran, Iran
5 years 2 months
2013-01 - 2018-02
POS infrastructure
Infrastructure Engineer
Supported national-scale POS infrastructure processing approximately 2 million transactions per day in a high-stakes production environment.
Reverse-engineered proprietary vendor systems and built custom monitoring solutions, saving $18K annually in license costs.
Improved system uptime from 85% to 99.5% through custom monitoring and alerting workflows.
Designed an incident response pipeline that reduced troubleshooting information-gathering time to 15 minutes and reduced unstructured reporting by 80%.
Systems reliability engineer with 15+ years across high-availability infrastructure, incident command, monitoring, automation, and platform operations. Strong record in stabilizing production systems, improving reliability, reducing detection and resolution time, and turning recurring operational pain into durable standards, tooling, and recovery workflows.
Hands-on in Linux operations, Python automation, observability, RCA, and operational leadership under pressure. Strong fit for roles centered on service reliability ownership, production debugging, incident response, automation, and practical platform improvement.
Core Competencies
Site reliability, service operations, outage coordination, RCA, postmortem improvement
Production debugging, failure investigation, incident leadership, runbooks, process improvement
Work Locations
Germany, Switzerland, Austria possible
Projects
2 years
2024-04 - now
Independent AI Systems Research and Engineering