Senior Data Engineer with cloud experience and a background as a software developer. Particular focus on the Apache tool stack.
Updated on 11.06.2024
Profile
Freelancer / self-employed
Remote work
Available from: 01.07.2024
Availability: 100%
of which on-site: 5%
Airflow
Python
Kubernetes
Flink
Docker
Terraform
AWS
GCP
TDD
Apache Spark
Java
SQL
Spring
Spring MVC
Machine Learning
Data Warehouse
Streaming
Redis
German
Native
English
Fluent
French
Basic

Work Locations

Germany, Switzerland, Austria
possible

Projects

7 months
2022-06 - 2022-12

Cluster Migration of Internal Data Warehouse

Senior Data Engineer

As data volumes at eCommerce companies continue to grow and the number of data consumers within the organization increases, old infrastructure can fail to keep up with the challenges. In this particular case, the computing and warehousing cluster additionally had to stay on-premise for data-security reasons. After new cluster infrastructure had been provided by an external provider, all data warehouse and computing logic had to be migrated from the old infrastructure to the new one. Additional challenges were maintaining backwards compatibility of the migrated processes at all times and adhering to strict security standards.

  • Migration and deployment of 30+ Airflow DAGs with 20-50 tasks each on the new infrastructure
  • Co-development of a Python client library for Apache Livy that is used by 100+ Airflow tasks
  • Deployment of 20+ Apache Hive databases with 10-50 tables each across three data warehouse layers via Ansible
  • Code review of 5-10 merge requests per week
Apache Airflow Python Apache Hive Apache Spark PySpark Apache Livy Apache Hadoop Ansible GitLab CI Jenkins Apache Knox Scrum Jira Confluence
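As an illustrative sketch of the Livy client work above (a hypothetical stand-in, not the actual project library), a minimal helper can assemble the JSON body that Livy's REST batch endpoint (`POST /batches`) accepts:

```python
import json

def build_livy_batch(file, class_name=None, args=None, conf=None):
    """Build the request body for Livy's POST /batches endpoint.

    Only `file` is required; the other fields are optional Livy
    batch parameters. This is a sketch, not production code.
    """
    payload = {"file": file}
    if class_name:
        payload["className"] = class_name
    if args:
        payload["args"] = list(args)
    if conf:
        payload["conf"] = dict(conf)
    return payload

# Hypothetical job location and settings for illustration.
body = build_livy_batch(
    "s3://bucket/jobs/etl.py",
    args=["--date", "2022-06-01"],
    conf={"spark.executor.memory": "4g"},
)
print(json.dumps(body, indent=2))
```

An actual client library would add HTTP submission and status polling on top of payload construction like this.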
2 years 5 months
2020-02 - 2022-06

Platform for Real Time Fraud Detection

Lead Developer, Architect

To prevent financial and reputational loss on eCommerce platforms, an automated security system is needed that can detect fraud patterns in the online shop. The software, written in Java with Apache Flink, should be able to scale out over multiple shop systems and data sources. Further requirements are monitoring traffic in real time and incorporating expert knowledge alongside machine learning and Artificial Intelligence (A.I.) models. The software is deployed and operated on the customer's cloud environment using modern Continuous Integration (CI) and DevOps principles.

  • Led the design of the platform; Technical Lead for a team of 5 developers
  • Implementation of a proof of concept in Java, 80% of whose code made it into the first product iteration
  • Prototyping of two end-to-end MLOps workflows with MLflow and AWS SageMaker
  • Successful deployment and zero-downtime operations on customer premises at around 15 million events per day
  • Design of a cloud-based testing environment that can be brought up in less than 15 minutes (Infrastructure as Code) and handle up to 10 times the production workload

Java JUnit Apache Maven Apache Flink Apache Kafka Redis Terraform AWS EKS AWS CloudFormation Kubernetes Helm Docker Datadog GitLab CI MLflow AWS SageMaker scikit-learn AWS S3 AWS RDS Trello
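The "expert knowledge" side of such a platform typically boils down to rules evaluated over event streams. As a toy, language-agnostic sketch (in Python rather than the project's Java/Flink, and not the actual product logic), a sliding-window rate rule might look like this:

```python
from collections import deque

class RateRule:
    """Toy sliding-window rule: flag an account that produces more than
    `limit` events within `window` seconds. Illustrates the kind of
    expert-knowledge rule combined with ML models; purely hypothetical.
    """
    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.events = {}  # account id -> deque of event timestamps

    def observe(self, account, ts):
        q = self.events.setdefault(account, deque())
        q.append(ts)
        # Drop timestamps that have left the sliding window.
        while q and ts - q[0] > self.window:
            q.popleft()
        return len(q) > self.limit  # True = suspicious

rule = RateRule(limit=3, window=60)
for t in [0, 10, 20, 30]:
    flagged = rule.observe("acct-1", t)
print(flagged)  # prints True: the fourth event within 60s exceeds the limit
```

In a real streaming engine this per-key windowing would be handled by the framework's keyed window operators rather than hand-rolled state.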
IT Consultancy, Internal Product Development
4 years 10 months
2017-03 - 2021-12

ETL Pipeline Architecture with Apache Airflow and Kubernetes

Data Engineer

A data-driven company needs reliable and scalable infrastructure as a key component of corporate decision making. Engineers as well as analysts need to be able to create ETL processes, Artificial Intelligence (A.I.) jobs, and ad-hoc reports without consulting a data engineer. The company's data architecture needs to provide scalability, a clear separation between testing and production, and ease of use. Modern DevOps practices like Continuous Integration (CI) and Infrastructure as Code need to be employed across the whole infrastructure.

  • Led the conception of a cloud-based infrastructure based on the above requirements
  • Initial training of 5 developers and onboarding of more than 10 developers since
  • Initial setup and operation of Apache Airflow with initially ca. 10 jobs, scaling up to more than 100 regularly scheduled jobs at present

Apache Airflow Kubernetes Docker AWS EKS AWS EC2 AWS IAM AWS S3 AWS EMR AWS RDS GitLab CI Scrum Jira Confluence
Multichannel Retailer, Furniture
4 years 11 months
2017-02 - 2021-12

A/B Testing Platform

Data Scientist, Lead Developer

To enable an eCommerce organization to become data-driven, there must be (among other things) a framework for comparing different versions of the website against each other. Many members of the organization and its departments need to be able to create and conduct experiments without the assistance of a data engineer. Another important factor for the framework was the use of Bayesian statistics.

  • Led the conception of the testing framework, including randomization logic, statistical modelling, and graphical presentation in the frontend
  • Implementation of a proof of concept for the statistical engine
  • Implementation of production code for the frontend, backend, and statistical engine
  • Training of stakeholders from 3 different departments in the methodology and statistical background of A/B testing
Python PyMC3 SciPy Apache Spark PySpark Apache Airflow Docker Jenkins Kubernetes VueJS Redshift Scrum Jira Confluence
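The core of a Bayesian A/B comparison can be sketched with Beta posteriors and Monte Carlo sampling using only the standard library. This is an illustrative sketch with made-up counts, not the project's actual statistical engine (which used PyMC3):

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, samples=20000, seed=42):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors.

    With a uniform prior, each conversion rate has posterior
    Beta(conversions + 1, non-conversions + 1); we sample both
    posteriors and count how often variant B wins.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(samples):
        rate_a = rng.betavariate(conv_a + 1, n_a - conv_a + 1)
        rate_b = rng.betavariate(conv_b + 1, n_b - conv_b + 1)
        if rate_b > rate_a:
            wins += 1
    return wins / samples

# Hypothetical experiment: B (160/1000) converts better than A (120/1000).
print(prob_b_beats_a(120, 1000, 160, 1000))
```

Unlike a frequentist p-value, this quantity ("probability that B is better") is directly interpretable by non-statistician stakeholders, which is one common reason to prefer the Bayesian framing for such a platform.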
Multichannel Retailer, Furniture
3 years 1 month
2018-06 - 2021-06

Webtracking Event Pipeline with Snowplow in AWS

Senior Data Engineer

For an eCommerce platform it is crucial to have a detailed picture of customer behaviour, either in real time or from the data warehouse, on which business decisions can be based. This requires a flexible, scalable, field-tested solution that can run in the cloud. Additionally, all browser events need custom enrichment with business information from the backend to provide the necessary context, e.g. for "Add to Cart" events. The webtracking pipeline is managed using modern DevOps principles: Continuous Integration (CI), zero-downtime deployments, and Infrastructure as Code.

  • Integration of the Snowplow event pipeline into the cloud-based shop architecture
  • Day-to-day operations of the event pipeline at ca. 4 million events per day
  • Co-engineering of a custom enrichment in the webshop backend (ca. 1000+ lines of code) and handover of ownership to the backend team
  • Setup of custom real-time event monitoring (< 1s latency) with Elasticsearch and Kibana
  • Setup of custom scheduling and deployment processes for 5 components of the Snowplow event pipeline

Snowplow Kubernetes AWS EMR AWS EKS AWS EC2 AWS Kinesis AWS Redshift Apache Airflow Kibana Elasticsearch NodeJS GitLab CI AWS RDS Scala Scrum Jira Confluence
Multichannel Retailer, Furniture
1 year 3 months
2018-02 - 2019-04

Product Recommendation Engines: Collaborative Filtering and Item Similarity with Neural Nets

Data Scientist, Data Engineer

To enrich the customer's shopping experience and drive additional sales, the eCommerce platform should be able to recommend additional products to customers. Two orthogonal strategies are employed: product similarity based on neural network embeddings, and collaborative filtering based on user behaviour. Additionally, performance monitoring for the recommendations is needed.

  • Productionized both models based on proofs of concept by an ML engineer, including data acquisition, model runs, and data output
  • Scheduling and operations of the productionized models, spanning 3 different code bases and more than 5 regularly scheduled jobs
  • Operationalization of 10+ performance metrics across 5 dashboards for stakeholders
TensorFlow Keras scikit-learn Python pandas AWS EMR Java Ant Spring hybris Apache Mahout AWS Redshift Apache Airflow Apache Superset Scrum Confluence Jira
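Embedding-based item similarity ultimately reduces to comparing vectors. As a minimal, stdlib-only sketch (with hypothetical toy vectors, not the production model's embeddings), cosine similarity captures the idea:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 4-dimensional product embeddings for illustration.
sofa = [0.9, 0.1, 0.3, 0.0]
couch = [0.8, 0.2, 0.4, 0.1]
lamp = [0.1, 0.9, 0.0, 0.7]

print(cosine_similarity(sofa, couch))  # similar items score near 1
print(cosine_similarity(sofa, lamp))   # dissimilar items score lower
```

In production, such pairwise scores are typically precomputed in batch over the full catalogue (e.g. on Spark/EMR) and served from a fast store rather than computed per request.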
Multichannel Retailer, Furniture

Education and Training

2003 - 2008

Christian-Albrechts-Universität zu Kiel, Germany

Degree: Magister / Master of Arts


Focus:

  • Major: Philosophy
  • Minors: Musicology, Computer Science


2002

Gymnasium Winsen/Luhe, Germany

Abitur


Online Courses

2019

Coursera

  • DeepLearning.AI Deep Learning Specialization
  • DeepLearning.AI TensorFlow Developer
  • Probabilistic Graphical Models: Representation


2014

Coursera

  • Machine Learning


Certificates

2022

Coursera

  • Machine Learning Engineering for Production (MLOps)


2019

Coursera

  • DeepLearning.AI TensorFlow Developer
  • DeepLearning.AI Deep Learning


2013

  • IT agile - Scrum Master

Position

  • Data Engineering
  • MLOps

Competencies

Top-Skills

Airflow Python Kubernetes Flink Docker Terraform AWS GCP TDD Apache Spark Java SQL Spring Spring MVC Machine Learning Data Warehouse Streaming Redis

Products / Standards / Experience / Methods

Frameworks

Python:

  • pandas
  • Jupyter
  • numpy
  • matplotlib
  • flask
  • scikit-learn
  • keras
  • Tensorflow
  • Apache airflow
  • pySpark
  • pyMC3 


Java:

  • Spring
  • JUnit
  • Mockito
  • maven
  • ant
  • hybris 


JavaScript:

  • NodeJS
  • ExpressJS
  • VueJS
  • ChartJS


Cloud DevOps

  • AWS
  • kubernetes
  • helm
  • docker
  • terraform
  • Gitlab CI
  • Jenkins
  • Apache airflow
  • Datadog
  • Hadoop (HDFS)
  • AWS EKS
  • AWS EMR
  • AWS EC2
  • AWS Cloudformation
  • AWS Secrets Manager
  • AWS RDS
  • AWS S3
  • GCP


Machine Learning

  • Tensorflow
  • PyTorch
  • keras
  • scikit-learn
  • pyMC3
  • MLflow
  • AWS Sagemaker
  • LakeFS
  • Pinecone


Streaming

  • Apache Spark
  • Apache Flink
  • Apache Kafka
  • Amazon Kinesis
  • snowplow


Engineering Concepts

  • Object Oriented Programming
  • Test Driven Development (TDD)
  • Functional Programming
  • Domain Driven Design (DDD)
  • Clean Code


Security

  • ssh
  • Snyk
  • kerberos
  • Apache Knox
  • AWS IAM
  • VPN


Agile Concepts and Tools

  • Scrum (Certified Scrum Master)
  • Kanban
  • Jira
  • Confluence
  • Trello
  • Redmine


Software

  • IntelliJ Idea
  • PyCharm
  • vim
  • tmux
  • bash
  • fish


Platforms

  • Linux
  • macOS


Experience

2020 - today

Role: Team Lead Data Engineering / Data Science 

Customer: Neuland - Büro für Informatik


2017 - 2020

Role: Data Engineer / Data Scientist 

Customer: Neuland - Büro für Informatik


2015 - 2017

Role: Back End Developer 

Customer: Neuland - Büro für Informatik


2012 - 2015 

Role: Project Manager 

Customer: Neuland - Büro für Informatik


2012 - 2012

Role: Management Assistant to the CTO 

Customer: OXID eSales


2010 - 2012 

Role: Public Relations Consultant 

Customer: rheinfaktor

Programming Languages

Python
Professional Usage
Java
Professional Usage
SQL
Professional Usage
JavaScript
Professional Usage
bash
Professional Usage
Lisp
Basic Knowledge
Haskell
Basic Knowledge
R
Basic Knowledge
Octave
Basic Knowledge
C
Basic Knowledge

Databases

MySQL
PostgreSQL
Redis
AWS Redshift
Cassandra
AWS Athena
Apache Hive
Apache Solr
elasticsearch
