Senior Data Engineer
Database Technology – Spark, Sybase IQ, DB2, Snowflake, Redshift, Hive, Presto, Oracle PL/SQL
Tools – AWS, Terraform, Kubernetes, Docker, Jupyter, IntelliJ, vim, Git, SVN, Apache, nginx, Splunk, SSH
· Worked primarily on the Data Lake, a petabyte-scale data warehouse built for Goldman Sachs’ unique requirements, used by hundreds of teams for time-sensitive, critical applications.
· Derived a variety of SLOs and health indicators for the lake. Optimized the lake’s ingestion pipeline, bringing ingestion time under 15 minutes for more than 90% of users.
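The “under 15 minutes for more than 90% of users” target above is an attainment-style SLO; a minimal sketch of how such a figure can be computed (the latency numbers and threshold here are illustrative, not real telemetry):

```python
# Sketch: check an ingestion-latency SLO of the form
# "ingestion completes in under 15 minutes for >90% of runs".
# The latency figures below are made up for illustration.

def slo_attainment(latencies_min, threshold_min=15.0):
    """Return the fraction of ingestion runs that finished under the threshold."""
    if not latencies_min:
        return 0.0
    within = sum(1 for t in latencies_min if t < threshold_min)
    return within / len(latencies_min)

latencies = [4.2, 9.8, 14.1, 7.5, 22.0, 11.3, 6.6, 13.9, 12.4, 5.1]
attained = slo_attainment(latencies)
print(f"SLO attainment: {attained:.0%}")  # 9 of 10 runs under 15 min -> 90%
```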
· Designed an event-driven, near-real-time SLO monitor for the lake that processes millions of events per minute.
· Wrote Terraform configurations from scratch to deploy key lake components to AWS.
· Developed and maintained a Jupyter notebook ecosystem on Kubernetes to support the SRE team.
· Wrote Jupyter notebooks to analyze telemetry metrics, develop insights, and establish SLOs. Notebooks typically pulled data with SQL or PySpark, processed it further in Pandas, and visualized results with matplotlib.
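The notebook pattern above (pull with SQL/PySpark, aggregate in Pandas, plot with matplotlib) can be sketched as follows; the column and dataset names are illustrative, and a hard-coded frame stands in for the SQL/PySpark extract:

```python
# Sketch of the telemetry-notebook pattern: extract -> aggregate -> visualize.
# Dataset and column names are illustrative.
import io
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend, suitable for scheduled notebooks
import matplotlib.pyplot as plt

# Stand-in for a SQL query result or `spark.sql(...).toPandas()`.
telemetry = pd.DataFrame({
    "dataset": ["trades", "trades", "risk", "risk"],
    "ingest_minutes": [12.0, 14.0, 8.0, 10.0],
})

# Aggregate per-dataset latency, the kind of figure an SLO review would use.
summary = telemetry.groupby("dataset")["ingest_minutes"].mean()

fig, ax = plt.subplots()
summary.plot.bar(ax=ax, title="Mean ingestion latency (minutes)")
buf = io.BytesIO()
fig.savefig(buf, format="png")  # in a notebook this would render inline
```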
· Designed an automation framework for Jupyter notebooks to schedule, cache, serve, and email them to clients.
· Implemented and maintained Prometheus metrics for high-level monitoring of the lake, feeding Grafana for visualization and PagerDuty for alerting.
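A minimal sketch of the metric pattern described above, using the official prometheus_client Python library; the metric and label names here are illustrative, not the actual ones:

```python
# Sketch: register and set a Prometheus gauge for lake ingestion latency.
# Metric and label names are illustrative.
from prometheus_client import CollectorRegistry, Gauge

registry = CollectorRegistry()
ingestion_latency = Gauge(
    "lake_ingestion_latency_minutes",
    "Latest ingestion latency per dataset",
    ["dataset"],
    registry=registry,
)
ingestion_latency.labels(dataset="trades").set(12.5)

# In production these samples are scraped (or pushed via a Pushgateway) into a
# Prometheus server, where Grafana dashboards and PagerDuty alert rules read them.
value = registry.get_sample_value(
    "lake_ingestion_latency_minutes", {"dataset": "trades"}
)
print(value)  # 12.5
```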
· Developed pipelines on Facebook’s Hadoop stack through Hive and Presto, using Facebook’s internal ETL framework.
· Maintained ad-data ingestion and delivery integrations with third parties, including coordinating data definitions and validation checks during the ETL process.
· Created APIs in Hack (PHP) for upload endpoints.
· Developed dashboards for sales lift data normalized across third parties using Tableau and internal tools.
· Maintained ETL processes for core ad-metrics datasets with wide impact across the company: fixed bugs and data-quality issues, optimized CPU and storage usage, and added columns to tables.
· Developed the Facebook status tables, a dataset exceeding 150 TB and 1.2 trillion rows, derived from Facebook’s graph structure and curated into an easily digestible Hive table used by research teams for insights, sentiment analysis, and machine learning applications.