Max Sargent
Computer Scientist, DevOps Engineer, Cloud Native.
Personal
Experience
8+ years.
Summary
A senior DevOps Engineer. I turn development teams into well-oiled machines.
Technical Interests
Containerization, container orchestration, IaC, CI/CD optimization, Go software development, adoption of DevOps culture.
DevOps is achieved when there are no development or operations silos.
Hobbies
- DIY
- Running
- Swimming
- Cycling
- Triathlon
- Golf
- Homelab (OPNsense firewall & L2 switch, Proxmox cluster, Ubiquiti APs, VLAN-segregated network)
Employment
Jan 2022 - Present
Un-named Gambling Consultancy
DevOps Engineer - Un-named Delivery Team
- Cloud Architecture for Greenfield Real-Time Events System: Wrote the IaC and Kubernetes configuration for an entire greenfield system built from scratch by a new arm of the organization, a cloud-native platform on MSK, EKS, Aurora and Debezium. Configured the public-facing load balancers, Global Accelerator and firewalls, which passed external penetration testing, along with public DNS records and subdomain delegation. The system is fully tenantized, supporting multiple production instances for canary releases as well as the required tenants, with multiple staging environments to isolate change streams. Worked with developers to integrate .NET applications with Entra using the Microsoft identity libraries, implementing flows for front-end login, service-to-service auth and user impersonation for debugging via Swagger.
- On Call & Post-Mortem: Part of an on-call rota providing 24/7 incident response for disruption to revenue streams. Several instances of troubleshooting production outages under pressure to bring systems back online within tight time frames. Ran post-mortems to understand root causes and took actions to prevent issues recurring. The current record stands at 281 days without a production outage.
- DR Plan & Implementation: Broke the system down into recovery units, the individual elements that must be restored in order to restore service. Each unit was given dedicated strategy documentation to follow in a DR situation, combined with team-specific idiosyncrasies, to form a recovery action. These actions were combined into Gantt charts to define a full DR plan for several complex microservice architectures. Created and drove internal processes for regular testing and refinement of the DR plan and documentation. Some code changes were required to actualize the plan, which took several months.
- Organizational Backup Architecture: Enhancing existing AWS Backup work to facilitate organization-level cross-account backups into a data bunker account hosting logically air-gapped backup vaults. Backups are scheduled via AWS tags on resources and accessible in a recovery account using RAM shares managed by the cyber security team.
- Team Application Provisioner: Creating a self-service system (using Terraform) for microservice scaffolding: essentially categorising the types of services we build (CQRS service, UI, API and a generic template), templating them out as dotnet templates and integrating it all into our team's own CLI tool. One command produces a walking-skeleton microservice deployed into local dev, preview, UAT and production, integrated with existing OAuth and infrastructure provisioning.
- Gitea VCS Self-Hosting: Rolling out Terraform for provisioning, configuring and operating an internal VCS for highly sensitive code. Our organisation works in tandem with several others, and data is exchanged via various protocols and infrastructure. The contracts for the shape of this data need mutual sign-off from all parties before release to schema registries and publication as consumable client packages in several languages. Before this change, critical schema management was disorganized and could impede delivery.
- Custom Go Proxies: End-to-end writing and deployment of custom Go proxies (a minimal sketch of the pattern appears after this list) with two main purposes:
- secure access to sensitive modelling workloads from developer workstations
- secure log ingestion from preview environments into central Quickwit log aggregation
- MSSQL Server to MySQL Heterogeneous Migration via AWS DMS: Working with a partner in the development team, configured and carried out the migration of our entire backend relational datastore from Microsoft SQL Server to AWS Aurora MySQL using AWS Database Migration Service.
- Vector/Quickwit/Grafana: Fallback / tier-2 log ingestion and aggregation configuration, backed by AWS S3 and a serverless Postgres database for the metadata store. Teams simply spin up one Terraform module and install one Helm chart to have an option outside of Datadog for ‘lower value logs’ with extremely long retention periods.
- RabbitMQ Cloud Migration: Migrated a legacy on-prem VM-based RabbitMQ cluster to the cloud using the Pivotal RabbitMQ operator instead of AmazonMQ, resulting in a 5x cost saving and new deployment capabilities such as blue-green infrastructure upgrades. Used an annotated Kubernetes Service, external-dns and aws-load-balancer-controller to provision a domain-named L4 load balancer for connections from on-prem thick-client applications. Utilized pod readiness gates and pod disruption budgets to ensure zero downtime when rolling out updates to the new cloud system.
- Benthos Streaming Configuration: Configured and deployed several Benthos Helm charts for myriad stream-processing tasks such as:
- mission-critical live events ingestion from third parties via the http_server input, transformation/enrichment with metadata into an internal key event, and publishing onto Kafka
- data replication from production into UAT, preview and local development environments for seamless development
- facilitating migration of ‘exhaust’ data to the analytics department from an old on-prem in-house system to Kafka via a dual-publish mechanism
- Migration to arm64: Migration of the majority of the workload to run on arm64. The process involved upgrading our on-premise CI servers to support qemu & qemu-user-static, then writing a v2 of our build pipeline that uses buildx to target both amd64 and arm64 platforms with a multi-architecture build and publishing the results to Artifactory. Around the edges of our system were some custom builds of Kubernetes operators, sidecars etc. that also needed upgrading to use buildx. After the work was complete, EC2 costs in my team were cut by around 25%; other teams then picked up the work, saving more money.
- Environment as a Service: Taking the integration testing environments to their logical conclusion, we enhanced our existing work to also allow provisioning of test environments on demand: not for running automated unit tests, but for exposing to developers to test large changesets that could destabilize our existing static test environments. This was achieved by securing the integration test environment and providing some extensions to allow external access. Secrets within the generated YAML that were previously static for local dev are now intercepted by a custom Helm post-renderer and saved in Vault; hostnames are remapped based on a string provided in the environment; the Vault operator provides a fully signed PKI using the PKI secrets engine; and ExternalDNS is leveraged to expose Ingress resources within the VPC, as well as Services exposing layer-4 connections (databases, message queues etc.), allowing developers to log in and debug infrastructure.
- Integration Testing Platform: Building on the previously mentioned local development environment, my team looked to build out an integration testing capability. No full-time QA team existed, so the solution needed to integrate seamlessly with existing local development tooling and require no expert knowledge of QA. By replatforming the local development environment from Docker Desktop to K3s/k3d, we were able to achieve portability.
- Tilt Local Development Environment: For maximum developer productivity a short feedback loop is required when making code changes. Our existing effort in local development was a set of ancient PowerShell scripts; the environment worked when conceived, but due to the number of changes going on as part of the cloud migration it was falling into disrepair. By combining Tilt, Helm, Docker Desktop (or Rancher Desktop), dnsmasq and misc other technologies, my team and I created a turnkey local development environment that runs the system end to end locally. Additionally, by leveraging the previously mentioned Helm library and integrating it with the local development environment, we reached a state where our local service configuration mirrors our live configuration. This makes getting changes into production safely much easier and reduces the strain on our fully fledged testing environments. The initial work was picked up by an internal team, enhanced, and rolled out across the organisation.
- Go Replay Prod Testing Capability: Some releases require testing with live traffic in order to verify correctness of the changes; in very modern systems this can be achieved by leveraging a service mesh. In this instance that was not an option, so as an alternative, and only in cases where it is required (not on all prod workloads), I combined a GoReplay sidecar and a Caddy reverse-proxy sidecar to provide TLS to the pod and allow traffic redirection from production to testing environments. This functionality was baked into the Helm chart library to make enabling the configuration and performing testing as simple as flicking a switch.
- Kafka Connect Scaffolding: Configuring and maintaining several Kafka Connect clusters that support critical production workloads. Debezium is used for change data capture and creating event streams from databases, with misc sink connectors for archiving external data feeds for audit purposes. Due to the immaturity of MSK Connect and my team's investment in Kubernetes, we opted to run our Kafka Connect clusters within EKS; this presented its own challenges due to the conflicting scheduling efforts of Kafka Connect and Kubernetes. I created base templates conforming to our existing microservice deployments and wrote scripts to allow connectors to be scaffolded easily.
- Championing SLOs & SLIs: The profitability of the organisation was directly tied to the availability of the system to end users within specific time windows. After a significant effort in standardizing logging, adding metrics, and enabling application profiling and tracing, we had a large amount of data about our applications' performance but no real framework for leveraging it. By introducing standard SRE practices, identifying critical end-user workflows and defining SLOs & SLIs based on those workflows, the system's availability to end users saw a large increase as we began identifying production issues before users encountered them.
- Bespoke K8s Model Operator: Wrote a custom Kubernetes operator to provide discrete autoscaling (a reconciler skeleton in this style appears after this list). Traditional metric-based autoscaling didn't fit this specific workload; it needed to be scaled instantly to a specific resource requirement. The operator interfaces with multiple places within the system to calculate the exact resource requirements, and utilizes a separate Karpenter provisioner to provide isolated capacity, with the isolation guaranteed via node tainting.
- Optimizing HPC workloads on K8s: Configuring Kubernetes CPU manager settings to exclusively bind processes to CPU cores, disabling hyperthreading via kernel parameters, then writing and executing custom performance & sizing tests with k6 to pick the right instance type for each workload. This was made more difficult by the presence of a sidecar, since CPU binding happens at the pod level.
- Datadog & OpenTelemetry Onboarding: Co-operating with the platform team to onboard my team onto Datadog, including work to standardize log output format, integrating with the .NET tracer & Datadog APM, configuring the OpenTelemetry installation in Kubernetes and enabling sidecar injection, and dashboard and alert creation. Contributing work back to the base infrastructure used for cluster provisioning.
- Active/passive HA via custom Go leadership election sidecar: Wrote a simple Go sidecar container that manages a single Kubernetes Lease object between multiple pod replicas (a minimal sketch appears after this list). The sidecars elect a single leader to assume the ‘global singleton process’ and update Service selectors to re-route traffic. This was a requirement to bring high availability to legacy services in a consistent way without rewriting large amounts of the codebase.
- Custom Traefik Middleware for JWT verification: Wrote a simple Go middleware plugin for Traefik (a skeleton of the plugin interface appears after this list). The middleware, when applied to a route within the reverse proxy, extracts the bearer JWT from the request auth header and verifies it against the JWKS provided by Okta to authenticate the request, optionally verifying against preconfigured scope claims for authorization. The middleware caches the JWKS at set intervals to reduce request overhead versus token introspection against the identity provider. The end use of this work secured sensitive model workloads prior to cloud migration.
- KEDA Rollout for event-driven scaling: Some workloads migrated to the cloud had specific resource profiles: extremely bursty, but only required within specific time windows. To facilitate this, KEDA was leveraged to bring Kubernetes nodes of a specific instance type online (0-to-1 scaling) via the result of an SQL query, and this capacity was then scaled horizontally via request-latency metrics exposed by reverse-proxy sidecars (1-to-n scaling). The result was equivalent to a serverless architecture in terms of cost, but with less cold-start effect.
- Cluster autoscaler -> Karpenter migration: Replaced traditional cluster autoscaling with Karpenter across the organization, providing faster workload scheduling when clusters need to scale and more cost-optimized scheduling when services are marked as flexible enough to run on spot instances. Involved updates to the shared Terraform code used for cluster provisioning.
- Helm Chart Library: Design & implementation of an internal Helm chart library. With hundreds of microservices deployed to Kubernetes, managing the configuration of each service becomes a chore, which leads to copy-and-paste errors when new services are created. By leveraging a library of Helm templates pre-configured to integrate with existing internal frameworks, new services are now configured quickly and with fewer errors. Furthermore, it provides an interface between DevOps and development teams where Kubernetes best practices can be inherited, as well as a future channel for sweeping changes if required.
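For the Custom Go Proxies item above, the following is a minimal sketch of the pattern: a small authenticating reverse proxy in Go. The upstream URL, listen address, certificate paths and shared-secret header are illustrative placeholders, not the production configuration.

```go
// proxy.go - minimal sketch of an authenticating reverse proxy in Go.
// UPSTREAM_URL, PROXY_TOKEN, the listen address and the TLS file paths are
// illustrative placeholders, not the real production configuration.
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"os"
)

func main() {
	// Hypothetical upstream hosting the sensitive modelling workload.
	upstream, err := url.Parse(os.Getenv("UPSTREAM_URL")) // e.g. "http://models.internal:8080"
	if err != nil || upstream.Host == "" {
		log.Fatalf("invalid UPSTREAM_URL: %v", err)
	}

	proxy := httputil.NewSingleHostReverseProxy(upstream)

	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Illustrative access check: require a shared secret before forwarding.
		if r.Header.Get("X-Proxy-Token") != os.Getenv("PROXY_TOKEN") {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		proxy.ServeHTTP(w, r)
	})

	log.Println("proxy listening on :8443")
	log.Fatal(http.ListenAndServeTLS(":8443", "tls.crt", "tls.key", handler))
}
```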
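For the Bespoke K8s Model Operator item above, the sketch below shows the general shape of a controller-runtime reconciler. It is deliberately simplified: the desired replica count is read from a hypothetical annotation rather than calculated from internal systems, and the Karpenter provisioner and node tainting are out of scope.

```go
// operator.go - skeleton of a controller-runtime reconciler in the style of the
// bespoke model operator. Simplified sketch: the desired replica count comes
// from a hypothetical annotation, not from the real internal calculation.
package main

import (
	"context"
	"strconv"

	appsv1 "k8s.io/api/apps/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/log"
)

type modelReconciler struct {
	client.Client
}

func (r *modelReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	logger := log.FromContext(ctx)

	var dep appsv1.Deployment
	if err := r.Get(ctx, req.NamespacedName, &dep); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Hypothetical annotation carrying the externally calculated replica count.
	want, err := strconv.ParseInt(dep.Annotations["models.example.com/desired-replicas"], 10, 32)
	if err != nil || want < 0 {
		return ctrl.Result{}, nil // nothing to do for unannotated deployments
	}

	replicas := int32(want)
	if dep.Spec.Replicas != nil && *dep.Spec.Replicas == replicas {
		return ctrl.Result{}, nil // already at the desired scale
	}

	dep.Spec.Replicas = &replicas
	if err := r.Update(ctx, &dep); err != nil {
		return ctrl.Result{}, err
	}
	logger.Info("scaled deployment", "name", req.NamespacedName, "replicas", replicas)
	return ctrl.Result{}, nil
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}
	if err := ctrl.NewControllerManagedBy(mgr).
		For(&appsv1.Deployment{}).
		Complete(&modelReconciler{Client: mgr.GetClient()}); err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```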
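For the leadership election sidecar item above, this is a minimal sketch of the pattern using client-go's leaderelection package: replicas compete for a single Lease and the winner patches a Service selector so traffic routes to it. The namespace, lease, Service and label names are illustrative placeholders, assumed to be injected via the downward API.

```go
// leaderelect.go - minimal sketch of the leadership-election sidecar pattern.
// "legacy-service", "legacy-service-leader" and the "pod-name" selector label
// are hypothetical names, not the production configuration.
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	podName := os.Getenv("POD_NAME") // assumed to be set via the downward API
	namespace := os.Getenv("POD_NAMESPACE")

	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "legacy-service-leader", Namespace: namespace},
		Client:     clientset.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: podName},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Point the (hypothetical) Service at this pod, which is assumed
				// to carry a pod-name label set from the downward API.
				patch := fmt.Sprintf(`{"spec":{"selector":{"pod-name":%q}}}`, podName)
				_, err := clientset.CoreV1().Services(namespace).Patch(
					ctx, "legacy-service", types.StrategicMergePatchType,
					[]byte(patch), metav1.PatchOptions{})
				if err != nil {
					fmt.Fprintln(os.Stderr, "failed to update service selector:", err)
				}
			},
			OnStoppedLeading: func() {
				// Another replica will take over; exit so the kubelet restarts
				// this sidecar in a clean standby state.
				os.Exit(0)
			},
		},
	})
}
```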
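For the Traefik middleware item above, the skeleton below follows Traefik's plugin contract (CreateConfig / New) and shows bearer-token extraction plus a periodically refreshed JWKS cache. The RS256 signature check and scope enforcement are reduced to a placeholder, and the config field names and refresh interval are assumptions rather than the production implementation.

```go
// Package jwtverify sketches the custom Traefik middleware plugin. The
// CreateConfig/New functions follow Traefik's plugin contract; the signature
// check is a placeholder and the field names are illustrative.
package jwtverify

import (
	"context"
	"encoding/json"
	"net/http"
	"strings"
	"sync"
	"time"
)

// Config is populated by Traefik from the middleware's dynamic configuration.
type Config struct {
	JWKSURL        string   `json:"jwksUrl,omitempty"`        // e.g. the Okta JWKS endpoint
	RequiredScopes []string `json:"requiredScopes,omitempty"` // optional scope claims to enforce
}

// CreateConfig returns a default configuration, as required by Traefik.
func CreateConfig() *Config {
	return &Config{}
}

type jwtVerifier struct {
	next   http.Handler
	name   string
	config *Config

	mu      sync.Mutex
	keys    []json.RawMessage // cached JWKS entries
	fetched time.Time
}

// New is the constructor Traefik calls when wiring the middleware into a route.
func New(ctx context.Context, next http.Handler, config *Config, name string) (http.Handler, error) {
	return &jwtVerifier{next: next, name: name, config: config}, nil
}

func (j *jwtVerifier) ServeHTTP(rw http.ResponseWriter, req *http.Request) {
	auth := req.Header.Get("Authorization")
	if !strings.HasPrefix(auth, "Bearer ") {
		http.Error(rw, "missing bearer token", http.StatusUnauthorized)
		return
	}
	token := strings.TrimPrefix(auth, "Bearer ")

	j.refreshJWKSIfStale()

	// Placeholder: a real implementation selects the JWK by kid, verifies the
	// RS256 signature against the cached JWKS and enforces RequiredScopes.
	if len(strings.Split(token, ".")) != 3 {
		http.Error(rw, "invalid token", http.StatusUnauthorized)
		return
	}
	j.next.ServeHTTP(rw, req)
}

// refreshJWKSIfStale re-fetches the JWKS at a fixed interval so that requests
// avoid per-request token introspection against the identity provider.
func (j *jwtVerifier) refreshJWKSIfStale() {
	j.mu.Lock()
	defer j.mu.Unlock()
	if time.Since(j.fetched) < 10*time.Minute {
		return
	}
	resp, err := http.Get(j.config.JWKSURL)
	if err != nil {
		return // keep serving with previously cached keys
	}
	defer resp.Body.Close()
	var doc struct {
		Keys []json.RawMessage `json:"keys"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&doc); err == nil {
		j.keys = doc.Keys
		j.fetched = time.Now()
	}
}
```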
Dec 2020 - Jan 2022
Verint Systems
Associate DevOps Engineer
Work Assist - A cloud-native, extensible, real-time notification system built on Apache Kafka and Redis Pub/Sub.
- Deployment Architecture: Design of the application's Kubernetes configuration, including basic Ingresses, Services, Deployments and ConfigMaps. Utilization of more complex patterns such as ingress controllers (nginx), mutating admission controllers (Vault agent injector) and horizontal pod autoscalers that leverage Prometheus for custom metrics. Creating Terraform modules and wrapping them in Docker containers for run-anywhere infrastructure provisioning of services such as RDS (MySQL), ElastiCache (Redis) and MSK (Kafka).
- Ground-up CI/CD Design & Implementation: Creation of the end-to-end CI/CD systems for Work Assist. Key characteristics being a short build time and full adoption of GitOps, allowing the entire application to be deployed to different AWS regions with the change of a single properties file and a git branch operation. Shifting all quality and security analysis left to a develop branch and protecting the master branch so that only 100% verified changes make it into any cloud environment.
- Prod Operations: Attending CAB meetings to present changes to production environments, scheduling those changes for go-live and then delivering them within the required timeframe. Driving Work Assist forward as the first full-stack application deployed on the platform. Performing all initial deployments and subsequent updates.
- Local Development Environment: Creating a local development environment with production parity using kustomize, KinD, Helm, Vagrant and bash, providing short and effective local development cycles that allow verification of the behaviour of a collection of microservices in a Kubernetes cluster, rather than two Docker containers talking to each other on localhost.
SCP (Single Cloud Platform) - Poly-cloud platform for internal cloud-native projects.
- Security: Called up as part of a cross-functional team to deliver critical security fixes ready for go-live. Reworking parts of the platform to comply with PII and PCI regulations and rolling these changes out to all regions in production.
- Innersource: Contributing infrastructure components back to the wider engineering organization in a controlled way via innersource. Championing innersource practices as the company pivots more towards cloud adoption. Giving demonstrations to large audiences regarding new enhancements to existing systems and best practices for DevOps as teams move to cloud development.
TheRM - Building, enhancing and maintaining CI/CD infrastructure for many development teams.
- CI/CD: Supporting and enhancing 50+ ‘legacy’ CI/CD pipelines that build projects utilizing several different languages, frameworks and toolchains. Driving forward efforts to increase the overall pipeline success rate: within 3 months the average build success rate had increased from 76.4% to 89.1%. Creating Jenkins libraries to provide common functionality across all pipelines and reduce technical debt. Acting as first point of contact for CI/CD issues for ~100 developers across 10+ dev teams. Automating migration of source repositories from legacy Phabricator/Gitolite to GitHub.
- Mentoring: Writing requirements for, supporting and helping deliver successful intern projects. Acting as a mentor for new hires when it comes to Jenkins, Git, SonarQube and other DevOps tools. Upskilling team members in cloud technologies such as Docker, Kubernetes and Terraform.
Sep 2019 - Dec 2020
DXC Technology
Graduate Software Engineer
Design Office
- Requirements, design and implementation of an environmental monitoring solution for secure on-site server rooms. Python, REST, RPi.
- Regular creation of RPM packages for the patching, wrapping and deployment of third-party and internal libraries.
- Full manual integration test cycles.
Development
- Bug fixes and enhancements on the custom RHEL distribution that the application is built on top of. Both RHEL 5 and RHEL 7.
- Code review, integration testing and release process for custom RHEL distribution.
- Maintaining and upgrading secure gateway networking hardware, for example firewall migration, server configuration and switch configuration.
- Building out new CI/CD systems to support a legacy code base, for example creating a custom clearcase plugin for Jenkins.
Installation
- Travelling to customer sites and performing system installations, upgrades and migrations.
Sep 2018 - Feb 2019
University of Portsmouth
Software Engineering Teaching Assistant
- Teaching weekly classes of 10-15 students different parts of the ‘Introduction to Software Engineering’ module.
- Running weekly drop-ins to provide technical aid, mediate group-work issues and give documentation feedback.
- Attending monthly 1-to-1s with the module coordinator to provide feedback about student performance.
Jul 2017 - Sep 2018
Verint Systems
Intern DevOps Engineer
- Daily troubleshooting of existing CI/CD systems to keep developers working. Jenkins, Sonarqube, Artifactory, Git, Gradle, npm etc.
- Creation of new CI/CD systems for new projects or when other companies were acquired. Jenkins, Groovy, Bash etc.
- Driving adoption of Jenkins pipeline as code, stabilising CI/CD systems across the organisation by tracking pipelines within SCM.
- Breaking monolithic legacy C++ pipelines into smaller faster pipelines via creation of Nuget packages, reducing CI cycle time by 66%.
- Working with other interns to create a collection of JS widgets for dashboard reporting of biometric data; this work was included in the product.
- Creating a utility for importing Avaya Call Manager extension information directly into the product.
- Winning a hackathon by shifting static analysis CI stages left, creating a SonarQube bot for Phabricator.
- Maintaining WiX installers and resolving high-priority customer bugs related to them.
Technology
Docker
Several years experience using Docker.
Comfortable containerizing applications, securing containers, shipping containers to production.
I like containers because they are even more versatile than people think: you can use them to remove dependencies from almost any CI/CD environment and create extremely useful tools using executable images.
Kubernetes
Working with Kubernetes daily for 3 years, I have a solid understanding of the fundamentals and intermediate areas, and have ventured into expert topics such as writing controllers.
Terraform
Using Terraform daily.
Comfortable writing HCL for provisioning AWS services. Can combine multiple services into Terraform modules for provisioning architecturally defined sections of infrastructure.
IaC is another critical piece of a successful cloud native project.
Flux
Using Flux daily.
AWS
AWS is the only cloud service provider I have worked with in a professional capacity; I also have personal experience with GCP.
Bash & Linux
5+ years writing bash scripts to carry out CI/CD tasks.
Comfortable operating from only the command line in all major Linux distributions, my homelab PC runs Fedora.
Jenkins
5+ years writing Jenkins pipeline as code to carry out CI/CD tasks.
Git
An effective Git administrator; can operate Git via the CLI and act as a source of guidance for developers.
Familiar with implementing GitOps, maintaining all application config in SCM.
Education
2015 - 2019
University of Portsmouth
- BSc (Hons) Computer Science (1st Class, GPA 4.11)
2012 - 2014
Collingwood College, Surrey
- Computing, Maths, Physics
2007 - 2012
Collingwood College, Surrey
- Computing, Maths, Sciences, Electronics