What you will do
- Drive operational excellence by proactively monitoring and optimizing system performance at scale;
- Ensure system stability and reliability by integrating modern testing techniques throughout the development lifecycle;
- Design, implement, and maintain backend services supporting large-scale event streaming systems using Java and Kafka;
- Establish and maintain SLAs/SLOs, track system health, and build tooling to measure and improve reliability;
- Define and maintain infrastructure using Terraform for reproducible, scalable deployments in AWS;
- Collaborate with engineers and SREs to design event-driven solutions that meet functional and non-functional requirements;
- Automate environment provisioning and configuration through GitLab CI/CD pipelines.
Must-haves
- At least 3 years of experience with Java;
- Experience with event-driven architectures handling high-throughput data streams (Kafka, SQS/SNS);
- Experience with Terraform;
- Hands-on experience with AWS;
- Proven ability to collaborate across roles and teams to design solutions that meet product requirements;
- Solid understanding of how to measure reliability through SLAs/SLOs and foster a culture driven by operational metrics;
- Upper-Intermediate English level.
Nice-to-haves
- Experience with Rust and/or Golang;
- Familiarity with big data processing technologies (e.g., Apache Spark);
- Experience building change data capture (CDC) pipelines;
- Experience with Kubernetes and ArgoCD.
AgileEngine is one of the Inc. 5000 fastest-growing companies in the US and a top-3 ranked dev shop according to Clutch. We create award-winning custom software solutions that help companies across 15+ industries change the lives of millions.
If you like a challenging environment where you’re working with the best and are encouraged to learn and experiment every day, there’s no better place — guaranteed! :)