Engineering Lead Analyst

Citi

Mississauga

On-site

CAD 164,000 - 233,000

Full time

Today

Job summary

A global financial services company is seeking an Infra & DevOps Engineer to enhance infrastructure and DevOps practices. This hands-on role involves implementing and optimizing systems for GenAI and high-performance workloads. Ideal candidates have 6+ years of experience in Infrastructure or DevOps, strong knowledge of containerization, and proficiency in scripting languages. This role supports a collaborative environment, emphasizing CI/CD operations, automation, and emerging technologies.

Qualifications

  • 6+ years of overall experience, with 5+ years in Infrastructure or DevOps roles.
  • Experience with GPU infrastructure for GenAI is advantageous.
  • Strong knowledge of cloud environments and container technologies.

Responsibilities

  • Implement and maintain critical infrastructure components including GPU clusters.
  • Contribute to CI/CD operations and automation frameworks.
  • Monitor and troubleshoot system performance and reliability.
  • Develop automation scripts to enhance operational efficiency.
  • Support emerging technologies relevant to GenAI.

Skills

Infrastructure expertise
DevOps practices
Scripting (Python, Bash)
Cloud environments
Containerization (Docker, Kubernetes)
CI/CD processes
Problem-solving
Communication skills

Education

Bachelor's degree in computer science or related field

Tools

AWS
GCP
Terraform
Ansible
Git
ELK Stack

Job description

As an Infra & DevOps Engineer, you will join a dynamic team in the Citi Innovation Labs under the CTO organization. You will operate within NAM hours, complementing our existing team primarily based in Israel (EMEA hours). Your expertise will be vital in strengthening our infrastructure and DevOps practices, directly contributing to faster and more reliable software delivery. This role is deeply hands‑on, focusing on implementing, maintaining, and optimizing critical systems that foster innovation and support our scalable, resilient, and secure infrastructure. You will be an active team player, bringing specialized technical skills to address operational challenges, implement advanced solutions, and collaborate closely to achieve our collective goals, especially within high‑performance and GenAI environments.

Key Responsibilities
  • Core System Implementation: Implement and maintain essential infrastructure components, including specific configurations for on‑prem GPU clusters (V100/A100/H100/H200 MIG) that underpin GenAI and high‑performance workloads, ensuring operational stability.

  • CI/CD Operations & Improvement: Contribute to the efficient operation and continuous improvement of our CI/CD pipelines and automation frameworks. Leverage and contribute to our GitHub repositories to streamline development and deployment processes.

  • System Reliability & Performance: Monitor, troubleshoot, and optimize system reliability and performance across various environments. Work with the team to identify and resolve critical issues promptly, ensuring a high level of operational availability and client satisfaction.

  • Automation Development: Develop and implement automation scripts and tools to enhance operational efficiency, reduce manual effort, and improve the consistency of our infrastructure and deployment processes.

  • Emerging Technology Support: Provide hands‑on support for the deployment and ongoing operation of emerging technologies relevant to GenAI, such as NIM images, MLflow 3.x, Coder, and LLMOps infrastructure. Actively contribute to the setup and maintenance of experimentation platforms like GCP Sandbox.

  • Operational Best Practices: Adhere to and actively contribute to established operational best practices, documentation, and runbooks to ensure consistency and maintainability of our systems.

  • Team Collaboration: Work seamlessly within the team, participating in discussions, sharing insights, and collaborating with colleagues and development partners to achieve shared objectives.

Skills & Experience Required
  • 6+ years of overall work experience, including 5+ years of dedicated, hands‑on technical experience in Infrastructure, Site Reliability Engineering (SRE), or DevOps roles, with a proven ability to contribute significantly to complex operational environments.
  • Practical experience working with and optimizing GPU infrastructure for GenAI and high‑performance computing is an advantage.
  • Strong practical knowledge of cloud environments, containerization technologies (Docker, Kubernetes, OpenShift), and operational aspects of serverless computing.
  • Proficiency in scripting languages (e.g., Python, Bash) for system automation, configuration, and diagnostics.
  • Demonstrated experience in implementing and operating CI/CD pipelines, infrastructure‑as‑code principles, and automation solutions, with solid experience using GitHub.
  • Understanding of and ability to apply enterprise security best practices, compliance standards, and data privacy considerations in daily operations.
  • Solid problem‑solving skills with an ability to diagnose and resolve technical issues effectively in production environments.
  • Strong communication and interpersonal skills, fostering effective teamwork and collaboration within a diverse, global team.
  • Bachelor’s degree in computer science, engineering, or a related technical field, or equivalent practical experience.

Tech Stack Expertise
  • Cloud Platforms: AWS, GCP (Operational experience).
  • GPU Infrastructure: NVIDIA V100/A100/H100/H200 clusters, MIG (Practical operational experience).
  • Scripting & Automation: Python, Bash.
  • CI/CD Orchestration: Tekton, Harness, CI/CD for GenAI workloads.
  • Version Control & Collaboration: Git, GitHub Enterprise, Jira, Confluence.
  • Database Technologies: MongoDB/MaaS, PostgreSQL and Redis (Operational knowledge).
  • Operating Systems: Linux, Wintel (System administration experience).
  • Containerization & Orchestration: Docker, Kubernetes, OpenShift (Hands‑on operational experience).
  • Networking: Load Balancers, DNS.
  • Monitoring & Observability: ELK Stack, Prometheus, Grafana, ITRS (Practical operational experience).
  • Infrastructure as Code: Terraform, Ansible (or similar) (Practical application).
  • Developer Productivity Tools: GitHub Copilot, StackOverflow for Teams, Devin, Delphine.
  • Service Mesh: Practical operational experience.

Education
  • Bachelor’s degree/University degree or equivalent experience
  • Master’s degree preferred

Job Family Group

Technology

Job Family

Systems & Engineering

Time Type

Full time

Primary Location Full Time Salary Range

$120,800.00 - $170,800.00

Automated Processing and AI

We use automated processing, including artificial intelligence, for our legitimate business interests (or our reasonable and appropriate business purposes) to identify and align the candidate's skills and abilities with a specific job opening. Additionally, if you so choose, or consent, we can match your skills and abilities to other suitable roles at Citi.

Importantly, all our hiring processes and decisions, including determining your suitability for a role, are conducted, checked, and decided by individuals. Our automated processing and AI do not involve relying on automatic or autonomous decision‑making. Please refer to any Jurisdictional Considerations, with specific provisions for your country (where relevant) for further details.

Equal Opportunity

Citi is an equal opportunity employer, and qualified candidates will receive consideration without regard to their race, color, religion, sex, sexual orientation, gender identity, national origin, disability, status as a protected veteran, or any other characteristic protected by law.

If you are a person with a disability and need a reasonable accommodation to use our search tools and/or apply for a career opportunity, review Accessibility at Citi. View Citi’s EEO Policy Statement and the Know Your Rights poster.
