Staff Software Engineer - Platform Engineering & SRE

Equinix

Toronto

Hybrid

CAD 100,000 - 130,000

Full time

4 days ago

Job summary

A leading company is seeking a Staff Software Engineer specializing in Platform Engineering and Site Reliability Engineering. This role involves developing and maintaining system reliability, automation, and performance. Candidates should have extensive experience in data platforms and strong programming skills, with a focus on collaboration and problem-solving.

Qualifications

  • 5+ years of experience in Data Platform, SRE, DevOps, or Systems Engineering.
  • Strong programming skills in Python, Java, or similar.

Responsibilities

  • Ensure high availability and reliability of production systems.
  • Automate deployment, scaling, and monitoring of systems.
  • Participate in incident response and troubleshooting efforts.

Skills

Python
Java
Analytical Skills
Problem Solving
Collaboration

Tools

Terraform
CloudFormation
Prometheus
Grafana
Docker
Kubernetes

Job description

Remote type: Hybrid | Location: Toronto Office | Time type: Full time | Posted: 30+ days ago | Job requisition ID: JR-149965

Who are we?

Equinix is the world’s digital infrastructure company, operating over 260 data centers across the globe. Digital leaders harness Equinix's trusted platform to bring together and interconnect foundational infrastructure at software speed. Equinix enables organizations to access all the right places, partners, and possibilities to scale with agility, speed the launch of digital services, deliver world-class experiences, and multiply their value, while supporting their sustainability goals.

Our culture is based on collaboration and the growth and development of our teams. We hire hardworking people who thrive on solving challenging problems and give them opportunities to hone new skills and try new approaches, as we grow our product portfolio with new software and network architecture solutions. We embrace diversity in thought and contribution and are committed to providing an equitable work environment that is foundational to our core values as a company and is vital to our success.

Job Description

We are looking for a highly skilled and motivated Platform Engineering & SRE Staff Engineer to join our team. As a Platform Engineering SRE, you will play a critical role in developing, maintaining, and improving the reliability, scalability, and performance of our systems, ensuring seamless user experiences. This position blends software engineering and systems engineering expertise to create automated solutions for operational challenges.

Key Responsibilities
  • Reliability and Performance: Ensure the high availability, reliability, and performance of production systems and services. Implement and maintain disaster recovery plans and procedures. Monitor and manage system health using metrics, logs, and tracing to proactively identify and resolve issues.
  • Automation and Infrastructure: Automate repetitive tasks, including deployment, scaling, monitoring, and remediation of systems. Build and maintain infrastructure as code (IaC) using tools like Terraform, CloudFormation, or similar.
  • Incident Management: Participate in incident response and troubleshooting efforts to minimize downtime and resolve issues quickly. Conduct root cause analysis for system failures and implement preventive measures to avoid future incidents. Maintain incident response playbooks and ensure efficient on-call rotations.
  • Observability and Monitoring: Design and implement monitoring solutions using tools like Prometheus, Grafana, Datadog, or similar.
  • Collaboration: Work closely with development, QA, and operations teams to ensure smooth delivery of applications. Act as a bridge between software engineering and operations, advocating for DevOps best practices. Document system configurations, processes, and procedures to ensure knowledge sharing and maintain system integrity.
  • Capacity and Scalability: Conduct capacity planning and optimize system scalability to meet future demands. Implement strategies for horizontal and vertical scaling of applications.
  • Security and Compliance: Ensure infrastructure security by implementing best practices and addressing vulnerabilities. Collaborate with the security team to meet compliance standards and audits.
  • Data Engineering & Automation: Develop and maintain scalable and efficient data pipelines. Automate data workflows for ETL/ELT processes, integrating data from various sources into data warehouses and other storage solutions. Develop and maintain solutions for data transformation and data modeling, and automate the orchestration of data processing.
  • Data Warehouse Management: Implement and maintain modern data warehouse architectures, ensuring effective data storage, retrieval, and accessibility. Work with cloud-based data warehouses (e.g., BigQuery, Snowflake, Redshift) and optimize data models for analytics and reporting. Develop and manage dimensional models, star/snowflake schemas, and data marts for operational and analytical use cases.
  • Real-time and Batch Data Processing: Build and manage real-time and batch data pipelines for high-volume data ingestion, processing, and analytics. Leverage technologies such as Apache Kafka, Apache Beam, Apache Spark, and Google Cloud Dataflow for streaming and batch processing.

Qualifications

Experience: 5+ years of experience in a Data Platform, Site Reliability Engineering, DevOps, or Systems Engineering role.

Technical Skills: Strong programming skills in languages such as Python, Java, or similar. Experience developing data ingestion pipelines and working on data governance, quality, and automation. Experience with cloud platforms such as Google Cloud, AWS, or Azure. Hands-on experience with CI/CD pipelines using tools like GitHub Actions or Jenkins. Exposure to containerization and orchestration technologies like Docker and Kubernetes. Experience with monitoring and observability tools (e.g., Prometheus, Grafana, ELK Stack).

Methodologies: Knowledge of Software Engineering, Data Modelling, and SDLC. Understanding of SRE principles, including SLIs, SLOs, and error budgets. Knowledge of incident management frameworks and root cause analysis techniques.

Soft Skills: Strong analytical and problem-solving skills. Excellent communication and collaboration abilities.

Preferred Qualifications: Familiarity with configuration management tools (e.g., Ansible, Puppet, Chef). Background in performance testing and load testing.

Additional Information

Equinix is committed to ensuring that our employment process is open to all individuals, including those with a disability. If you need assistance or an accommodation, please let us know by completing this form.

We are an Equal Employment Opportunity and, in the U.S., an Affirmative Action employer. All qualified applicants will receive consideration without regard to race, color, religion, sex, etc.

About Us

(US Applicants) Please see the “Know Your Rights” poster and our EEO Policy and Pay Transparency statements via provided links.

Equinix participates in E-Verify. For more information, visit E-Verify.

We maintain a list of preferred recruiting agencies. Please contact our HR department for inquiries.

