Site Reliability Engineer

Softworld, a Kelly Company

Detroit (MI)

Remote

USD 100,000 - 125,000

Full time

14 days ago

Job summary

An innovative firm is seeking a Cloud Site Reliability Engineer to enhance monitoring and automation across systems. This role involves collaborating with development and IT operations teams to streamline processes, ensuring system reliability and performance. The ideal candidate will have extensive experience in Azure, automation tools, and cloud technologies, with a strong focus on problem-solving and continuous improvement. Join a dynamic environment where your contributions will significantly impact operational efficiency and system resilience, making a difference in the technology landscape.

Qualifications

3+ years experience as a Site Reliability Engineer in a cross-functional agile team.
Experience with IaC tools and cloud technologies.

Responsibilities

Automate processes and develop monitoring capabilities for cloud services.
Collaborate with teams to enhance system stability and performance.

Skills

Site Reliability Engineering

Azure

Terraform

GitHub

Ansible

Packer

Agile Development

Problem-solving

Cloud Technologies

On-premise to Cloud Migration

Tools

Azure DevOps

Softworld, a Kelly Company provided pay range

This range is provided by Softworld, a Kelly Company. Your actual pay will be based on your skills and experience — talk with your recruiter to learn more.

Base pay range

$70.00/hr - $80.00/hr

Job Title: Site Reliability Engineer
Job Location: Detroit, MI 48228
Onsite Requirements: Remote

Job Description:

The Cloud Site Reliability Engineer (SRE) works closely with the cloud development team, IT operations team, and business partners to streamline and implement enhanced monitoring and alerting capability across infrastructure and application layers.
By leveraging automation tools, SREs address and resolve issues, minimizing manual workload and enhancing system scalability and reliability.
Their core focus lies in standardization and automation to build and run fault-tolerant systems.
Typically, SREs possess a background in software engineering, system engineering, or system administration, coupled with substantial IT operations experience.
SREs oversee availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.

Key Accountabilities:

Write code to automate processes such as analyzing logs, testing production environments, and responding to issues.
Collaborate with agile teams and business partners to develop specifications that address problems and enhancement needs, with a focus on monitoring and metrics for operational readiness.
Identify bottlenecks in development and deployment processes and design automation solutions to mitigate them.
Develop new capabilities in displaying/monitoring/alerting on key performance indicators by tracking business transactions in real-time.
Maintain and grow knowledge of platform configuration management, monitoring of established metrics, and troubleshooting.
Provide continuous feedback to development teams on system stability, defect analysis, and system enhancements.
Design and develop alert escalation and incident response automation.
Provide production support for cloud service outages and incidents and work on both tactical and strategic plans for outage prevention.
Provide feedback on resiliency and maintainability of solutions to Cloud and App architects.
Conduct disaster recovery scenario generation and testing.
Implement sustainable, audit-ready processes that support information technology controls, including deployment execution, access management, audits, incident management, and related requirements.

Must-have Technical Skills:

At least 3 years' experience as a site reliability engineer on a cross-functional agile team working in Azure.
Working knowledge of agile development methodologies (Scrum, sprints, Kanban, etc.) and tools (Azure DevOps, etc.).
At least 3 years' hands-on experience with the IaC tools Terraform, GitHub, Ansible, and Packer.
Proven experience across testing, integration, source code management, deployment, and containerization.
Sound problem-solving skills with the ability to quickly process complex information and present it clearly and simply.
Experience with cloud technologies and services including those for Compute, Storage, Databases, and API Management.
On-premise to cloud migration experience.

Required Non-technical Soft Skills:

Strong communication skills and ability to manage complex technical decisions.
Be a team player and coach, share knowledge, and work towards building a trusted, passionate team.
Be a thinker and not an order taker. Have the courage and ability to think, understand, and question before acting.
Have the courage to push back and say 'NO' if that is the right thing to do for DTE.
Have a continuous improvement mindset and be open to constantly finding better ways of solving security issues.

This position requires candidates to be eligible to work in the United States, directly for an employer, without sponsorship now or anytime in the future.

Seniority level

Mid-Senior level

Employment type

Contract

Job function

Engineering, Information Technology, and Other

Industries

Data Infrastructure and Analytics, Information Services, and Technology, Information and Internet
