Exception - Engineering & IT

ICONMA

Newark (CA)

Remote

USD 120,000 - 160,000

Full time

10 days ago

Job summary

Join a forward-thinking company as an Exception - Engineering & IT professional, where your expertise in Site Reliability Engineering and DevOps will drive the success of cloud infrastructure. In this role, you will be at the forefront of managing AWS and OCI Cloud services, ensuring seamless uptime and performance. You'll lead the containerization of microservices, advocate for DevOps culture, and implement innovative monitoring solutions. With excellent growth opportunities and a commitment to a diverse workplace, this position is perfect for individuals eager to make a significant impact in the evolving tech landscape.

Benefits

Health Benefits
Referral Program
Growth Opportunities

Qualifications

  • 8+ years in SRE or DevOps Engineering with hands-on experience.
  • Strong skills in containerization using Docker and Kubernetes.

Responsibilities

  • Manage and enhance cloud infrastructure and services for reliability.
  • Lead containerization and deployment of microservices on Kubernetes.

Skills

Site Reliability Engineering (SRE)
DevOps Engineering
Containerization (Docker, Kubernetes)
Infrastructure as Code (IaC)
Monitoring Tools (Prometheus, Grafana)
Scripting (Python, Go, Bash)
Configuration Management (Ansible, Chef, Puppet)

Education

B.S. or M.S. in Computer Science or Engineering
AWS Cloud Certification or OCI Certification

Tools

Terraform
Kubernetes
Docker
Kafka
Spark
Presto
Airflow

Job description

Our client, an EV manufacturer, is looking for an Exception - Engineering & IT professional for a remote position.

Responsibilities:
  1. This team is the Cloud Infrastructure Team, which manages the AWS Cloud, OCI Cloud, and all critical applications. This role will contribute to work such as OpenVPN upgrades, AMQX upgrades, etc.
  2. Reliability Engineering: Own and enhance the reliability of services deployed across various cloud regions. You will proactively monitor, automate, and scale services to ensure seamless uptime and performance.
  3. Containerization & Microservices Deployment: Lead the containerization and deployment of microservices and data pipelines on Kubernetes, using Helm charts, ensuring best practices for scalability and fault tolerance.
  4. DevOps Advocacy: Foster and advocate for a DevOps culture that emphasizes automation, self-service, and engineering excellence. Enable development teams to manage and deploy applications seamlessly with minimal intervention.
  5. Performance Monitoring & Autoscaling: Implement autoscaling strategies and monitor the performance of applications and infrastructure with tools like Prometheus, Grafana, and other observability platforms.
  6. Site Reliability Engineering (SRE): Perform SRE tasks such as availability monitoring, incident response, post-mortem analysis, and preparing reliability reports for leadership and stakeholders.
  7. Tool Deployment & Maintenance: Deploy, configure, and maintain essential cloud services and tools including Kafka, Spark, Presto, Airflow, MQTT, and other microservices platforms in a cloud-native environment.
  8. Infrastructure as Code (IaC): Set up and manage cloud infrastructure using tools like Terraform, Cluster API, and other IaC frameworks, ensuring seamless provisioning, management, and scaling of resources.
  9. Automated Alerts & Recovery: Continuously enhance and automate alerting, incident detection, and recovery mechanisms for critical applications and services to minimize downtime and improve system reliability.
  10. On-Call Rotation: Participate in an on-call rotation to meet business SLAs, quickly troubleshoot and resolve issues, and document runbooks for consistent incident management processes.
  11. Agile Collaboration: Work closely with Product Owners, Engineering Managers, and cross-functional teams in Agile Scrum and Kanban workflows to deliver iterative improvements and meet evolving business needs.
  12. Impact Analysis & Incident Management: Perform impact analysis during incidents, collaborate with teams for root cause analysis, and implement preventive measures to avoid recurrence.

Requirements:
  1. B.S. or M.S. degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.
  2. 8+ years of experience in Site Reliability Engineering (SRE), DevOps Engineering, or related fields.
  3. 4+ years of hands-on experience deploying, managing, and optimizing containerized applications using Docker and Kubernetes in both public and private cloud environments (AWS, GCP, Azure, etc.).
  4. 4+ years of experience with Infrastructure-as-Code (IaC) using Terraform, Cluster API, or similar automation frameworks to manage cloud infrastructure.
  5. Experience in scripting or programming with Python, Go, Bash/Shell, or similar languages.
  6. Strong understanding of Prometheus, Grafana, and other monitoring and observability tools.
  7. Ability to effectively diagnose and resolve performance bottlenecks within AWS at the infrastructure and application layers.
  8. Configuration Management: Experience with configuration management and automation tools such as Ansible, Chef, or Puppet (preferred but not required).
  9. Degrees or certifications required: AWS Cloud Certification or OCI Certification.
  10. Minimum 7 years of experience with Cloud architecture or engineering.
  11. Minimum 7 years of experience with DevOps.
  12. Minimum 7 years of experience with Kubernetes.

Why Should You Apply?
  • Health Benefits
  • Referral Program
  • Excellent growth and advancement opportunities

As an equal opportunity employer, ICONMA provides an employment environment that supports and encourages the abilities of all persons without regard to race, color, religion, gender, sexual orientation, gender identity or expression, ethnicity, national origin, age, disability status, political affiliation, genetics, marital status, protected veteran status, or any other characteristic protected by federal, state, or local laws.
