Position Summary
As Site Reliability Engineers, we are a team of hybrid systems and software engineers who take ownership of reliability, scalability, and other issues related to uptime and availability of Walmart's e-commerce/Retail and Enterprise platforms. We design and build tools to improve reliability, enable scaling, and prevent recurrence of problems in mission-critical products/services. We influence new architectures, standards, and methods for large-scale enterprise systems, and participate in the on-call rotation to secure the system from issues.
What you’ll do:
Site Reliability Engineers are hybrid systems and software engineers who are responsible for, and take ownership of, reliability, scalability, automation, and other issues related to uptime and availability of Walmart’s e-commerce/Retail and Enterprise platforms. Our goal is to build, scale, and guard the systems that delight our customers.
- Design, write and build tools to improve the reliability, latency, availability, and scalability of Walmart e-commerce/Retail and Enterprise products.
- Engender reliability and availability starting with metrics and measurements.
- Enable scaling by providing tools, developing training and/or augmenting processes.
- Build tools and automation to prevent recurrence of problems in mission-critical products/services.
- Augment existing instrumentation to build a cohesive picture of the characteristics of our systems with special attention to points of failure.
- Participate in capacity planning, demand forecasting, software performance analysis and system tuning.
- Develop a deep understanding of the numerous services and applications that come together to deliver Walmart e-commerce/Retail and Enterprise products.
- Design new monitoring tools and smart alerts that help discover failures/issues in a timely fashion, and work with engineers to identify root causes and fix issues.
- Influence, design and create new architectures, standards, and methods for large-scale enterprise systems.
- Perform root-cause analysis of complex problems involving multiple parties, networks, hardware, and software that relate to scaling and performance.
- Participate in on-call rotation.
- Secure the system from issues, be they real, perceived, or notional.
- Maintain a strong focus on collecting metrics and producing documentation that others can use to build and maintain systems.
- Take on scripting and development responsibilities.
- Work with configuration management tools such as Ansible, SaltStack, Chef, and Puppet.
- Build and drive the automation systems that maintain system health.
- Eliminate single points of failure and regularly test disaster recovery and high availability (HA).
What you’ll bring:
- 6+ years in a software development, DevOps, or SRE role.
- Experience in designing, investigating, analyzing, and troubleshooting large-scale enterprise systems.
- Methodical and systematic problem-solving approach, combined with a strong sense of ownership, initiative, and drive.
- Fluency with running services at scale; in-depth understanding of Unix systems internals and networking.
- Networking knowledge and in-depth understanding of network concepts, such as different protocols (TCP/IP, UDP, ICMP, etc.), MAC addresses, IP packets, DNS, OSI layers, and load balancing.
- Understanding of Unix/Linux systems from kernel to shell and beyond, taking in system libraries, file systems, and client-server protocols along the way. Experience administering Linux systems in a production environment.
- Programming experience in one or more of the following languages: Go, Java, Python, Ruby, Shell.
- Bachelor's Degree in Computer Science or a related field, or relevant work experience.
- Experience with distributed version control like Git or similar.
- Experience with IaaS and PaaS providers such as AWS, Azure, OpenStack, and GCP.
- Experience with containerization and container platforms (e.g., Docker, Kubernetes, Docker EE, OpenShift, Mesosphere).
- Experience with enterprise monitoring solutions like Dynatrace, AppDynamics, New Relic, Prometheus, Graphite, Grafana, Nagios, Sensu, and Splunk.
- Familiarity with continuous integration/deployment processes and tools such as Jenkins, Maven, and Nexus.
Minimum Qualifications:
Outlined below are the required minimum qualifications for this position. If none are listed, there are no minimum qualifications.
- Option 1: Bachelor's degree in computer science, computer engineering, computer information systems, software engineering, or related area and 3 years’ experience in software engineering or related area.
- Option 2: 5 years’ experience in software engineering or related area.
Preferred Qualifications:
Outlined below are the optional preferred qualifications for this position. If none are listed, there are no preferred qualifications.
- Master’s degree in Computer Science, Computer Engineering, Computer Information Systems, Software Engineering, or related area and 1 year's experience in software engineering or related area.
- Knowledge of creating inclusive digital experiences, implementing Web Content Accessibility Guidelines (WCAG) 2.2 AA standards, working with assistive technologies, and integrating digital accessibility seamlessly.
Primary Location:
680 West California Avenue, Sunnyvale, CA 94086-4834, United States of America