Intermediate Site Reliability Engineer, Database Operations
GitLab is an open-core software company that develops the AI-powered DevSecOps Platform used by more than 100,000 organizations. Our mission is to enable everyone to contribute to and co-create the software that powers our world. We embrace AI as a core productivity multiplier and expect team members to incorporate AI into daily workflows to drive efficiency, innovation, and impact.
Overview
Site Reliability Engineers (SREs) are responsible for keeping all user-facing services and other GitLab production systems running smoothly. SREs blend pragmatic operations with software craftsmanship, applying engineering principles, operational discipline, and automation to our environments and the GitLab codebase. We specialize in systems, including networking, the Linux kernel, and distributed systems.
The Database Operations team’s mission is to build, run, own and evolve the entire lifecycle of the PostgreSQL database engine for GitLab.com. The team focuses on reliability, scalability, evolution, performance, and security of the database engine and its supporting services. We build services on top of Reliability::Foundations services and cloud vendor managed products where appropriate to reduce complexity and deliver new capabilities faster. GitLab.com is one of the largest single-tenancy open-source SaaS sites on the internet and the knowledge from this team informs other engineering groups and customers running self-managed installations.
Responsibilities
- Automate every operational task as a core requirement (e.g., package updates, configuration changes across environments, automatic provisioning tools for user-facing services).
- Respond to platform emergencies, alerts, and escalations from Customer Support.
- Ensure systems exist to manage software lifecycles (e.g., operating systems) with minimal manual effort.
- Develop a fully automated multi-environment observability stack and extend it to predict capacity based on usage patterns.
- Plan for new service roll-outs, expansion and capacity management of existing services, and work with users to optimize resource consumption.
As an SRE you will
- Work on database reliability and performance for GitLab.com within the SRE team and with product teams.
- Analyze solutions and implement best practices for PostgreSQL clusters and components.
- Improve observability of database metrics to meet objectives.
- Collaborate with peer SREs to roll out changes and mitigate production incidents.
- Provide on-call support as part of a rotation.
- Provide database expertise to engineering teams (e.g., reviews of migrations, queries and performance optimizations).
- Automate database infrastructure and provide self-service tools for engineering.
- Use GitLab to run GitLab.com as a first resort and help improve the product.
- Plan the growth of GitLab’s database infrastructure.
- Design, build and maintain core database infrastructure components to support high concurrency.
- Support and debug production issues across services and stack levels.
- Monitor and alert on symptoms rather than outages; document actions for repeatability and automation.
You may be a fit for this role if you
- Have primary experience running PostgreSQL in high-growth production environments, both self-managed (VMs, Kubernetes with PostgreSQL Operators) and on DBaaS services.
- Have hands-on experience applying knowledge of PostgreSQL internals to design, build, and troubleshoot systems.
- Have experience with infrastructure automation and configuration management (Chef, Ansible, Puppet, Terraform).
- Have a solid understanding of SQL and PL/pgSQL.
- Have significant experience in a large SaaS distributed systems production environment.
- Align with our values and collaborate accordingly.
- Have excellent written and verbal English communication skills and work well asynchronously.
- Document everything to enable rapid delivery and iteration.
- Have a proactive, go-for-it attitude; when something is broken, you work to fix it.
- Have solid data modeling and data structure design skills.
- Bonus: Programming skills as a backend engineer (Ruby and/or Go).
- Bonus: Experience with ClickHouse or other modern OLAP databases.
Projects you could work on
- Review, analyze and implement solutions for database administration (backups, performance tuning).
- Build automation with Ansible, Terraform, and Chef for replica provisioning, testing, and backup monitoring.
- Provide self-service tools for engineers using GitLab ChatOps.
- Offer technical assistance on database design methodologies and tuning.
- Review database migrations and changes from engineering teams.
- Recommend query and schema changes to optimize performance.
- Respond to production incidents to mitigate database-related issues.
- Contribute to infrastructure design and scalability considerations focused on data storage.
- Plan steps to scale the database for future needs.
- Design and develop specifications for future database requirements, including capacity planning and evaluations of alternatives.
Intermediate Site Reliability Engineer Criteria
Technical
- Expertise in at least one area of SRE work, with general knowledge across areas.
- Ability to mentor junior team members.
- Contribute small improvements to the GitLab codebase to resolve issues.
Execution
- Identify projects that yield substantial cost savings or revenue.
- Suggest product architecture changes from reliability, performance and availability perspectives using data-driven approaches.
- Improve efficiency and capacity planning to reduce resource usage and cost for customers.
- Identify parts of the system that do not scale, provide immediate fixes and drive long-term resolution.
- Identify SLIs to align the team with availability and latency objectives.
Collaboration and Communication
- Thrive in a fully remote, asynchronous environment with emphasis on documentation and written communication.
- Develop domain expertise and share knowledge widely.
- Participate in blameless RCAs to prevent recurrence of incidents.
Influence and Maturity
- Lead junior SREs by example.
- Develop ownership of a major part of the infrastructure.
- Be trusted to de-escalate conflicts within the team.
Performance Indicators
Site Reliability Engineers have the following job-family performance indicators.
Country Hiring Guidelines: GitLab hires globally. All roles are remote, though location-based eligibility may apply. Our Talent Acquisition team can answer questions about location after you start the application process.
GitLab is an equal opportunity workplace and an affirmative action employer. Our recruitment and employment practices are merit-based and non-discriminatory. If you require accommodation during the interview process, please let us know.
Equal Employment Opportunity and Accessibility
GitLab is an equal opportunity workplace. If you require accessibility adjustments, please indicate during the interview process. We are committed to an accessible interview experience.