Principal Engineer - Platform Management & Observability

Graphcore

Bristol

On-site

GBP 80,000 - 120,000

Full time

Yesterday

Job summary

A leading company in AI technology, Graphcore is seeking a Principal Engineer to lead scalable management solutions for AI infrastructure. The role involves collaborating with various teams to ensure user-friendly solutions and mentoring junior engineers. This position offers a competitive salary, flexible working, and comprehensive benefits in a diverse and inclusive environment.

Benefits

Private medical insurance
Pension
Parental leave
Flexible working

Qualifications

  • 14+ years of relevant experience post-degree.
  • Proven success in designing scalable cluster management systems.
  • Experience with monitoring tools like Datadog, Dynatrace, or Splunk.

Responsibilities

  • Lead architecture, implementation, and deployment of scalable management solutions.
  • Collaborate with teams for proof-of-concepts and third-party integrations.
  • Mentor junior engineers and promote continuous learning.

Skills

Telemetry analysis
RESTful APIs
Automation tools
C/C++/Go
Python
Containerization
Linux environments
System management protocols
Fault-remediation solutions
Commercial observability solutions

Education

BSc or MSc in Computer Engineering, Computer Science, or related field

Tools

Prometheus
Grafana
OpenTelemetry
Docker
Jira
Confluence

Job description

Bristol, UK

About Graphcore

How often do you get the chance to build technology that transforms the future of humanity? Graphcore's products set the standard in AI hardware and software, gaining global recognition and industry acclaim. We are developing the next generation of AI compute systems that enable researchers, scientists, and businesses worldwide to advance their AI capabilities. Recently, Graphcore joined SoftBank Group, attracting significant investment from a leading AI investor.

Job Summary

As the Principal Engineer, you will lead the architecture, implementation, and deployment of scalable management solutions for AI infrastructure, focusing on monitoring, observability, control, and data center infrastructure management. You will collaborate with software, cloud, and customer-facing teams to develop proof-of-concepts, reference designs, and third-party integrations.

Your team will work with product and architecture teams to ensure solutions are complete, simple to deploy, and user-friendly, both internally and for customers.

Responsibilities and Duties
  1. Contribute to all phases of product development, from definition and architecture to implementation, testing, and customer support.
  2. Evaluate new technologies and innovations to anticipate customer needs and develop strategies for data center management solutions.
  3. Refine requirements with product management and customer-facing teams.
  4. Architect solutions, manage multi-component integrations, and ensure that management, monitoring, and the UI work together seamlessly.
  5. Validate architectural decisions through proof-of-concepts.
  6. Deploy solutions internally for debugging, performance analysis, benchmarking, and testing.
  7. Collaborate with development and QA teams to ensure thorough testing and quality assurance.
  8. Design and implement fault-remediation solutions at scale.
  9. Mentor junior engineers and promote continuous learning.

Skills and Experience
  1. BSc or MSc in Computer Engineering, Computer Science, or related field, or equivalent experience.
  2. Proven success in designing and implementing scalable, reliable cluster management systems, including telemetry analysis.
  3. Expertise in in-band and out-of-band management architectures and tools.
  4. Knowledge of system management protocols like Redfish and IPMI.
  5. Understanding of secure hardware monitoring and observability data collection.
  6. Experience with technologies such as Prometheus, Grafana, OpenTelemetry, ClickHouse, Kafka, and Superset.
  7. Experience with monitoring tools like Datadog, Dynatrace, or Splunk.
  8. Strong skills in designing RESTful APIs and automation tools like Ansible.
  9. Excellent communication skills.
  10. Proficiency in at least one of C/C++/Go and Python.
  11. Server programming and debugging experience.
  12. Experience with containerization (e.g., Docker) and Linux environments.
  13. Familiarity with project management tools like Jira and Confluence.
  14. 14+ years of relevant experience post-degree.
  15. Experience with system software for accelerators (GPUs, DPUs, FPGAs).
  16. Knowledge of Redfish APIs, Open Compute, and DMTF standards.
  17. Ability to prototype ideas and evaluate their value objectively.
  18. Knowledge of cloud-native deployment (SaaS/PaaS/IaaS).
  19. Understanding of data center design, networking, and monitoring best practices.
  20. Contributions to open-source communities are a plus.
  21. Working knowledge of commercial observability solutions and monitoring in hyperscale environments.
  22. Familiarity with declarative management systems.

Graphcore offers a competitive salary, flexible working, and comprehensive benefits, including private medical insurance, pension, parental leave, and more. We value diversity and inclusion and are committed to creating an equitable work environment. Note: Applicants must have the right to work in the UK; we are unable to sponsor visas at this time.
