Field Reliability Engineer

Cerebras

Software Engineering
Sunnyvale, CA, USA
USD 150k-250k / year + Equity
Posted on Dec 11, 2025

Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and empowers machine learning users to effortlessly run large-scale ML applications, without the hassle of managing hundreds of GPUs or TPUs.

Cerebras' current customers include global corporations across multiple industries, national labs, and top-tier healthcare systems. In January, we announced a multi-year, multi-million-dollar partnership with Mayo Clinic, underscoring our commitment to transforming AI applications across various fields. In August, we launched Cerebras Inference, the fastest Generative AI inference solution in the world, over 10 times faster than GPU-based hyperscale cloud inference services.

About The Role

Quality, reliability, and uptime are foundational to scaling Cerebras' systems and impact. We are looking for engineers who are passionate about diagnosing complex field failures, extracting insights from large-scale telemetry and service datasets, and partnering across hardware, software, operations, and supply chain teams to improve reliability at fleet scale. This role blends deep engineering domain knowledge with data analytics and reliability statistics to drive continuous improvement across Cerebras’ growing deployed base.

Responsibilities

  • Use reliability statistics (e.g., Weibull and other parametric/non-parametric survival models) to identify trends, quantify risks, and assess fleet-level performance of Cerebras’ datacenter compute hardware.
  • Lead physics-of-failure–based root-cause investigations using telemetry, log data, stress/usage analysis, and engineering intuition.
  • Build and maintain statistical analyses of large-scale data (e.g., event logs, thermal/power telemetry, workload patterns).
  • Develop reliability forecasts to inform design decisions, manufacturing quality, capacity planning, service readiness, and supply chain strategy.
  • Build warranty cost and failure-forecast models by integrating failure rates, usage profiles, reliability statistics, and component risk factors.
  • Analyze real-world stress, workload, thermal, and environmental conditions to refine design requirements, qualification plans, and reliability tests.
  • Partner cross-functionally to prioritize issues, align mitigations, drive corrective actions, and turn learnings into design/process guidelines to prevent issue recurrence.

Skills & Qualifications

Required

  • Bachelor’s degree in Electrical Engineering, Materials Science, Mechanical Engineering, or a related field.
  • 5+ years of industry experience in reliability engineering, hardware quality, or field failure analysis.
  • Strong proficiency in applied statistics and reliability methods (e.g., Weibull and survival analysis, accelerated aging models).
  • Experience applying Weibull analysis and fleet-scale failure modeling to drive reliability priorities and quantify risk.
  • Working knowledge of Python and SQL for data extraction, cleaning, time-series analysis, reliability modeling, and visualization.
  • Demonstrated ability to build structured problem-solving approaches and lead cross-functional teams through complex root-cause investigations.
  • Excellent communication skills, with the ability to distill complex data and engineering concepts into clear, concise insights for technical and executive audiences.

Preferred

  • Physics-of-failure knowledge related to datacenter compute: thermal cycling, solder/interconnect fatigue, power electronics degradation, connector reliability, and cooling system failure modes.
  • Familiarity with the design and manufacturing process for IC packaging, server hardware, and PCBA.
  • Understanding of datacenter operating conditions: airflow, thermal management, power quality, workload variation, and system-level interactions.
  • Experience analyzing large-scale system telemetry, preferably from instrumented hardware fleets.

The base salary range for this position is $150,000 to $250,000 annually. Actual compensation may include bonus and equity, and will be determined based on factors such as experience, skills, and qualifications.

Why Join Cerebras

People who are serious about software make their own hardware. At Cerebras we have built a breakthrough architecture that is unlocking new opportunities for the AI industry. With dozens of model releases and rapid growth, we’ve reached an inflection point in our business. Members of our team tell us there are five main reasons they joined Cerebras:

  1. Build a breakthrough AI platform beyond the constraints of the GPU.
  2. Publish and open source their cutting-edge AI research.
  3. Work on one of the fastest AI supercomputers in the world.
  4. Enjoy job stability with startup vitality.
  5. Enjoy a simple, non-corporate work culture that respects individual beliefs.

Read our blog: Five Reasons to Join Cerebras in 2025.

Apply today and join the forefront of groundbreaking advancements in AI!


Cerebras Systems is committed to creating an equal and diverse environment and is proud to be an equal opportunity employer. We celebrate different backgrounds, perspectives, and skills. We believe inclusive teams build better products and companies. We try every day to build a work environment that empowers people to do their best work through continuous learning, growth and support of those around them.

