Backend Engineer AI Reliability

About Us

explai is a fast-moving Berlin AI startup on a mission to simplify data analytics for non-experts by building agents that automate data science workflows with deep business knowledge. We incorporated in 2024, launched our public sandbox in spring 2025, ran several proofs of value by summer, and are now ramping up our first paid customer installations. We work remotely today but plan to open our Berlin office in 2026. We are bright, no-nonsense problem-solvers who move fast with high autonomy, minimal meetings, and maximum mutual respect. Don’t come for the money, though we pay fairly. You will work hard (we ship almost daily), learn a ton, and make friends for life.

The Role

Joining a team of five, you will touch every line of our backend code. Expect a steep learning curve as you work on a multi-agent application built on a modern tech stack including GCP/AWS, LangSmith/Langfuse, the Python data science stack, Postgres, Snowflake, Docker, Redis, LangGraph, FastAPI, React/Next.js, Ray, and Spark. As we ramp up our first enterprise customers, you will focus on making our solution reliable in production on both the software and the AI side.

What You'll Do

  • Build monitoring and observability tools to track agent performance, decision quality, and system health across distributed deployments.
  • Design, implement, and manage CI/CD pipelines to enable smooth code integration, automated testing, and zero-downtime deployments across multiple environments.
  • Monitor, optimize, and scale cloud infrastructure (AWS/GCP/Azure) with a focus on high availability, cost efficiency, and security best practices.
  • Develop testing methodologies for non-deterministic AI systems, including adversarial testing and edge case discovery.
  • Optimize system performance to ensure agents can handle enterprise-scale workloads with minimal latency.
  • Collaborate with ML engineers to improve model reliability and reduce hallucinations.
  • Streamline data ingestion, including quality safeguards, so that it delivers reliable input for data science.

What We're Looking For

  • 5+ years of experience in reliability engineering, DevOps, or similar production systems roles
  • Strong Python skills with experience in cloud computing and microservices architecture
  • Experience with ML/AI systems in production, understanding the unique challenges of non-deterministic systems
  • Deep knowledge of monitoring tools and cloud platforms (AWS, GCP, Azure)
  • Background in chaos engineering and building resilient systems that gracefully handle failures
  • Security mindset with experience implementing safeguards for automated systems
  • Startup mentality: comfortable with ambiguity, rapid iteration, and wearing multiple hats

Ready to Build the Future?

Apply at jobs@explai.com with a cover letter (why you, why us, location, salary expectations, start date) and CV. Only candidates with an existing EU work permit and residence can apply. We love AI, but we hate it writing cover letters. Give us a sense of who you really are!