Senior Data Engineer, Infrastructure Reliability
Amazon
Description
Join Amazon's Fulfillment Technologies & Robotics (FTR) team to build the data foundation powering the next generation of AI-enabled infrastructure reliability — a platform designed to keep Amazon's global fulfillment network running continuously, moving toward fully autonomous, zero-touch operations.
As a Data Engineer III on the Infrastructure Reliability team, you will design, build, and scale the data pipelines, models, and warehousing infrastructure that feed machine learning systems, multi-agent orchestration platforms, and real-time observability tools across thousands of fulfillment sites. Your work will be operationally critical, technically demanding, and globally impactful. If you are energized by hard data problems at enormous scale, and want your work to matter in production every single day, this is the role for you.
Key job responsibilities
- Design, build, and maintain scalable ETL/ELT pipelines that ingest, transform, and serve operational data from thousands of fulfillment sites to ML models, detection systems, and dashboards
- Develop and own data models that guide AI-powered progressive incident detection, consolidation, and remediation orchestration across cross-domain fulfillment systems
- Partner closely with data scientists, software engineers, and product managers to define data requirements, validate feature engineering approaches, and ensure model-ready data pipelines are reliable and low-latency
- Build and operate large-scale data warehouse solutions on AWS (Redshift, S3, Glue, EMR) and Apache Spark, supporting both batch and near-real-time workloads
- Establish and enforce data quality frameworks, monitoring, and alerting to ensure the reliability of data feeding autonomous operational systems, where data errors carry real operational risk
- Define and implement data governance standards, access patterns, and documentation so that data assets are discoverable, trustworthy, and reusable across teams
- Mentor junior data engineers on best practices in pipeline design, code quality, testing, and data modeling
- Identify and eliminate bottlenecks in existing data infrastructure, continuously improving pipeline performance, cost efficiency, and maintainability
A day in the life
You start the morning reviewing pipeline health dashboards and triaging any data quality alerts before the ML team begins model training runs. Mid-morning, you join a working session with a data scientist to align on feature definitions for a new anomaly detection model — pulling sample data to validate assumptions together. After lunch, you spend focused time extending a near-real-time ingestion pipeline to support a new incident signal from a robotics domain team. You close the day in a design review with a senior engineer, walking through your proposed schema changes for a new consolidated incident data model. No two days are exactly the same, but every day your work is directly enabling a platform that keeps Amazon's fulfillment network running.
Amazon offers a full range of benefits that support you and eligible family members, including domestic partners. Benefits can vary by location, the number of regularly scheduled hours you work, length of employment, and job status such as seasonal or temporary employment. The benefits that generally apply to regular, full-time employees include:
1. Medical, Dental, and Vision Coverage
2. Maternity and Parental Leave Options
3. Paid Time Off (PTO)
4. 401(k) Plan
If you are not sure that every qualification for this role describes you exactly, we'd still love to hear from you! At Amazon, we value people with unique backgrounds, experiences, and skillsets. If you're passionate about this role and want to make an impact on a global scale, please apply!
About the team
The Infrastructure Reliability team sits within Amazon's Robotics organization and operates as the cross-domain orchestration layer for a fulfillment network that processes customer orders continuously across thousands of global sites. Our mission is straightforward and non-negotiable: operations never stop, no matter what breaks.
We do not own any single fulfillment domain — instead, we build the platform that sees across all of them, detecting failures that cross team boundaries and coordinating resolution faster than any single team could manage alone. We are now investing heavily in AI-powered detection, multi-agent remediation orchestration, and unified observability — moving from rule-based approaches toward LLM-powered autonomous resolution at scale.
We value technical rigor, customer obsession, and hands-on depth. We are a small team working on a large and growing problem, and every team member has meaningful influence over technical direction. If you want to work on something that is technically fascinating, operationally critical, and commercially enormous, this is the team for you.