AWS Infrastructure Engineer - Zepz Florida, New York, United States

Listing Description

About the role:

You are a Technical Support Engineer who is passionate about our customers and their experience with WorldRemit services. You have breadth of technical knowledge (rather than depth) and are responsible for managing WorldRemit incidents and technical operations on our production platform.

You understand our overall systems architecture and are able to drive swift resolution of incidents by coordinating with various technical teams (DevOps, Infrastructure, Engineering, Suppliers …). You have experience in automation and will drive improvements to streamline monitoring, alerting and incident resolution processes, in collaboration with our DevOps, SRE, and Engineering teams.

We use a modern DevOps and SRE tech stack (Jenkins, K8s, Harness, AppDynamics, Python, Terraform) and Agile working practices to get the job done.

As a member of the Zepz Cloud Operations team, you will aim high, embrace challenges and always do what’s right, acting with integrity and building trust as you contribute to the company’s technical direction and long-term decision making.

What you will own:

Reporting to the Cloud Infrastructure Manager, you will:

  • Monitor our Production systems and react to alerts swiftly.

  • Ensure 24x7 availability of our product platform, working with the Tech teams.

  • Participate in the development of our monitoring and alerting strategies with the SRE team across multiple cloud environments, in particular AWS, using advanced monitoring tools such as Grafana, AppDynamics and Splunk.

  • Apply your experience in cloud and on-premises infrastructure, understanding the challenges and considerations of migrating workloads from on-premises to AWS. This includes an understanding of SQL, Windows Server, Active Directory, DNS, VMware and networking, and a willingness to research and advise on new technologies and developments.

  • Manage incidents: categorization, triage, resolution and escalation.

  • Communicate appropriately with our business stakeholders on incidents (Customer Service…)

  • Participate in an on-call/shift rotation.

  • Use code to solve problems: configuration, infrastructure, tooling and automation should all be addressed by writing high-quality code that performs and scales.

  • Apply best practices and standards for observability, monitoring, alerting, capacity planning, availability, performance/latency, change and troubleshooting across all our Tech services.

  • Work closely with feature teams to ensure that services are correctly monitored, change is delivered in a safe and secure way, resilience is built into our product, and our standards and best practices are adopted.

  • Lead or be involved in the troubleshooting of complex incidents and problems.

  • Maintain visibility of the end-to-end service to our customers and ensure their journey is stable and consistent across all microservices and third-party dependencies, using the observability tooling you will have implemented with the Engineering teams.

  • Perform various technical operations in collaboration with the DevOps and Infrastructure teams (patching, log management, space management …).

  • Develop various technical runbooks in collaboration with other tech teams.

  • Participate in the continuous improvement of our operational processes (Incident, Problem, Change …).

  • Provide input to Post Incident Reviews / Post Mortems and take the initiative to prevent and reduce incidents.

What you bring to the table: 

  • A skilled Engineer. At least 5 years in Cloud Operations/Platform Services with a keen interest in solving problems using automation.

  • An understanding of SRE and DevOps methodologies. You understand the build and deployment cycle of an application, and how to operate a resilient system.

  • Strong experience in Incident & Event Management (NOC, App Support…).

  • Experience supporting and troubleshooting 24x7, high-volume transactional web applications.

  • Knowledge of Windows and Linux systems

  • Experience with cloud infrastructure and platform services (we run on AWS)

  • Familiarity with Terraform and IaC best practices

  • Solid experience with GitHub or other version control systems

  • Experience with APM systems such as Dynatrace, AppDynamics and/or New Relic

  • Experience with alerting tools such as Grafana OnCall, PagerDuty, or OpsGenie

  • Experience in scripting languages such as Python, Bash and PowerShell

  • Strong verbal and written communication skills. Ability to take ownership of issues.

  • Systematic problem-solving approach. You should have an understanding of how to analyze and troubleshoot large-scale distributed systems.

  • Happy in the Clouds. Our Cloud Native platform is hosted on AWS. You’ll be comfortable working with a system that supports users from around the world, at scale. Experience working for a digital company delivering real-time transactional services (finance/regulated) is preferred.

  • Bias for action. You see a problem, you fix a problem. You get buy-in for your solutions and keep tickets moving. We’re always looking for ways to ship at pace.

  • Growth mindset. A willingness to use your skills and experience to mentor less-experienced engineers. A desire to learn from others and make yourself better every day. 

  • Agile outlook. You need to be excited about working in a fast-changing environment. Products, tools, frameworks and processes change; we evolve and take the best bits with us. The teams drive the evolution.
