Senior Site Reliability Engineer (OCI)

Oracle | Australia | Update time: May 25, 2023

Job Description

The Enterprise Engineering (EE) group in the Oracle Cloud Infrastructure (OCI) organization is seeking a motivated Senior Site Reliability Engineer who thrives in a fast-paced, rapidly evolving technology environment. This individual will be a member of the SRE Infrastructure services team and will focus on driving quality standards across all of EE. The purpose of this position is to support the build, operations, confidentiality, and integrity of analysis for Oracle-employed personnel within a secure facility. The primary purpose of the secure facility is for Oracle analysts to review source code as part of a security assurance project.

As part of the Operational Engagement programs, you will be instrumental in fostering a culture of SRE for horizontal activities and DevOps for products and tools across our global operations teams. The team you work in will have diverse expertise in systems, networking, and software development to provide the stability, performance, and reliability our customers need. We work with multiple service development teams, identifying cross-team issues that create risk for operations across the organization and resolving those issues with a mixture of engineering, troubleshooting expertise, and general operational guidance. Your role also requires communication and organizational skills: you are an interface between DevOps Tools and the application teams that implement OCI services. You will deliver solutions that directly contribute to our internal customers’ success.

The role requires skills in the following areas: SRE/DevOps, cloud infrastructure, virtual networking, Linux, and CI/CD. Additional skill sets that are appreciated are Python, Terraform, automation, and knowledge of networking and services running on cloud platforms. The role’s primary focus is providing solutions for infrastructure and services by leveraging software development and industry-standard solutions to automate many of the tasks required to enable and manage our offerings. In addition, this role is responsible for complex problem resolution, creating and improving procedures, and facilitating communication. Other duties include researching, proofing, and authoring technical documentation that is beneficial to the company. This is a great career opportunity for a highly motivated individual who wants to extend and utilize his or her solid and broad skills.

Responsibilities will include working with a global team of SREs and developers to provide a complete solution. You will also work with other development teams to integrate multiple applications into a cohesive whole. End-to-end automation for deployment, configuration, monitoring, self-healing, and alerting will be a continual challenge.

The team you work in will have diverse expertise in systems, networking, and software development to provide the stability, performance, and reliability our customers need. We work with multiple service development teams, identifying cross-team issues that create risk for operations across the organization and resolving those issues with a mixture of engineering, automation, troubleshooting expertise, and general operational guidance. Your role also requires communication and organizational skills: you are an interface between DevOps Tools and the application teams that implement OCI services. You will deliver solutions that directly contribute to our internal customers’ success.

What you will do

Support analysts inside the secure facility and help them troubleshoot workstations/laptops
Build and support desktop hardware, including dismantling desktop devices and performing boot installs
Perform hardware tests and troubleshoot new hardware
Develop cloud infrastructure automation, services, and tooling
Critical support for production environment
Develop and maintain different components, including the Hybrid compute service, spanning networking, system hardware, software development, and operations
Automate Cloud Infrastructure provisioning, maintenance, and administrative functions
Manage CI/CD pipelines
Incident management, deployments, monitoring, automation, patching
Utilize a deep understanding of service topology and the dependencies required to troubleshoot issues and define mitigations.
Serve as part of a 24x7 On Call rotation in support of the infrastructure life cycle
Professional curiosity and a desire to develop a deep understanding of services and technologies.
Use your experience and wisdom from building & running systems and infrastructures as a multiplier to drive operational improvements across OCI, its Service Teams, and its Services.
Use your excellent written & oral communication skills to ask pertinent questions, and to quickly grasp and analyze new or new-to-you systems that are complex and rapidly changing.
Educate yourself and others on anything that helps Service Teams more quickly and easily build, test, deploy & run their Services to be more reliable.
Identify problems and/or opportunities for improvements that are common across many teams/services.
Collaborate with other team members and stakeholders

Qualifications

Skills

Good understanding of Cloud Infrastructure and Virtual Networking
Experience working in closely held/confidential environments
Proficient with maintaining CI/CD pipelines
System Administration including Linux and Windows internals, TCP/IP, DNS, Load balancing technologies
Good programming experience with Python including Object Oriented programming
Versatile in cloud-related technologies
Strong Cloud network experience
Adept in two of three areas: (1) Python, Go or Java, (2) Kubernetes and (3) Terraform
3-5 years’ experience as a site reliability engineer troubleshooting compute, network, storage, and database systems to improve capacity, reliability, scalability, and availability
Proficient with Git source code management (SCM)
OS image builds for Linux and Windows, and patch automation using Python and PowerShell
Familiar with Software Deployment and lifecycle in Cloud
Experience working with fault tolerant, highly available, high throughput, distributed, scalable systems
Good understanding of Agile software development principles including using common tools such as JIRA
Aptitude to be a good team player and the desire to learn and implement new Cloud technologies as needed
Excellent organizational, verbal, and written communication skills

General Qualifications

3-8 years of experience in SRE/DevOps
The work can be demanding at times, particularly as deadlines approach, when extra hours may be required depending on the candidate's delivery capacity.

Educational Qualifications

Bachelor’s or master’s degree in Computer Science or equivalent related field experience

Certifications Preferred

Python Certifications
Cloud Certifications - OCI Certified, AWS Certified, Kubernetes certified
Network Certifications - CCNA
OS Certifications - OEL certified, RHCE certified
Security Certifications - Cloud security certs

Solve complex problems related to infrastructure cloud services and build automation to prevent problem recurrence. Design, write, and deploy software to improve the availability, scalability, and efficiency of Oracle products and services. Design and develop architectures, standards, and methods for large-scale distributed systems. Facilitate service capacity planning and demand forecasting, software performance analysis, and system tuning.

Work with the Site Reliability Engineering (SRE) team on the shared full-stack ownership of a collection of services and/or technology areas. Understand the end-to-end configuration, technical dependencies, and overall behavioral characteristics of production services. Responsible for the design and delivery of the mission-critical stack, with a focus on security, resiliency, scale, and performance. Authority for end-to-end performance and operability. Partner with development teams in defining and implementing improvements in service architecture. Articulate technical characteristics of services and technology areas and guide Development Teams to engineer and add premier capabilities to the Oracle Cloud service portfolio. Understand and communicate the scale, capacity, security, performance attributes, and requirements of the service and technology stack. Demonstrate a clear understanding of automation and orchestration principles. Act as the ultimate escalation point for complex or critical issues that have not yet been documented as Standard Operating Procedures (SOPs). Utilize a deep understanding of service topology and its dependencies to troubleshoot issues and define mitigations. Understand and explain the effect of product architecture decisions on distributed systems. Professional curiosity and a desire to develop a deep understanding of services and technologies.

A BS or MS in Computer Science, or equivalent. Knowledge of server hardware and software configuration, networking, standard internet services, scripting languages, cloud computing patterns, and technology security and compliance. Experience running large-scale customer-facing web services. Understanding of load balancing technologies and experience with development in programming languages, databases and big data stores, and container technologies. Work involves defining and documenting the technical architecture of complex and highly scalable products. A minimum of 5 years’ experience running large-scale customer-facing web services.
