Company Description
- Develops proposals for implementation and designs scalable enterprise information architecture
- Participates in project workshops and presents the designed solutions
- Performs reviews and audits of existing solutions, designs and system architecture
- Performs profiling and troubleshooting of existing solutions
Job Description
· Working in an international DevOps team, you will be responsible for the operation and development of the GPU cluster for the AI Deep Learning Platform.
· Development of additional features for the service, such as rollout of new software and implementation of new cluster interfaces (e.g. RESTful API, load balancing)
· Implementation of performance monitoring (e.g. dashboards)
· Automation & Deployment (e.g. patch management, integration of new compute nodes into cluster)
· Preparation and execution of maintenance for all clusters, e.g. security updates, compatibility testing and rollout.
· Resolution of user incidents via various channels, e.g. issues with GPU devices or the scheduling system, and user issues in cluster usage (e.g. access, compute jobs, software management).
· Software deployment and maintenance (e.g. new versions)
· Sysadmin housekeeping tasks (config cleanup, etc.)
· Build, expand and maintain the knowledge base
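· As a member of Bosch's global GPU cluster DevOps team, you will be responsible for the ongoing development and operation of the GPU cluster that serves as the AI deep learning platform.
· Develop new feature modules on top of the platform's existing services (e.g. RESTful API, load balancing).
· Develop performance monitoring and related functions for the platform (e.g. visual dashboards).
· Automated deployment (e.g. software package management, integrating newly added compute nodes into the cluster).
· Operation and maintenance of Bosch's GPU compute clusters worldwide, e.g. security package updates, compatibility testing, capacity expansion.
· Support users through various channels and resolve issues that may arise, e.g. GPU device problems, job scheduling problems, and other problems users may encounter when using the cluster (access, compute jobs, software management, ...).
· Development and maintenance of software packages, e.g. new version iterations.
· Routine system administrator tasks, e.g. refreshing configuration items.
· Build, and continuously maintain and enrich, a shared knowledge base.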
Qualifications
· Degree in Computer Science, Mathematics, Engineering, or a relevant technical discipline (bachelor's or master's)
· 3+ years of hands-on experience with Linux or HPC (High Performance Computing) and DevOps.
· Deep knowledge of general Linux server administration, such as Linux system management, networking, security and container technologies.
· Know-how in the HPC stack, such as batch scheduling (IBM LSF, Slurm, PBSPro or similar), parallel file systems, automated deployment, software management (Anaconda, Environment Modules)
· Know-how in cloud technologies such as MS Azure, AWS.
· Software development experience is a plus.
· Know-how in the GPU computing domain (CUDA, cuDNN, NCCL, TensorFlow, PyTorch, CST, etc.) is a plus.
· Good teamwork and cooperation with a global team
· Quick learner of new data technologies
· English (read/write).
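· Degree in computer science, mathematics, engineering or a related technical discipline (bachelor's or master's).
· 3+ years of hands-on experience with Linux, high performance computing (HPC) or DevOps.
· Proficient in Linux server administration, with in-depth knowledge of related areas such as Linux system management, networking, security and container technologies.
· Command of the HPC technology stack, such as job scheduling (IBM LSF, Slurm, PBSPro or similar frameworks), parallel file systems, automated deployment, software management (Anaconda).
· Familiarity with public cloud platforms, such as Microsoft Azure and AWS.
· Knowledge of the GPU computing domain is a plus, e.g. CUDA, cuDNN, NCCL, TensorFlow, PyTorch, CST.
· Able to collaborate well with a global team.
· Able to quickly learn and master new data technologies.
· English (reading and writing).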