DevOps Engineer (GPU Cluster)_CI
Bosch Group, Changning qu
Update time: November 11, 2021

Company Description

- Develops implementation proposals and designs scalable enterprise information architecture
- Participates in project workshops and presents the designed solutions
- Reviews and audits existing solutions, designs and system architecture
- Profiles and troubleshoots existing solutions

Job Description

·       Working in Bosch's international GPU cluster DevOps team, you will be responsible for the operation and ongoing development of the GPU cluster that serves as the AI Deep Learning Platform.

·       Development of additional features on top of the existing service, such as rolling out new software and implementing new cluster interfaces (e.g. RESTful API, load balancing).

·       Implementation of performance monitoring (e.g. dashboards).

·       Automation and deployment (e.g. patch management, integration of new compute nodes into the cluster).

·       Preparation and execution of maintenance for all of Bosch's GPU clusters worldwide, e.g. security updates, compatibility testing and rollout.

·       Resolution of user incidents via various channels, e.g. issues with GPU devices or the scheduling system, and user issues in cluster usage (e.g. access, compute jobs, software management).

·       Software deployment and maintenance (e.g. new versions).

·       Sysadmin housekeeping tasks (e.g. config cleanup).

·       Building, expanding and maintaining the shared knowledge base.

Qualifications

·       Bachelor's or master's degree in Computer Science, Mathematics, Engineering, or a relevant technical discipline.

·       3+ years of hands-on experience with Linux or HPC (High Performance Computing) and DevOps.

·       Deep knowledge of general Linux server administration, such as Linux system management, networking, security and container technologies.

·       Know-how in the HPC stack, such as batch scheduling (IBM LSF, Slurm, PBSPro or similar), parallel file systems, automated deployment and software management (Anaconda, environment modules).

·       Know-how in public cloud technologies such as MS Azure and AWS.

·       Software development experience is a plus.

·       Know-how in the GPU computing domain (CUDA, cuDNN, NCCL, TensorFlow, PyTorch, CST, etc.) is a plus.

·       Good teamwork and cooperation with a global team.

·       Ability to quickly learn and master new data technologies.

·       English (reading and writing).
