DevOps Engineer (GPU Cluster)_CI

Bosch Group, Changning qu. Update time: November 11, 2021

Company Description

- Develops proposals for implementation and designs scalable enterprise information architecture
- Participates in project workshops and presents designed solution
- Performs reviews and audits of existing solution, design and system architecture
- Performs profiling, troubleshooting of existing solutions

Job Description

· Working in an international DevOps team, you will be responsible for the operation and development of the GPU cluster for the AI Deep Learning Platform.

· Development of additional features for the service, such as rolling out new software and implementing new cluster interfaces (e.g. RESTful API, load balancing)

· Implementation of performance monitoring (e.g. dashboards)

· Automation and deployment (e.g. patch management, integration of new compute nodes into the cluster)

· Preparation and execution of maintenance for all clusters, e.g. security updates, compatibility testing and rollout

· Resolution of user incidents via various channels, e.g. issues with GPU devices or the scheduling system, and user issues in cluster usage (e.g. access, compute jobs, software management)

· Software deployment and maintenance (e.g. new versions)

· Sysadmin housekeeping tasks (config cleanup, etc.)

· Build, expand, and maintain the knowledge base

Qualifications

· Bachelor's or master's degree in Computer Science, Mathematics, Engineering, or a related technical discipline

· 3+ years of hands-on experience with Linux or HPC (High Performance Computing) and DevOps

· Deep knowledge of general Linux server administration, including system management, networking, security and container technologies

· Know-how in the HPC stack, such as batch scheduling (IBM LSF, Slurm, PBSPro or similar), parallel file systems, automated deployment, and software management (Anaconda, Environment Modules)

· Know-how in cloud technologies such as MS Azure and AWS

· Software development experience is a plus.

· Know-how in the GPU computing domain (CUDA, cuDNN, NCCL, TensorFlow, PyTorch, CST, etc.) is a plus.

· Good teamwork and cooperation with a global team

· Quick learner of new data technologies

· English (reading and writing)
