National University Corporation - Notice of request for submission of materials: Integrated Infrastructure System for Simulation, Data, Learning, and Inference, 1 Set

This procurement is covered by the WTO Agreement on Government Procurement, the Japan-EU Economic Partnership Agreement, or the Japan-UK Comprehensive Economic Partnership Agreement.


Publishing date Feb 03, 2025
Type of notice Notice of request for submission of materials
Procurement entity National University Corporation, The University of Tokyo
Classification
0014 Office Machines & Automatic Data Processing Equipment
Summary of notice ⑴ Classification of the products to be procured : 14
⑵ Nature and quantity of the products to be rented : Integrated Infrastructure System for Simulation, Data, Learning, and Inference, 1 Set
⑶ Type of the procurement : Rent
⑷ Basic requirements of the procurement :
A "Integrated Infrastructure System for Simulation, Data, Learning, and Inference" must satisfy the following specifications :
a This system consists of "Data and Inference Infrastructure (Infrastructure System for Data Exploitation Platform)" part, "Simulation and Learning Infrastructure" part, and Storage. Compute nodes and Storage must be able to be used transparently with each other.
B "Data and Inference Infrastructure" part of "Integrated Infrastructure System for Simulation, Data, Learning, and Inference" must satisfy the following specifications :
a It must satisfy the following hardware specifications :
① Compute nodes of the part consist of the general-purpose CPU nodes, the data analysis nodes, and the inference nodes. The nodes must be able to communicate outside the system with a total bandwidth of 800Gbps or more.
② Total memory bandwidth of the general-purpose CPU nodes must be 350TByte/sec. or more, and total memory capacity must be 100TiByte or more.
③ Total theoretical peak performance of the general-purpose CPU nodes (by double-precision floating-point arithmetic) must be 3.0PFLOPS or more. Compute accelerators must not be employed. (A worked example of the peak-performance calculation follows this list, after item ⑩.)
④ Total memory capacity of compute accelerators in the data analysis nodes must be 8.2TiByte or more.
⑤ Total theoretical peak performance of compute accelerators in the data analysis nodes (by floating-point arithmetic with FP4 or higher precision, without considering sparsity) must be 260PFLOPS or more.
⑥ Total memory capacity of compute accelerators in the inference nodes must be 8.2TiByte or more.
⑦ Total theoretical peak performance of compute accelerators in the inference nodes (by floating-point arithmetic with FP4 or higher precision, without considering sparsity) must be 260PFLOPS or more.
⑧ Each compute node must be equipped with NVMe-connected SSD(s) with a physical capacity of 3.0TByte or more.
⑨ The network interface of the interconnect employed by the compute nodes must be 400Gbps or more per node for the general-purpose CPU nodes, and 400Gbps or more per accelerator for the data analysis nodes and the inference nodes. Data in the memory of a compute accelerator device must be directly transferable over the interconnect without passing through the main memory of the general-purpose CPU.
⑩ External network interfaces of compute nodes must be 100Gbps or more per node.
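For reference, the "theoretical peak performance" in items ③, ⑤, and ⑦ above is conventionally the product of the number of nodes, the number of cores (or accelerators) per node, the clock frequency, and the number of floating-point operations per cycle. A minimal worked example for item ③, using hypothetical figures that are not part of this specification:

\[
P_{\mathrm{peak}} = N_{\mathrm{nodes}} \times N_{\mathrm{cores/node}} \times f_{\mathrm{clock}} \times \mathrm{FLOP/cycle}
\]
\[
500 \times 192 \times 2.0\,\mathrm{GHz} \times 16\,\mathrm{(DP\ FLOP/cycle)} = 3.07\,\mathrm{PFLOPS} \geq 3.0\,\mathrm{PFLOPS}
\]

The 500-node, 192-core, 2.0GHz, 16 FLOP/cycle configuration above is an illustrative assumption only; any proposed configuration would be evaluated by this kind of calculation.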
b It must satisfy the following software specifications :
① The system must include container management functionality based on Kubernetes. The system must provide a web portal for its administration. (An illustrative sketch follows item ② below.)
② The system must provide a web portal for project management.
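As a non-normative illustration of item ① above, the following is a minimal sketch that queries a Kubernetes-managed cluster with the official Python client. It assumes a reachable cluster and a valid kubeconfig; it is not part of the specification.

# Minimal sketch: listing workloads on a Kubernetes-managed system with
# the official Python client (pip install kubernetes). Assumes a valid
# kubeconfig; all names are illustrative, not part of the specification.
from kubernetes import client, config

def list_running_pods() -> None:
    config.load_kube_config()  # reads ~/.kube/config by default
    v1 = client.CoreV1Api()
    for pod in v1.list_pod_for_all_namespaces(watch=False).items:
        print(pod.metadata.namespace, pod.metadata.name, pod.status.phase)

if __name__ == "__main__":
    list_running_pods()

A web portal of the kind required by items ① and ② would typically be built on top of such API calls.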
C "Simulation and Learning Infrastructure" part of "Integrated Infrastructure System for Simulation, Data, Learning, and Inference" must satisfy the following specifications :
a It must satisfy the following hardware specifications :
① Compute nodes of the part consist of the simulation nodes and the learning nodes.
② Total theoretical peak performance (by double-precision floating-point arithmetic) of the simulation nodes must be 250PFLOPS or more.
③ Total memory bandwidth of the simulation nodes must be 18PByte/sec or more.
④ Total theoretical peak performance (by floating-point arithmetic with FP4 or higher precision, without considering sparsity) of the learning nodes must be 2.6EFLOPS or more.
⑤ The network interface of the interconnect employed by the compute nodes must be 400Gbps or higher per accelerator. Data in the memory of a compute accelerator device must be directly transferable over the interconnect without passing through the main memory of the general-purpose CPU.
⑥ Each compute node must be equipped with NVMe-connected SSD(s) with a physical capacity of 3.0TByte or more.
⑦ The simulation nodes and the learning nodes must be interconnected by a network with a bandwidth of 5.0TByte/sec or more.
b It must satisfy the following software specifications :
① A Linux operating system must run.
② The Fortran 2008, C11, and C++17 languages must be supported, including an automatic SIMD vectorization function and the OpenMP API (version 4.5 or higher). For the compute accelerators, the Fortran 2008, C11, and C++17 languages must be supported with an automatic parallelization function, the OpenACC API (version 2.7 or higher), or the OpenMP API (version 5.0 or higher).
③ An MPI 3.1 library must be provided.
④ The Python language must be supported. (An illustrative MPI-from-Python sketch follows item ⑧ below.)
⑤ Highly optimized math libraries and learning libraries must be provided.
⑥ A batch job system must be provided. It must be possible to execute a job that simultaneously uses both the simulation nodes and the learning nodes.
⑦ A container system must be provided.
⑧ The management servers for the "Simulation and Learning Infrastructure" part should be configured using the general-purpose CPU node group of the "Data and Inference Infrastructure" part.
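As a non-normative illustration of items ③ and ④ above (an MPI 3.1 library usable from Python), the following is a minimal sketch that performs an allreduce with mpi4py and NumPy. It assumes the mpi4py and numpy packages and an mpiexec launcher; it is not part of the specification.

# Minimal sketch: an MPI allreduce from Python via mpi4py (which targets
# the MPI 3.1 standard) and NumPy. Launch with, e.g.: mpiexec -n 4 python demo.py
# The array size is illustrative only.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank contributes a local array; Allreduce sums them element-wise
# across every rank participating in the job.
local = np.full(4, rank, dtype=np.float64)
total = np.empty_like(local)
comm.Allreduce(local, total, op=MPI.SUM)

if rank == 0:
    print("element-wise global sum:", total)

A batch job of the kind described in item ⑥ could launch such a program across the simulation nodes and the learning nodes simultaneously.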
D The Storage of the "Integrated Infrastructure System for Simulation, Data, Learning, and Inference" must satisfy the following specifications :
a It must satisfy the following hardware specifications :
① The Fast Storage System must be highly reliable, with a capacity of 20PByte or more. The Fast Storage System must achieve access from all the compute nodes of the "Simulation and Learning Infrastructure" part with a bandwidth of 1.2TByte/sec or more.
② The Archive Storage System must be highly reliable, with a capacity of 20PByte or more for the storage system and 5PByte or more for the tape archive system. The Archive Storage System must be available from the other supercomputer systems at ITC/U.Tokyo (the Information Technology Center, The University of Tokyo).
b It must satisfy the following software specifications :
① The area on the Fast Storage System must be mountable as a parallel file system from all the compute nodes of this system, and POSIX access must be available. It must have a file compression function.
② The area on the Fast Storage System must be accessible as AWS S3 (Amazon Web Services Simple Storage Service) compatible object storage by all the compute nodes of this system and from outside this system. It must have a file compression function. It is desirable that files can be referenced transparently together with item ① above. (An illustrative access sketch follows item ⑤ below.)
③ It must be possible to connect to the area on the Fast Storage System as block devices using the NVMe-over-Fabrics protocol from the compute nodes in the "Data and Inference Infrastructure" part.
④ The area on the Archive Storage System must be mountable as a file system from all the compute nodes of this system, and POSIX access must be available. It must also be accessible as an online storage service from outside this system. It must have a file compression function. It must support hierarchical storage management, including a tape archive system.
⑤ The Storage System must have functions for managing user and group information, and functions for mapping and providing that information to the "Data and Inference Infrastructure" part and the "Simulation and Learning Infrastructure" part.
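As a non-normative illustration of item ② above (S3-compatible object storage access), the following is a minimal sketch using boto3 against a custom endpoint. The endpoint URL, bucket name, and credentials are placeholder assumptions, not part of the specification.

# Minimal sketch: accessing an S3-compatible object store with boto3
# (pip install boto3). Endpoint, bucket, and credentials are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://objstore.example.ac.jp",  # hypothetical endpoint
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

# Upload a result file, then list the bucket contents to confirm.
s3.upload_file("result.dat", "my-project-bucket", "runs/001/result.dat")
for obj in s3.list_objects_v2(Bucket="my-project-bucket").get("Contents", []):
    print(obj["Key"], obj["Size"])

Because item ① requires POSIX mounting of the same area, the transparent cross-referencing described in item ② would let such an object also appear as a regular file under the parallel file system.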
E The interconnect of the "Integrated Infrastructure System for Simulation, Data, Learning, and Inference" must satisfy the following specifications :
a It must satisfy the following hardware specifications :
① The bandwidth between the "Data and Inference Infrastructure" part and the "Simulation and Learning Infrastructure" part must be at least 3.5TByte/sec.
b It must satisfy the following software specifications :
① The compute nodes in the "Simulation and Learning Infrastructure" part must be able to communicate with the outside of the system via the "Data and Inference Infrastructure" part.
F The overall maximum power consumption, excluding the cooling system, must be 4.5MVA or less. The power capacity, cooling facility, and system assembly are required to be carefully designed so that the system is kept cool even if the CPUs, accelerators, memory, and disks are fully and continuously operated. The general-purpose CPUs and accelerators must be water-cooled. The footprint of the entire system, excluding the cooling system, must be equal to or less than 370 square meters. The footprint of the cooling system, which is located outdoors, must be equal to or less than 500 square meters.
⑸ Time limit for the submission of the requested material : 17:00, 17 March 2025
⑹ Contact point for the notice : WADA Kazuhiro, Accounting Team, Information Strategy Group, Information Systems Department, The University of Tokyo, 6-2-3 Kashiwanoha, Kashiwa-shi, Chiba-ken 277-0882, Japan, TEL 070-1531-4283