In recent years, AI technologies have been deeply integrated into the education sector, driving the upgrade and innovation of educational tools and supporting high-quality development of the education industry.
Baidu Intelligent Cloud has partnered with TAL Education Group ("TAL"), a trailblazer in implementing large language models (LLMs) in education, leveraging the high-performance Baidu AI Heterogeneous Computing Platform (Baidu AI Ocean) to power TAL's proprietary MathGPT. This collaboration accelerates the integration of LLMs into educational applications, driving intelligent transformation across the industry.
Developing a self-developed large model requires not only a strong algorithm and engineering team but also matching AI infrastructure, including high-performance computing platforms, storage systems, networks, scheduling frameworks, datasets, and more. It also requires a mature engineering platform that can quickly launch the overall R&D project and validate base models, so that subsequent iterations can be driven rapidly by combining application scenarios, educational research data, and business feedback. During both training and inference, the enterprise must also be able to handle large-scale tasks, continuously improving resource utilization and task efficiency on its existing infrastructure so that the self-developed model can be deployed and put into use as quickly as possible.
In response, TAL partnered with Baidu Intelligent Cloud, one of the first practitioners in China's AI industry to pursue the industrial application of large models. Using the Baidu AI Heterogeneous Computing Platform, the two companies built high-performance, professional AI infrastructure to support the self-developed MathGPT, successfully addressing the early-stage process challenges of putting large models into application.
Built on the Baidu AI Heterogeneous Computing Platform, TAL can quickly and conveniently create thousand-GPU-scale training and inference clusters. On the compute side, these clusters use mainstream accelerators such as the NVIDIA A800 and H800 and support scales of up to 16,000 GPUs. On the storage side, the clusters are designed for large-scale deep learning training, offering sub-millisecond latency (as low as 300 μs) with service availability of no less than 99.95%. The clusters also support online elastic scaling, allowing capacity and throughput to grow nearly linearly. In the collaboration with TAL, the platform delivers a single-cluster storage total of more than 500 TB, ensuring high-performance data reads and transfers when training tasks load models and datasets and significantly improving task turnaround time.
For different large-model training scenarios, the Baidu AI Heterogeneous Computing Platform optimizes compute efficiency, memory strategies, and distributed parallelism strategies, and combines them with the characteristics of its high-performance network to greatly improve the training performance of large language models. Models of various sizes, such as the LLaMA 2 and GLM series, have been benchmarked on the platform, reaching acceleration ratios above 90% and model FLOPs utilization (MFU, the ratio of the floating-point operations actually performed in the model's forward and backward passes to the hardware's theoretical peak FLOPs) of 60% to 70%, which greatly reduces training time under limited compute. For AI containers, the platform enables more flexible scheduling policies and task orchestration, laying the foundation for mixed deployment of offline and online tasks and for joint scheduling and allocation of inference and training workloads.
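As a rough illustration only, and not the platform's own methodology, MFU for decoder-only transformer training is commonly estimated with the "6 FLOPs per parameter per token" approximation. The sketch below shows that calculation; the model size, throughput, cluster size, and per-GPU peak are hypothetical placeholders, not measurements from TAL or Baidu.

```python
# Illustrative estimate of model FLOPs utilization (MFU) for decoder-only
# transformer training, using the common "6 * parameters" FLOPs-per-token rule.
# All numbers are hypothetical placeholders, not TAL or Baidu measurements.

def estimate_mfu(params: float, tokens_per_second: float,
                 num_gpus: int, peak_flops_per_gpu: float) -> float:
    """Achieved model FLOPs/s divided by aggregate peak hardware FLOPs/s."""
    achieved_flops_per_s = 6.0 * params * tokens_per_second  # forward + backward
    peak_flops_per_s = num_gpus * peak_flops_per_gpu
    return achieved_flops_per_s / peak_flops_per_s

if __name__ == "__main__":
    mfu = estimate_mfu(
        params=70e9,                 # e.g. a 70B-parameter model
        tokens_per_second=480_000,   # hypothetical cluster-wide training throughput
        num_gpus=1024,               # a thousand-GPU-scale cluster
        peak_flops_per_gpu=312e12,   # assumed BF16 peak of an A800-class GPU
    )
    print(f"Estimated MFU: {mfu:.1%}")  # prints roughly 63% for these inputs
```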
Moreover, for key training tasks, the Baidu AI Heterogeneous Computing Platform provides upstream data services that conveniently help TAL transfer models and datasets hosted overseas on Hugging Face. During training, its visualization capabilities aggregate metrics such as resource usage and workloads into dashboard monitoring. Downstream, packaged serving capabilities help TAL rapidly deploy and launch inference tasks in a one-stop manner.
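The platform's data service itself is proprietary; purely as a generic illustration of the kind of transfer involved, a dataset snapshot could be pulled with the public huggingface_hub client and staged onto cluster storage. The repository name and local path below are hypothetical.

```python
# Generic illustration of pulling a Hugging Face dataset snapshot for training;
# this is not Baidu's data service, and the repo name and paths are hypothetical.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="example-org/example-math-corpus",  # hypothetical dataset repository
    repo_type="dataset",
    local_dir="/mnt/cluster-storage/datasets/example-math-corpus",
)
print(f"Dataset downloaded to: {local_path}")
```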
Currently, the "MathGPT" trained with support from the Baidu AI Heterogeneous Computing Platform has been widely applied in TAL’s flagship learning hardware Xueersi Xpad and multiple business scenarios, providing users with a more intelligent experience.
In the future, Baidu Intelligent Cloud will continue to work with TAL, applying the technological power of large AI models to innovative educational scenarios and learning methods, creating intelligent, personalized educational technology products and solutions, and contributing to building China into a leading nation in education.
This article was published in People’s Daily on April 29, 2024, page 12.