by Arpit Kumar
18 Apr, 2023
6 minute read
Wafer Scale Engine for Training LLMs

A wafer-scale engine offers an alternative approach to training large models: a monolithic chip that integrates processing cores, memory, and interconnects onto a single piece of silicon.


I was really excited about GPT4All because it was super easy to set up and I could run it right on my laptop. But the model, which is based on Llama, is only for research purposes. So, unfortunately, I couldn’t use it for any professional work.

Anyway, I decided to look for open-source models that have licenses that allow for fine-tuning and commercial use. I came across a few options like Open Assistant, Dolly, and Cerebras-GPT.

They’re all open-source and can be used for commercial work. The training approach of Cerebras-GPT is different from that of other open-source and closed-source models. While these models might not be as accurate as ChatGPT for more complicated queries, they’re still really useful for simple tasks. Plus, they’ll probably get better over time and work for more and more things.

How expensive is it to train GPT-3?

The cost of a single training run for GPT-3 is estimated at around $1.4 million, and for some larger LLMs (Large Language Models) the training cost ranges from $2 million to $12 million. You need to train them on multiple nodes with multiple GPUs, which is crazy! The cost of running them is also very high.
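
To get a feel for where numbers like these come from, here is a back-of-the-envelope sketch in Python. It uses the common rule of thumb that training a dense transformer takes roughly 6 FLOPs per parameter per token, together with GPT-3's published size of 175 billion parameters and roughly 300 billion training tokens. The sustained throughput per GPU and the hourly price are hypothetical placeholders, not quoted or measured values, so the dollar figure it prints is purely illustrative.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute: about 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


def training_cost_usd(total_flops: float,
                      sustained_flops_per_gpu: float,
                      usd_per_gpu_hour: float) -> float:
    """Convert total compute into a dollar estimate via GPU-hours."""
    gpu_seconds = total_flops / sustained_flops_per_gpu
    gpu_hours = gpu_seconds / 3600.0
    return gpu_hours * usd_per_gpu_hour


# GPT-3: 175 billion parameters, roughly 300 billion training tokens.
flops = training_flops(175e9, 300e9)
print(f"{flops:.3e} FLOPs")  # 3.150e+23 FLOPs

# ASSUMPTIONS (hypothetical): sustained throughput and hourly price per GPU.
ASSUMED_SUSTAINED_FLOPS = 40e12   # FLOP/s actually achieved per GPU
ASSUMED_USD_PER_GPU_HOUR = 1.50   # rental price per GPU-hour

cost = training_cost_usd(flops, ASSUMED_SUSTAINED_FLOPS, ASSUMED_USD_PER_GPU_HOUR)
print(f"about ${cost:,.0f} under these assumptions")
```

The estimate scales linearly with both assumed values, so halving the achieved throughput doubles the bill. That is why hardware utilization matters so much at this scale.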



Challenges in GPU-based training

Normal GPU-based training for deep learning models can face a number of challenges, including:

  1. Memory limitations: GPUs have a limited amount of memory, which can be a bottleneck for large deep learning models that require a lot of memory. If the model doesn’t fit into the GPU memory, it will need to be split into smaller pieces, which can slow down the training process.
  2. Slow data transfer: In addition to memory limitations, transferring data to and from the GPU can also be slow. This can be a problem when working with large datasets or when doing distributed training across multiple GPUs or nodes.
  3. Bottlenecks in training: GPUs are designed to perform massive amounts of parallel computation, but not all parts of a deep learning model can be parallelized efficiently. Operations that have to run sequentially can end up limiting the overall training speed.
  4. Hardware and software compatibility: Some deep learning models may require specific hardware configurations or software libraries that are not compatible with all GPUs or systems. This can limit the choice of hardware and software configurations available for deep learning.
  5. Power consumption: GPUs can be power-hungry, which can be a concern for large-scale training or for use in data centers.
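
The memory limitation in point 1 is easy to put into numbers. The sketch below uses a commonly cited accounting for mixed-precision training with the Adam optimizer: 16 bytes of training state per parameter, before counting activations. The 80 GB GPU size is an assumption chosen for illustration, and the function names are made up for this example.

```python
# Bytes of training state per parameter for mixed-precision Adam
# (a commonly cited accounting; activations are NOT included).
BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam momentum": 4,
    "fp32 Adam variance": 4,
}


def training_state_bytes(n_params: int) -> int:
    """Memory needed for weights, gradients and optimizer state."""
    return n_params * sum(BYTES_PER_PARAM.values())


def min_gpus_for_state(n_params: int, gpu_mem_bytes: int) -> int:
    """Smallest number of GPUs whose combined memory holds the training state."""
    total = training_state_bytes(n_params)
    return -(-total // gpu_mem_bytes)  # integer ceiling division


n_params = 175_000_000_000        # GPT-3 sized model
gpu_mem = 80 * 10**9              # ASSUMPTION: 80 GB of memory per GPU

print(f"{training_state_bytes(n_params) / 1e9:.0f} GB of training state")  # 2800 GB
print(min_gpus_for_state(n_params, gpu_mem), "GPUs just to hold it")       # 35
```

Even before any activations are stored, a model of this size cannot sit on one device, which is what forces the splitting described above.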

Current approaches to train LLMs

There are several approaches to training LLMs (Large Language Models) on GPUs, either on a single device or distributed across many. Here are some common ones:

  1. Single-GPU training: This is the simplest approach, where the model is trained on a single GPU. The whole model has to fit in the GPU’s memory; the training data is split into batches, and the model weights are updated after each batch. This approach is suitable for smaller models or datasets.
  2. Data parallelism: In this approach, the model is replicated across multiple GPUs, and each GPU processes a subset of the training data in parallel. The gradients computed by each GPU are then combined to update the model weights. This approach is effective for scaling up to larger datasets and models.
  3. Model parallelism: In this approach, the model is split across multiple GPUs, with each GPU processing a different part of the model. This approach is effective for scaling up to larger models with a large number of parameters.
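  4. Pipeline parallelism: In this approach, the model’s layers are split into consecutive stages, with each stage placed on a separate set of GPUs. Each batch of training data is divided into smaller micro-batches that flow through the stages like an assembly line, so different stages can work on different micro-batches at the same time. This approach is effective for scaling up to larger models with a large number of layers.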
  5. Distributed training: In this approach, the data and the model are distributed across multiple nodes in a cluster, with each node having one or more GPUs. The training process is coordinated across the nodes, with the gradients computed by each node being combined to update the model weights. This approach is effective for scaling up to very large datasets and models.
  6. Mixed-precision training: In this approach, most weights and activations are stored and processed using lower-precision data types, such as 16-bit floating-point numbers, instead of the usual 32-bit floating-point numbers, typically while keeping a 32-bit copy of the weights for the update step. This allows for faster processing and reduces the memory requirements, enabling larger models to be trained on GPUs.

The choice of approach will depend on the specific requirements of the model and the available hardware resources.
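
To make data parallelism a bit more concrete, here is a minimal sketch in plain Python with no GPUs involved. It uses a toy one-parameter model, y = w * x, with mean squared error, splits the batch evenly across simulated workers, and averages their gradients. The averaging step stands in for the gradient-combining step a real framework would run across devices; all function names here are made up for illustration.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the one-parameter model y = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def shard(seq, n_workers):
    """Split seq into n_workers equal, contiguous shards."""
    if len(seq) % n_workers != 0:
        raise ValueError("batch size must be divisible by the number of workers")
    size = len(seq) // n_workers
    return [seq[i * size:(i + 1) * size] for i in range(n_workers)]


def data_parallel_grad(w, xs, ys, n_workers):
    """Each simulated worker computes a local gradient; then they are averaged."""
    local_grads = [
        grad_mse(w, sx, sy)
        for sx, sy in zip(shard(xs, n_workers), shard(ys, n_workers))
    ]
    # Stand-in for the cross-device gradient averaging step.
    return sum(local_grads) / n_workers


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]   # data generated with the "true" weight 2.0
w = 0.5                      # current (wrong) weight

full = grad_mse(w, xs, ys)
parallel = data_parallel_grad(w, xs, ys, n_workers=4)
print(full, parallel)        # -76.5 -76.5
```

With equal-sized shards, the averaged gradient matches the full-batch gradient, so every replica applies the same update and the copies of the model stay in sync.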

Wafer Scale Engines

However, a new class of computing systems, wafer-scale engines (WSEs), has emerged. They are designed to enable the training and deployment of large-scale deep neural networks (DNNs) in a more energy-efficient and scalable manner than traditional GPU-based distributed training.

A WSE is a large, monolithic chip that integrates hundreds of thousands of processing cores, as well as memory and interconnects, onto a single piece of silicon.


Wafer Scale Engine
source @ cerebras - (https://www.cerebras.net/product-chip)

How WSEs stand against GPU-based training

Compared to GPU-based distributed training, which involves using multiple GPUs across different nodes in a cluster to train a neural network, WSEs offer several advantages. First, because all the processing cores are located on a single chip, communication between them never has to leave the silicon. This means that WSEs can achieve much higher communication bandwidth and lower latency than distributed GPU-based systems, where communication between nodes can be a bottleneck in large-scale deep learning training.
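
To illustrate why communication becomes a bottleneck on the GPU side, here is a small sketch of the traffic involved in synchronizing gradients. It assumes a ring all-reduce, one common way of combining gradients (the article itself does not name a specific algorithm), in which each of N GPUs sends about 2 * (N - 1) / N times the gradient size per step. The link bandwidth and model size are hypothetical values, and per-message latency is ignored.

```python
def ring_allreduce_bytes_per_gpu(payload_bytes: float, n_gpus: int) -> float:
    """Bytes each GPU sends in a ring all-reduce: 2 * (N - 1) / N * payload."""
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    return 2.0 * (n_gpus - 1) / n_gpus * payload_bytes


def allreduce_seconds(payload_bytes: float, n_gpus: int,
                      link_bytes_per_s: float) -> float:
    """Bandwidth-only time estimate (ignores per-message latency)."""
    return ring_allreduce_bytes_per_gpu(payload_bytes, n_gpus) / link_bytes_per_s


# ASSUMPTIONS (hypothetical): 1 billion parameters, 2-byte (fp16) gradients,
# 8 GPUs, and 25 GB/s of usable bandwidth between them.
grad_bytes = 1_000_000_000 * 2
seconds = allreduce_seconds(grad_bytes, n_gpus=8, link_bytes_per_s=25e9)
print(f"{seconds:.2f} s per gradient sync")  # 0.14 s per gradient sync
```

This cost is paid on every training step, and it grows with model size, so slower links between nodes quickly start to dominate the step time.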

Second, WSEs are much more energy-efficient than traditional GPUs because they use a custom-designed architecture that is optimized for neural network training. This means that they can achieve much higher performance per watt than traditional GPUs, which can significantly reduce the cost of training large-scale neural networks.

Wafer scale chips offer a number of advantages over traditional packaged chips, and they can help to solve several problems in the field of chip manufacturing and computing:

  1. Lower Cost: Wafer scale chips can reduce the cost of manufacturing by eliminating the need for individual packaging and assembly of each chip. This can make chip production more efficient and cost-effective.
  2. Higher Performance: Since wafer scale chips cover an entire wafer, they can provide higher performance than traditional chips. This is because wafer scale chips have far more transistors, and therefore more cores and on-chip memory available to work on a computation in parallel.
  3. Lower Power Consumption: Wafer scale chips can also be more power-efficient than traditional chips. This is because they can be designed to use less power for a given computational task.
  4. Reduced Interconnects: Wafer scale chips can reduce the number of interconnects required to connect individual chips, which can help to reduce signal delays and improve overall performance.

Overall, while distributed GPU-based training has been the dominant approach to training large-scale deep neural networks in recent years, WSEs offer a promising alternative that has the potential to significantly improve the efficiency and scalability of deep learning training. However, the technology is still relatively new and there are currently only a few companies working on developing WSEs, so it remains to be seen how widely adopted they will become in the coming years.
