Pruning
Pruning is an advanced optimization technique for Large Language Models (LLMs). We identify the weights in the model that contribute very little to the final output and remove them, which reduces the overall model size and the compute cost of inference. A more aggressive form of pruning removes entire layers of the neural network that have little to no influence on the model's response. This approach is highly advanced and requires a deep understanding of the model, because a single wrong step can severely degrade the quality of the model's responses.
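
As a rough illustration of the weight-level variant, here is a minimal sketch in plain Python. It assumes that a weight's absolute value is a reasonable proxy for how much it contributes to the output; the text above does not name a specific criterion, so this is only one common choice. The function name `magnitude_prune`, the `sparsity` parameter, and the toy `layer` matrix are hypothetical and exist only for this example.

```python
def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude entries set to zero.

    weights:  a 2D list of floats standing in for one layer's weight matrix.
    sparsity: fraction of weights to remove, between 0.0 and 1.0 (assumed knob).
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0.0 and 1.0")

    # Record every weight's magnitude together with its position.
    ranked = [
        (abs(w), i, j)
        for i, row in enumerate(weights)
        for j, w in enumerate(row)
    ]
    ranked.sort()  # smallest magnitudes first

    # Number of weights to zero out.
    k = int(len(ranked) * sparsity)

    # Work on a copy so the original weights stay untouched.
    pruned = [list(row) for row in weights]
    for _, i, j in ranked[:k]:
        pruned[i][j] = 0.0
    return pruned


# Toy 2x3 "layer" (made-up numbers, not taken from any real model).
layer = [
    [0.80, -0.02, 0.45],
    [0.01, -0.60, 0.03],
]

pruned_layer = magnitude_prune(layer, sparsity=0.5)
print(pruned_layer)  # [[0.8, 0.0, 0.45], [0.0, -0.6, 0.0]]
```

Zeroing weights like this does not shrink the model or speed up inference by itself; those gains appear only when the pruned weights are stored in a sparse format or, as with layer removal, dropped from the network entirely. In practice the pruned model is also re-evaluated after each step, since removing too much can noticeably hurt response quality.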