Building a deep learning model has always been a daunting topic for me. There is always so much to do. Before you even start training, a lot has to be decided, from determining the number of input and output units to choosing the activation and optimization functions, and worst of all, when and where to do what. In classical machine learning things are much simpler, all thanks to the scikit-learn library: after some pre-processing of the data, all you have to do is create the learner, fit it to the training data in one step, make predictions and measure the accuracy. In three or four cheerfully short and clean steps, you’re done with a simple machine learning model. Unfortunately, things are not that simple when building deep learning models. Of course, there is a reason to choose this computationally expensive, complex architecture: the nonlinearity in your data and target can be too complex for a classical ML algorithm to learn. So I have tried to outline the general steps in deep learning that every practitioner goes through, regardless of architecture. This is, of course, a general way of explaining things, but sometimes a road map really helps an absolute beginner. I have summarized the highlights of this article in a chart at the end, so let’s get started.
Get your data
As a beginner, I’ve mostly relied on Kaggle for deep learning data, and the best thing is that you can read the notebooks others have submitted if you’re not sure how to approach the data. But I would urge you to try it yourself first and keep other people’s notebooks as a last resort, for when you can’t make head or tail of the dataset or want to compare your work. If you are more familiar with machine learning (which you probably are), try working with a familiar dataset at first, so you don’t have to worry about how to pre-process the data. The UCI Machine Learning Repository also offers some fantastic datasets, and GitHub is regularly updated with good ones too.
Data Cleaning and Analysis
The main cleanup involves dealing with missing values. If your dataset is fairly large, say over 10,000 examples, then as a beginner I would suggest you simply delete the examples with missing data, because the imputation techniques used to fill these gaps can sometimes mess up the training process later. But if you must keep them, you can fill in the mean, or the mode if the column is categorical; and if the columns are correlated, you can even use machine learning to estimate the missing value. Your next hit would be outliers in the data. Some professionals prefer to “fix” the value: simply put, if a value lies more than 1.5 times the Inter Quartile Range (IQR) below the first quartile, they raise it to that lower fence, and if it lies more than 1.5 times the IQR above the third quartile, it is clipped to the upper fence. But I would ask you to tread carefully here. A lousy data entry operator might have added an extra ‘0’ to the price of Dizzyland’s 4BHK house and taken it to INR 5600K while the 3BHK in the neighborhood sold for just INR 450K; on the other hand, a remedy that barely lengthened Mr. Gupta’s life by 3 weeks may genuinely have given Mr. Shen 4 more years, a fact and not a data entry error. In short, spotting outliers may not be a simple task and in most cases requires some domain knowledge. So if your dataset is too noisy and that is hindering learning, I would ask you to set it aside, try a new one and revisit it later. Data analysis is also important, as you may want to drop some unnecessary columns to make learning easier. For example, if a Starbucks 250 yards from a house doesn’t contribute much to its price, that column can safely be ignored. Finally, as with machine learning, this is ultimately a numbers game, so you will need to transform all your categorical data into numbers and also normalize and scale your data, and you’re free to use your old friend scikit-learn for the job. There may also be some application-specific preprocessing, such as padding or trimming text data, if you’re building a natural language processing application.
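To make this concrete, here is a minimal preprocessing sketch using pandas and scikit-learn. The file name and the `price` and `city` columns are hypothetical placeholders, and the choices (dropping rows, IQR clipping, one-hot encoding, standard scaling) are just one reasonable combination of the steps described above, not the only way to do it.

```python
# Minimal preprocessing sketch (column names and file are hypothetical).
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("houses.csv")

# 1. Missing values: with a large dataset, simply drop incomplete rows.
df = df.dropna()

# 2. Outliers: clip a numeric column to the 1.5 * IQR fences.
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
df["price"] = df["price"].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# 3. Categorical columns become numbers (one-hot encoding here).
df = pd.get_dummies(df, columns=["city"])

# 4. Normalize / scale the features so no column dominates training.
features = df.drop(columns=["price"])
X = StandardScaler().fit_transform(features)
y = df["price"].values
```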
Create the Network
The simplest model will have an input layer, a hidden layer and an output layer. Since a hidden layer by itself is linear, you should of course add an activation to introduce nonlinearity into your model. Also, say you are working with a Multi Layer Perceptron (MLP): the number of hidden layers and the number of hidden units in each is another hyperparameter to be decided, and there is no general rule. You should get your hands dirty, try, fail and try again; like a new teacher, your first few days teaching your model will not be a pretty process, but it gets easier with time. RNNs and CNNs will mostly be used as intermediate layers if your data is text or images, and their output will mostly be fed to an MLP output layer that tells whether the image was of a dog or a cat, or whether a certain piece of speech was offensive. Your output layer will have as many units as there are classes in a classification dataset, or just one in case of regression.
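As a sketch of that simplest possible network, here is an MLP in PyTorch. The sizes (20 input features, 64 hidden units, 3 output classes) are placeholder assumptions; swap in whatever your dataset needs.

```python
# A minimal MLP: input layer -> hidden layer + activation -> output layer.
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),   # input features -> hidden units
    nn.ReLU(),           # activation adds the nonlinearity
    nn.Linear(64, 3),    # hidden units -> one output per class
)
```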
Fix hyperparameters
Even if you are intuitive enough not to struggle with the intermediate layers of your neural network, there are still a few parameters that need to be coded. First, of course, is the learning rate; as a golden rule, start with a small number, as most AI and DL practitioners suggest. Next, decide on the optimizer to use to drive down the empirical error on your dataset; choose from the Gradient Descent, RMSProp or Adam family of algorithms. Then comes parameter initialization, in simple words, initializing the weights and biases of each individual layer. You won’t really have to think about whether to fix the W matrix to 0.00001 for all values in the first layer and keep the bias at zero; the designer angels of the deep learning libraries have it covered, with off-the-shelf functions in popular frameworks like Keras and PyTorch. Typically, if you are using a sigmoid-type activation in a layer, use Xavier (Glorot) initialization, and if your choice is ReLU or one of its variants, choose one of the He initializations. In short, the choice of initialization depends on the activation function you use, so that a sigmoidal curve does not degenerate into a straight line because of badly initialized W and b. Then choose a loss function: for starters, MSE for regression and Cross Entropy for classification are quite good. Next, consider the batch size. After much trial and error, the minibatch approach to optimizing the loss function has emerged as the generally accepted way of training a model. A very popular choice is 64, but don’t be afraid to try 32 or 128 or whatever you prefer. Last but not least, specify the number of epochs to train on your data. An epoch is completed when the model has trained on every example in the dataset once, so you are deciding how many times to show the dataset to your model. Unlike a human pupil, this artificial pupil will only improve up to a limit; the number can be anywhere from 25 to 250, or even more or less than that, so again there is no general rule here.
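Here is how those choices might look in PyTorch, continuing the tiny MLP from the previous sketch. The specific values (learning rate 1e-3, batch size 64, 50 epochs, He initialization with ReLU) are illustrative assumptions, not rules.

```python
# Sketch of the hyperparameter choices discussed above.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))

# He (Kaiming) initialization pairs well with ReLU; use Xavier/Glorot for sigmoid or tanh.
for layer in model:
    if isinstance(layer, nn.Linear):
        nn.init.kaiming_uniform_(layer.weight, nonlinearity="relu")
        nn.init.zeros_(layer.bias)

criterion = nn.CrossEntropyLoss()                          # Cross Entropy for classification
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # start with a small learning rate

batch_size = 64   # try 32 or 128 as well
num_epochs = 50   # no general rule; watch the validation accuracy
```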
Start training
If you are using PyTorch, you will need to batch the train and test datasets with data loaders. Next, you should set the model to training mode. While the vanishing gradient problem is largely addressed by modern activations and initializations, you might still want to define a function to take care of exploding gradients; for example, you can go through my transformer model and jump to the training section, where a gradient clipping function is defined. Track the model’s accuracy per epoch to see whether it has started to overfit (which is also a way to find out how many epochs are enough for your cyber learner).
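A bare-bones training loop along these lines might look like the following. It assumes the `model`, `criterion`, `optimizer` and `num_epochs` from the previous sketch and the preprocessed `X`, `y` arrays from the earlier one; the clipping threshold of 1.0 is just an example.

```python
# Bare-bones PyTorch training loop with gradient clipping.
import torch
from torch.utils.data import DataLoader, TensorDataset

train_ds = TensorDataset(torch.tensor(X, dtype=torch.float32),
                         torch.tensor(y, dtype=torch.long))
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)  # batch the data

for epoch in range(num_epochs):
    model.train()                          # set the training mode
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        # keep exploding gradients in check
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
```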
Now cross your fingers, compile and run the model. Are you happy with the accuracy? If so, congratulate yourself and go get some fresh air. If not, don’t worry, you’re not alone. Let’s explore a little deeper.
An overfitting model : Time to tidy up your model. Add some dropout in the intermediate layers and some weight decay to make it harder for the model to simply memorize the training data. Both are hyperparameters your layers and optimizer already support; you just need to pass values as parameters where you define them (see the library documentation and the sketch after this list).
An underfitting model : Add one or more intermediate layers, or increase the number of hidden units in each.
Neither of the above : Well, there may be no single explainable reason for this. If you are using a new dataset, check whether the data can be pre-processed better, for example by rescaling or handling outliers more carefully. Also try changing the batch size and the learning rate. This, to be honest, cannot be prescribed well, so just revisit all the previous steps and see what changes can be made. You have to go slower here; please do not make all the changes at the same time.
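For the overfitting case above, here is a small sketch of both fixes in PyTorch: a dropout layer between the hidden and output layers, and weight decay passed to the optimizer. The dropout probability of 0.3 and weight decay of 1e-4 are placeholder values to tune, not recommendations.

```python
# Two quick anti-overfitting tweaks: dropout and weight decay.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.3),   # randomly zeroes 30% of hidden activations during training
    nn.Linear(64, 3),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```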
While an entire book could be written on this topic, I have tried to add my two cents. I hope it helps you form a clearer picture of the whole process.