Deep Learning; Personal Notes Part 1 Lesson 3: CNN theory.

This blog post series will be updated as I take a second pass through the fast.ai lessons. These are my personal notes: an effort to understand things clearly and explain them well. Nothing new, just livening up this blog.

Quick Dogs Vs Cats

Here is an end to end process to get a state of the art result for dogs vs. cats:

PATH = "data/dogscats/"

We assume that your data is in the data folder, but you may want to put it somewhere else. In that case, you can use a symbolic link, or symlink for short.
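
For example, a minimal sketch in Python (the source path is hypothetical; point it at wherever your dataset actually lives):

import os

# Hypothetical dataset location; "data/dogscats" becomes a pointer to it.
os.symlink('/storage/datasets/dogscats', 'data/dogscats')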

Note: We did not set precompute=True. It is a shortcut which caches some of the intermediate steps that do not have to be recalculated each time, and it can be left out. When precompute=True, data augmentation does not work.

learn.unfreeze()
learn.bn_freeze(True)
%time learn.fit([1e-5, 1e-4, 1e-2], 1, cycle_len=1)

bn_freeze — If you are using a bigger, deeper model like ResNet50 or ResNeXt101 (anything with a number bigger than 34) on a dataset that is very similar to ImageNet (i.e. side-on photos of standard objects, with sizes similar to ImageNet's, between 200 and 500 pixels), you should add this line. It causes the batch normalization moving averages not to be updated.

Using Other Libraries — Keras

Just like fast.ai sits on top of PyTorch, Keras sits on top of TensorFlow, MXNet, CNTK, etc.

You need to install Keras along with TensorFlow as a backend:

pip install tensorflow-gpu keras

Imports:

import numpy as np
from keras.preprocessing.image import ImageDataGenerator
from keras.preprocessing import image
from keras.layers import Dropout, Flatten, Dense, GlobalAveragePooling2D
from keras.applications import ResNet50
from keras.models import Model, Sequential
from keras import backend as K
from keras.applications.resnet50 import preprocess_input

Data path:

PATH = "data/dogscats/"
sz=224
batch_size=64train_data_dir = f'{PATH}train'
validation_data_dir = f'{PATH}valid'

Keras uses the idea of a train folder and a validation folder, each containing subfolders named after the labels.

Keras requires much more code and many more parameters to be set.

train_datagen = ImageDataGenerator(preprocessing_function=preprocess_input,
    shear_range=0.2, zoom_range=0.2, horizontal_flip=True)

test_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)

train_generator = train_datagen.flow_from_directory(train_data_dir,
    target_size=(sz, sz),
    batch_size=batch_size, class_mode='binary')

validation_generator = test_datagen.flow_from_directory(validation_data_dir,
    shuffle=False,
    target_size=(sz, sz),
    batch_size=batch_size, class_mode='binary')

Rather than creating a single data object, in Keras you define a DataGenerator, which specifies how to generate the data, what kind of data augmentation to do (shear_range=0.2, zoom_range=0.2, horizontal_flip=True), and what kind of normalization to do (preprocessing_function=preprocess_input). In other words, in fast.ai we can just say "whatever ResNet50 requires, just do that for me please", but in Keras you need to know what is expected; there is no standard set of augmentations.

train_generator — generates images from a directory, setting the image size, the mini-batch size, and the class mode. When training, the images are shuffled so that they are shown in a different, random order each epoch.

class_mode='binary' — if you have two possible outcomes, use 'binary'; if there are multiple, use 'categorical'.

Then create a validation data generator, validation_generator, which does not have data augmentation. Also tell it not to shuffle the dataset for validation, because otherwise you cannot keep track of how well you are doing.

Creating a model

base_model = ResNet50(weights='imagenet', include_top=False)
x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(1024, activation='relu')(x)
predictions = Dense(1, activation='sigmoid')(x)

We use ResNet50 because Keras does not have ResNet34. In Keras you cannot tell it to create a model suitable for a particular dataset; you have to do it by hand.

First create a base_model, then construct the layers you want to add on top of it (x). In this case we add three layers.

Freezing Layers and Compiling a model

model = Model(inputs=base_model.input, outputs=predictions)

for layer in base_model.layers:
    layer.trainable = False

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])

layer.trainable = False — loop through the layers and freeze them by setting .trainable = False.

Compile the model by passing the optimizer you want to use, the loss function, and the metrics.

Fitting

%%time
model.fit_generator(train_generator, train_generator.n // batch_size, epochs=3, workers=4,
validation_data=validation_generator, validation_steps=validation_generator.n // batch_size)

Call fit_generator and pass it the train_generator and the validation_generator.

Keras expects you to tell it how many batches there are per epoch, so the number of batches = size of the generator divided by the batch size. You also tell it how many epochs to run and how many workers (processes) to use.

Fine-Tuning

Unfreeze some layers, compile, then fit again.

There is no concept of layer groups, differential learning rates, or partial unfreezing; you have to print out all the layers and decide for yourself how many you want to fine-tune. We fine-tune from layer 140 onwards.

split_at = 140

for layer in model.layers[:split_at]: layer.trainable = False
for layer in model.layers[split_at:]: layer.trainable = True

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])

%%time
model.fit_generator(train_generator, train_generator.n // batch_size, epochs=1, workers=3,
    validation_data=validation_generator, validation_steps=validation_generator.n // batch_size)

After changing which layers are trainable, you have to recompile the model, then fit again.

Submitting results to Kaggle

In a Kaggle competition there is a section called Evaluation that describes how the competition will be evaluated:

For datasets where the labels are encoded as folder names, use ImageClassifierData.from_paths. If you have a CSV file with labels, use ImageClassifierData.from_csv.

To create a submission you need to use:

data.classes: contains all the different classes

data.test_ds.fnames: contains the test file names

It is a good idea to use test time augmentation (TTA). By passing is_test=True, it will give you predictions on the test set rather than the validation set.

log_preds, y = learn.TTA(is_test=True)
probs = np.exp(log_preds)

Most PyTorch models will give you back the log of the predictions, so you need to do np.exp(log_preds) to get the probability.

probs.shape  # (n_images, n_classes), i.e. breeds
(10357, 120)

To convert the matrix to the Kaggle submission format, we use a pandas DataFrame:

df = pd.DataFrame(probs)
df.columns = data.classes

Create a pandas DataFrame from the matrix (probs), and set the column names to data.classes.

df.insert(0, 'id', [o[5:-4] for o in data.test_ds.fnames])

Insert a new column at position zero, named id, that contains the file IDs. A file name looks like 'test/ab2520c527e61f197be228208af48191.jpg'; removing the first 5 and the last 4 characters gives the ID.

Saving the submission file:

SUBM = f'{PATH}/subm/'
os.makedirs(SUBM, exist_ok=True)
df.to_csv(f'{SUBM}subm.gz', compression='gzip', index=False)

Call df.to_csv to create a CSV file, compressed using compression='gzip'. This saves the file on the server. You can submit using the Kaggle CLI ($ kg submissions), or download the file to your computer and upload it to Kaggle; FileLink(f'{SUBM}subm.gz') gives you a link to download the file from the server to your computer.

Individual Prediction

Running a single image through a model to get a prediction.

Opening the first file in the validation set:

fn = data.val_ds.fnames[0]
fn
'train/000bec180eb18c7604dcecc8fe0dba07.jpg'
Image.open(PATH + fn).resize((150, 150))

Running the prediction:

trn_tfms, val_tfms = tfms_from_model(arch, sz)
im = val_tfms(open_image(PATH + fn)) # open_image() returns numpy.ndarray
preds = learn.predict_array(im[None])
np.argmax(preds)

The image must be transformed. tfms_from_model returns training transforms and validation transforms; in this case, we use the validation transforms.

Everything that gets passed to or returned from a model is generally assumed to be a mini-batch. Here we only have one image, but we have to turn it into a mini-batch of a single image. In other words, we need to create a tensor that is not just [rows, columns, channels] but [number of images, rows, columns, channels]. Indexing into an array with im[None] adds an additional unit axis at the start, turning it from an image into a mini-batch of one image.
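
A quick NumPy sketch of what that unit axis does (the shapes are illustrative):

import numpy as np

im = np.zeros((224, 224, 3))  # a single image: [rows, columns, channels]
batch = im[None]              # same as im[np.newaxis]
batch.shape                   # (1, 224, 224, 3): a mini-batch of one image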

Theory On Convolutional Neural Network

Here is a convolutional neural network visualisation video by Otavio Good, creator of Word Lens.

Watch the video

When you input an image, the computer sees it as numbers (pixels).

Input Layer

This number 7 image is from the MNIST database, and we assume you are using a pre-trained model for the classification.

Hidden Layer 1

A hidden layer transforms the inputs to discern more complex features from the data, so that the output layer can make a better assessment.

We apply a filter/kernel, usually 3x3, that detects horizontal edges. Note the 1's on top, 0's in the middle, and -1's at the bottom of kernel A:

1, 1, 1
0, 0, 0
-1, -1, -1

When the filter is multiplied with a patch of the input, high pixel values under the 1's contribute large numbers, while values under the 0's and -1's contribute almost nothing or negatively. This gives us the output of the first convolution, called an activation: a number calculated by taking some numbers from the input and applying a linear operation (a convolutional kernel) to produce an output.

Conv1 shows the activations of both filters after taking each 3x3 section of the input and multiplying it by the convolutional kernel.

We assume the network is trained; at the end of training, it has learned convolutional filter A, which has 9 numbers in it.

A convolution is an operation where we take a little matrix (nearly always 3x3 in deep learning), multiply every element of that matrix by the corresponding element of a 3x3 section of an image, and add them all together to get the result of the convolution at one point.
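
Here is a minimal NumPy sketch of that operation, using kernel A from above on a made-up 7x7 single-channel image:

import numpy as np

kernel = np.array([[ 1,  1,  1],
                   [ 0,  0,  0],
                   [-1, -1, -1]])  # kernel A: horizontal edge detector

img = np.random.rand(7, 7)  # a toy single-channel "image"

out = np.zeros((5, 5))  # one activation per 3x3 patch position
for i in range(5):
    for j in range(5):
        out[i, j] = (img[i:i+3, j:j+3] * kernel).sum()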

Let's apply a second convolutional filter, B, which also has 9 numbers. Filter B finds vertical edges and outputs them as a hidden layer:

1, 0, -1
1, 0, -1
1, 0, -1

PyTorch does not store these as two separate 9-digit arrays; it stores them as a tensor, which is a multidimensional array. The additional axis allows us to stack the filters together. "Filter" and "kernel" mean the same thing: the 3x3 matrix.

How the convolution works: an activation (here, -3) equals the sum of the element-wise product of the kernel and the input patch.

The hidden layer (conv1) has a depth of two, as it has 2 filters/kernels.

Next we apply another filter, C, in hidden layer 2; it is a 2x3x3 kernel (one 3x3 slice per channel of conv1). We then apply a rectified linear unit (ReLU) to eliminate the negatives, creating a second hidden layer, conv2.

Rectified Linear Unit (ReLU) — a non-linearity that throws away the negatives (replaces them with zero).
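
In code, ReLU is just a clamp at zero:

import numpy as np

def relu(x):
    return np.maximum(0, x)  # negatives become 0, positives pass through

relu(np.array([-3.0, 2.0]))  # array([0., 2.])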

Architecture

An architecture describes things like how big your kernel is at layer one, how many filters are in your kernel at layer one, and so on. Here we have 3x3 kernels in layer one and layer two: the architecture starts out with two 3x3 convolutional kernels, and the second layer has two 2x3x3 kernels.

Max pooling

convolved feature vs pooled feature

Max pooling replaces each block of the output matrix, e.g. each 2x2 block, with its maximum value. Note: the blocks are non-overlapping, so max pooling reduces the resolution.
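
A quick NumPy sketch of non-overlapping 2x2 max pooling on a 4x4 activation map:

import numpy as np

act = np.arange(16.0).reshape(4, 4)  # a 4x4 activation map
pooled = act.reshape(2, 2, 2, 2).max(axis=(1, 3))  # each 2x2 block -> its max; result is 2x2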

Below is the max pooled layer:

The result of the max pool gets fed into dense weights: a fully connected layer.

A fully connected layer takes every single max-pooled activation, gives each one a weight, and computes the sum-product of the max-pooled activations and the weights across the two levels of the 3-dimensional tensor to output a dense activation.

This is different from a convolution: in a convolution, we go through the activations a few at a time using 3x3 kernels, but in a fully connected layer we create a big weight matrix covering the entire input.
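
A tiny numeric sketch of that sum-product (the shapes are illustrative, matching two 2x2 max-pooled filter outputs):

import numpy as np

pooled = np.random.rand(2, 2, 2)   # max-pooled tensor: 2 filters, each 2x2
weights = np.random.rand(2, 2, 2)  # one weight per activation, same shape

dense_activation = (pooled * weights).sum()  # a single sum-product over everything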

Architectures that make heavy use of fully connected layers can have a lot of weights, and can therefore have trouble with overfitting and be slow. For example, VGG (which has up to 19 layers) contains a fully connected layer with 4096 weights connected to 4096 hidden layer activations; multiplied out (4096 x 4096 x number of kernels), that is almost 300 million weights.

What if we had 3 channels in the input; what would the shape of the filter be? The filters would work just like filter C, which operates on conv1's 2 channels: each filter gets one 3x3 slice per input channel, so a first-layer filter on a 3-channel image would be 3x3x3.

Rather than starting out with carefully designed filters, when training from scratch we start out with random numbers and use stochastic gradient descent to improve those numbers and make them less random.

In practice, we have to calculate more than one number: for the ten digits, we would have ten dense activations.

Softmax

Predicting if an image is a cat, dog, plane, fish or building.

Out of the fully connected layer we get 5 numbers. Note that in the last layer there is no ReLU, so we can have negatives. We want to turn the 5 numbers into probabilities from 0 to 1, one for each of cat, dog, plane, fish, and building; each probability should be between 0 and 1, and all of them should sum to 1. To do that, we use an activation function: a function applied to activations, which takes in one number and spits out another number.

We need to stack linear and non-linear functions together to do deep learning. The non-linearity we use after each hidden layer is a ReLU. An activation function is a non-linear function.

The softmax activation occurs in the final layer; softmax spits out numbers between 0 and 1 that also sum up to 1.

To make softmax work, we need to get rid of all the negatives, so we use exponents. Logarithms and exponents appear a lot in machine learning and deep learning.

We take e^(-0.36) to get the exp of the first output, then add up the exps of all the outputs to get 51.31. To get the softmax for each class, we divide its exp by 51.31; for the first class this gives 0.01. The softmax values always add up to 1.
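
The same calculation in NumPy (the output values are made up, apart from the -0.36 used above):

import numpy as np

output = np.array([-0.36, 2.5, 1.2, 0.3, 3.9])  # final-layer activations, no ReLU
exps = np.exp(output)        # gets rid of the negatives
softmax = exps / exps.sum()  # each value is between 0 and 1, and they sum to 1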

A characteristic of exp is that if one output is slightly bigger than another, its exp tends to be much bigger. As a result, softmax tends to pick one thing strongly, allocating a high probability to a single class.

What kind of activation function do we use if we want to classify a picture as both cat and dog? Softmax does not like predicting multiple things; it wants to pick one thing. One reason we might want to predict multiple things is multi-label classification.

Planet Competition: Understanding the Amazon from Space

For the cats vs. dogs competition (single-label classification), the image is either a cat or a dog, never neither or both. For the satellite competition, the images are classified by weather (haze, clear), agriculture, primary rainforest, water (river), and so on. In this case we need to predict multiple things, and softmax would not be great because it wants to pick one thing.

Anthropomorphizing your activation functions (giving them personalities): softmax tends to want to pick one particular thing.

Fast.ai library will automatically switch into multi-label mode if there is more than one label. So you do not have to do anything. But here is what happens behind the scene:

from planet import f2

metrics = [f2]
f_model = resnet34

label_csv = f'{PATH}train_v2.csv'
n = len(list(open(label_csv))) - 1
val_idxs = get_cv_idxs(n)

def get_data(sz):
    tfms = tfms_from_model(f_model, sz, aug_tfms=transforms_top_down, max_zoom=1.05)
    return ImageClassifierData.from_csv(PATH, 'train-jpg', label_csv, tfms=tfms,
                                        suffix='.jpg', val_idxs=val_idxs, test_name='test-jpg')

data = get_data(256)  # gets images of 256x256

We use from_csv since multi-label classification cannot be done with the Keras-style approach where the subfolder name is the label.

transforms_top_down: it does more than just a vertical flip. There are 8 possible symmetries for a square: it can be rotated through 0, 90, 180, or 270 degrees, and for each of those it can be flipped (the dihedral group of order eight).

x, y = next(iter(data.val_dl))

Using ds (dataset), we can get an individual image.

dl is a data loader, which will give you a mini-batch, specifically a transformed mini-batch. With a data loader, you cannot ask for a particular mini-batch; you can only get the next one. In Python, this is called a "generator" or "iterator".

PyTorch really leverages modern Python methodologies. If you know Python well, PyTorch comes very naturally; if you don't, PyTorch is a good reason to learn Python well.

x is a mini-batch of images, y is a mini-batch of labels. Printing y:

This gives you a tensor of 64 (the default batch size) by 17, the number of possible classes.

Let's look at the labels of image 0:

list(zip(data.classes, y[0]))

[('agriculture', 1.0),
('artisinal_mine', 0.0),
('bare_ground', 0.0),
('blooming', 0.0),
('blow_down', 0.0),
('clear', 1.0),
('cloudy', 0.0),
('conventional_mine', 0.0),
('cultivation', 0.0),
('habitation', 0.0),
('haze', 0.0),
('partly_cloudy', 0.0),
('primary', 1.0),
('road', 0.0),
('selective_logging', 0.0),
('slash_burn', 1.0),
('water', 1.0)]

It's agriculture, clear, primary, slash_burn, and water.

Behind the scenes, PyTorch and fast.ai turn the labels into one-hot-encoded labels. If the actual label is dog, it will look like this:

We take the softmax and compare it with the actuals to make a prediction. The differences between the actuals and the softmax outputs, summed up together, give us the error, i.e. the loss function.

One-hot-encoding is terribly inefficient to store, so we store an index value (a single integer) rather than 0's and 1's for the target value (y). If you look at the y values for the dog breeds competition, you won't actually see a big list of 1's and 0's; you will see a single integer.

PyTorch internally converts the index into a one-hot-encoded vector (even though you will literally never see it). PyTorch has different loss functions for targets that are one-hot encoded and for those that are not, but these details are hidden by the fast.ai library, so you do not have to worry about them. The cool thing to realize is that we are doing exactly the same thing for both single-label and multi-label classification.
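
As a rough PyTorch illustration of the two target conventions (a sketch, not the fast.ai internals):

import torch
import torch.nn as nn

logits = torch.randn(4, 5)  # a mini-batch of 4 items, 5 classes

# Single-label: the target is one class index per item; one-hot happens internally.
single = nn.CrossEntropyLoss()(logits, torch.tensor([2, 0, 4, 1]))

# Multi-label: the target is a float vector of 0s and 1s per item.
multi = nn.BCEWithLogitsLoss()(logits, torch.randint(0, 2, (4, 5)).float())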

Does it make sense to change the base of the log for softmax? No; changing the base is just a linear scaling, which a neural net can learn easily:

Note: images are just matrices of numbers. The image was washed out, so to make it more visible ("brighten it up a bit") we multiply it by 1.4.

It is good to experiment images like this because these images are not at all like ImageNet. The vast majority of things you do involving convolutional neural net will not actually be anything like ImageNet they will be medical imaging, classifying different kinds of steel tube, satellite images, etc

sz = 64
data = get_data(sz)
data = data.resize(int(sz*1.3), 'tmp')

We would not use sz=64 for the cats and dogs competition, because we start with a pre-trained ImageNet network that is nearly perfect to begin with. If we retrained the whole network with 64 by 64 images, we would destroy weights that are already very good. Remember, most ImageNet models are trained with 224 by 224 or 299 by 299 images.

There are no images in ImageNet that look like the satellite image above. Only the early layers of an ImageNet model are useful here: the first layer that finds edges and gradients, and the second layer that finds textures and repeating patterns.

Starting out by training with small images works well for satellite images.

learn = ConvLearner.pretrained(f_model, data, metrics=metrics)
lrf = learn.lr_find()
learn.sched.plot()

Fitting:

[lr/9, lr/3, lr] is used because the images are unlike ImageNet images, and the earlier layers are probably not as close to what they need to be.
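
The fit step looks something like this (a sketch; the exact learning rate comes from the finder plot above, and learn is the learner created earlier):

lr = 0.2
lrs = np.array([lr/9, lr/3, lr])  # one learning rate per layer group

learn.fit(lrs, 3, cycle_len=1, cycle_mult=2)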

Plotting the loss:

Let's double the image size to 128 first and fit, then double again to 256; this gets us to 93.5% accuracy:
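
The pattern is roughly this (a sketch of the resize-and-retrain loop, reusing get_data and lrs from above):

sz = 128
learn.set_data(get_data(sz))  # swap in the larger images, keeping the weights
learn.freeze()
learn.fit(lrs, 3, cycle_len=1, cycle_mult=2)
learn.unfreeze()
learn.fit(lrs, 3, cycle_len=1, cycle_mult=2)
# ...then repeat with sz = 256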

Test time augmentation gets us to 93.6%:

data = data.resize(int(sz*1.3), 'tmp')

What does this do?

When we specify what transforms to apply, we pass a size:

tfms = tfms_from_model(f_model, sz,
aug_tfms=transforms_top_down, max_zoom=1.05)

One thing the data loader does is resize the images on demand. This has nothing to do with data.resize. If the initial input image is 1000 by 1000, reading that JPEG and resizing it to 64 by 64 takes more time than training the convolutional net does.

data.resize tells it that we will not use images bigger than sz*1.3, so it goes through the dataset once and creates new JPEGs of that size: rectangular JPEGs whose smallest edge is sz*1.3, center-cropped. This works like the batch scripts that loop through and resize images, and it will save you a lot of time. data.resize is a speed-up convenience function.

metrics=[f2]

Instead of accuracy, we use F-beta; it is a way of weighing false negatives and false positives. The reason we use it is that this particular Kaggle competition wants it. Take a look at planet.py to see how you can create your own metrics function using scikit-learn. This is what gets printed out at the end of every epoch: [0. 0.08932 0.08218 0.9324]
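
A metric along the lines of planet.py's f2 (a sketch, assuming scikit-learn's fbeta_score; it searches a few thresholds and keeps the best score):

import numpy as np
import warnings
from sklearn.metrics import fbeta_score

def f2(preds, targs, start=0.17, end=0.24, step=0.01):
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')  # silence undefined-F-score warnings
        return max(fbeta_score(targs, preds > th, beta=2, average='samples')
                   for th in np.arange(start, end, step))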

Activation function for multi-label classification

Fast.ai checks the labels to see whether any image has more than one label and picks the activation function automatically.

The activation function for multi-label classification is the sigmoid.

To find the sigmoid, take the exp and divide it by 1 + exp, i.e.:

0.01/(1 + 0.01) = 0.01
0.08/(1 + 0.08) = 0.07

This enables multiple things to be predicted at once. If an output is less than 0, its sigmoid is less than 0.5; if it is more than 0, its sigmoid is greater than 0.5.
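
A one-line NumPy version, checked against the arithmetic above (the inputs are the logits whose exps are roughly 0.01 and 0.08):

import numpy as np

def sigmoid(x):
    return np.exp(x) / (1 + np.exp(x))  # equivalently 1 / (1 + np.exp(-x))

sigmoid(np.array([-4.6, -2.53]))  # approx. [0.01, 0.07]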

Why don't we start training with differential learning rates rather than training the last layers alone? You can skip training just the last layer and go straight to differential learning rates, but you probably do not want to. The convolutional layers all contain pre-trained weights, so they are not random; for things that are close to ImageNet they are really good, and for things that are not close to ImageNet they are better than nothing. All of our fully connected layers, however, are totally random. Therefore, you would always want to make the fully connected weights better than random by training them a bit first. Otherwise, if you go straight to unfreeze, you will be fiddling around with those early layer weights while the later ones are still random, which is probably not what you want.

When you unfreeze, what are you trying to change? The kernels (filters) and weights. What training means is setting the filters and the dense weights: the dense weights belong to the fully connected layers, and the filter/kernel weights belong to the convolutions. Activations, on the other hand, are calculated from the weights and the previous layer's activations or inputs.

How do you get the 64 by 64 image size? This depends on the transforms. By default, our transform takes the smallest edge, zooms, and picks a centre crop; but when using data augmentation, it picks the crop randomly.

Accuracy is not what the model tries to optimise; it optimises the loss function, e.g. cross-entropy. The metric is just what is printed out for us to see what is going on.

When you use differential learning rates, do the three learning rates spread evenly across the layers? The fast.ai library has a concept of "layer groups". In something like ResNet50 there are hundreds of layers, and you probably do not want to write hundreds of learning rates, so the library decides for you how to split them: the last learning rate always refers to just the fully connected layers that we randomly initialized and added, and the remaining learning rates are split halfway through the other layers.

Visualizing the layers

learn.summary()

[('Conv2d-1',
  OrderedDict([('input_shape', [-1, 3, 64, 64]),
               ('output_shape', [-1, 64, 32, 32]),
               ('trainable', False),
               ('nb_params', 9408)])),
 ('BatchNorm2d-2',
  OrderedDict([('input_shape', [-1, 64, 32, 32]),
               ('output_shape', [-1, 64, 32, 32]),
               ('trainable', False),
               ('nb_params', 128)])),
 ('ReLU-3',
  OrderedDict([('input_shape', [-1, 64, 32, 32]),
               ('output_shape', [-1, 64, 32, 32]),
               ('nb_params', 0)])),
 ('MaxPool2d-4',
  OrderedDict([('input_shape', [-1, 64, 32, 32]),
               ('output_shape', [-1, 64, 16, 16]),
               ('nb_params', 0)])),
 ('Conv2d-5',
  OrderedDict([('input_shape', [-1, 64, 16, 16]),
               ('output_shape', [-1, 64, 16, 16]),
               ('trainable', False),
               ('nb_params', 36864)]))

...

Conv2d-1 is the name of the layer.

'input_shape', [-1, 3, 64, 64] — PyTorch lists the channels (3) before the image size (64, 64); some GPU computations run faster with that ordering. This is a 4-dimensional mini-batch.

-1 means "however big the batch size is", which can change; Keras uses None for the same purpose.

'output_shape', [-1, 64, 32, 32] — 64 is the number of kernels, and 32 by 32 is the output size: this convolution uses a stride of 2, which, like max pooling, changes the size.

The learning rate finder returned strange numbers for a very small dataset, and the plot was empty. The learning rate finder goes through one mini-batch at a time; if you have a tiny dataset, there simply are not enough mini-batches. The trick is to make your batch size very small.

Structured and Time Series Data

There are two types of dataset.

Unstructured — audio, images, or natural language text, where all of the things inside an object are the same kind of thing: pixels, amplitudes of a waveform, or words.

Structured — a profit and loss statement, or information about a Facebook user, where each column is structurally different. "Structured" refers to columnar data, as you might find in a database or a spreadsheet, where different columns represent different kinds of things and each row represents an observation.

Structured data is often ignored in academia because it is pretty hard to get published in fancy conference proceedings if all you have is a better logistics model. But it makes the world go round; it makes everybody money and efficiency. We will not ignore it, because we are doing practical deep learning, and Kaggle does not ignore it either: people put prize money up on Kaggle to solve real-world problems.

The motivation behind exploring this architecture is its relevance to real-world applications. Most data used for day-to-day decision making in industry is structured or time series data. We will explore the end-to-end process of using neural networks on practical structured data problems.

Examples:

Predict sales for a large grocery chain (the Ecuadorian grocery competition mentioned below).

Forecast sales using store, promotion, and competitor data (the Rossmann competition below).

Rossmann Store Sales

Imports:

from fastai.structured import *
from fastai.column_data import *
np.set_printoptions(threshold=50, edgeitems=20)

PATH='data/rossmann/'

fastai.structured — not PyTorch-specific; it is also used in the machine learning course for random forests, with no PyTorch at all. It can be used on its own, without any of the other parts of the fast.ai library.

fastai.column_data — allows us to do fast.ai and PyTorch work with columnar structured data.

For structured data, we need to use pandas a lot. If you are not familiar with pandas, Python for Data Analysis, a book by the pandas author, is a good resource.

Create datasets

In addition to the provided data, we will use external datasets put together by participants in the Kaggle competition, for example Google Trends and weather data. You can download all of them here.

There is a lot of data pre-processing; this notebook contains the entire pipeline from the third-place winner (Entity Embeddings of Categorical Variables). Data processing is not covered in this course, but it is covered in the machine learning course in some detail, because feature engineering is very important.

Looking at CSV files:

StoreType — you often get datasets where some columns contain a "code". It really does not matter what the code means; stay away from learning too much about it and see what the data says first. Later, you can look at the data dictionary that is often provided with the data.

Joining tables

This is a relational dataset, and you have to join quite a few tables together, which is easy to do with pandas using merge:
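
The notebook defines a small helper in this spirit (a sketch; train and store are illustrative names for the loaded tables):

def join_df(left, right, left_on, right_on=None, suffix='_y'):
    # Left join so no rows from the left table are lost;
    # the suffix disambiguates duplicate column names.
    if right_on is None: right_on = left_on
    return left.merge(right, how='left', left_on=left_on, right_on=right_on,
                      suffixes=('', suffix))

joined = join_df(train, store, 'Store')  # e.g. attach store metadata to each sales row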

fast.ai also provides the following:

add_datepart(train, "Date", drop=False)

add_datepart takes a date and pulls out a bunch of columns such as "day of week", "start of a quarter", "month of year", and so on, and adds them all to the dataset.
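
The idea, in plain pandas (a minimal sketch of a few of the derived columns; add_datepart produces many more):

import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2015-07-31', '2015-08-01'])})
df['Year'] = df.Date.dt.year
df['Month'] = df.Date.dt.month
df['Dayofweek'] = df.Date.dt.dayofweek
df['Is_quarter_start'] = df.Date.dt.is_quarter_start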

The durations section of the notebook calculates things like how long until the next holiday, how long it has been since the last holiday, and so on.

Saving the whole structured file that contains the data:

joined.to_feather(f'{PATH}joined')

to_feather saves a pandas DataFrame in the "feather" format, which takes the data as it sits in RAM and dumps it to disk, so it is really, really fast. The Ecuadorian grocery competition has 350 million records, so you will care about how long saving takes; with feather it takes about 6 seconds.

The data contains how many units of a particular item were sold on a particular date in a particular store. The goal is to predict how many units will be sold in a given store on a future date.

Next Lesson

We split columns into two types: categorical and continuous.

Categorical variable: store_id 1 and store_id 2 are not numerically related to each other; they are categories. The same goes for days of the week: Monday (day 0), Tuesday (day 1), etc. Categorical data will be one-hot encoded.

Continuous variable: things like the distance in kilometers to the nearest competitor are numbers we treat numerically. Continuous data gets fed into the fully connected layer as-is.
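
For instance, with plain pandas (a sketch; fast.ai handles this for you):

import pandas as pd

df = pd.DataFrame({'store_id': [1, 2, 1],               # categorical
                   'competition_km': [0.5, 3.2, 0.5]})  # continuous

df = pd.get_dummies(df, columns=['store_id'])  # one-hot the categorical column only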

steps:

  1. Create a validation set.
  2. ColumnarModelData.from_data_frame is how we load column data; the basic API concepts are the same as for image recognition.
  3. .get_learner
  4. .lr_find() to find our best learning rate and plot it.
  5. .fit with a metric
  6. .fit with a cycle_len

Thanks for reading! Follow @itsmuriuki.

Back to learning!