Wednesday la Deuxième

Ryan Harvey
5 min read · Feb 10, 2021

So we are here in week 2 of the machine learning quarantine, and I'm here to report my progress to date.

In my previous entry I discussed the problems of balancing the data, and I was under the potentially naive impression that if I gave it more data, or added some random noise, I could fix the over-fitting problem.

This morning I implemented a random noise adder which adds random noise every time a given sample is called. I went through several versions of this implementation, eventually settling on the one that worked and was fast.

So what options did we go through?

I thought it would be wise to add the noise to a whole mega-batch before it was broken into batches and processed, my thinking being that lots of small matrix operations would be slower than one big one. Unfortunately this has its problems. My first approach was to use an add-Gaussian-noise class in a torchvision Compose pipeline.
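That pattern looks roughly like this (a minimal sketch of a custom transform dropped into Compose; the class name and noise level are illustrative, not the original code):

```python
import torch
from torchvision import transforms

class AddGaussianNoise:
    """Add zero-mean Gaussian noise to a single image tensor."""
    def __init__(self, std=0.1):
        self.std = std

    def __call__(self, img):
        # img is expected to be a float tensor in [0, 1] (i.e. after ToTensor)
        return (img + torch.randn_like(img) * self.std).clamp(0.0, 1.0)

# The transform runs once per image inside the pipeline
transform = transforms.Compose([
    transforms.ToTensor(),
    AddGaussianNoise(std=0.1),
])
```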

This implementation still required iterating over the whole dataset as individual image tensors, which I wasn't about; it seemed slow.

So then I thought I’d generate and add random noise to the numpy array form of the Xy array before it got put into tensor format. This was slow and riddled with issues.
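The shape of that attempt was something like this (a sketch, assuming the mega-batch sits in a single uint8 numpy array; the function name and sigma are mine):

```python
import numpy as np

def add_noise_numpy(X, sigma=10.0):
    # X: the whole mega-batch as a uint8 numpy array, e.g. shape (5000, 3, 300, 400).
    # np.random.normal returns float64, so X + noise gets promoted to float64,
    # roughly eight times the memory footprint of the uint8 original.
    noise = np.random.normal(0.0, sigma, size=X.shape)
    return np.clip(X + noise, 0, 255).astype(np.uint8)
```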


Now the problem here is that the X tensor is of shape (5000,3,300,400) and numpy will not allow us to take more than 32 dimensions.

I tried swapping the 5000 and the 3 so I could perform the same operation with just 3 input dimensions with a sneaky np.reshape(3,5000,300,400). It wasn’t meant to be.
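For what it's worth, if the goal is just to swap those two axes, np.transpose (or np.swapaxes) is the operation that actually does it; reshape with the same numbers keeps the original memory order and quietly mixes pixels from different images. A quick sketch:

```python
import numpy as np

# Small stand-in for the real (5000, 3, 300, 400) array
X = np.arange(5 * 3 * 4 * 4, dtype=np.uint8).reshape(5, 3, 4, 4)

scrambled = X.reshape(3, 5, 4, 4)        # same memory order, axes NOT swapped
channel_first = X.transpose(1, 0, 2, 3)  # genuinely moves the channel axis to the front
also_works = np.swapaxes(X, 0, 1)        # equivalent to the transpose above

print(np.array_equal(scrambled, channel_first))  # False: reshape is not a swap
```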

The X = X + noise step was asking for 13.8 GB of RAM which, while I have 16 GB, seemed excessive given the X array should be about 1 GB. Why it would take so much RAM I do not know.

There was a great deal of messing around with getting the thing to actually display an image after I added noise.

I then tried the same process but after the data was batched.

This had greater success but it doubled the time taken to perform the loop so I wasn’t keen. The hunt went on.

Finally, after building and rebuilding the solution several times I settled on the following implementation:
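In outline, it looks something like this (a minimal sketch; the function name, noise level and exact ordering are illustrative rather than the code verbatim):

```python
import torch

def noisy_batch(X, std=10.0):
    # X: float tensor of shape (batch, 3, 300, 400) with pixel values in 0-255
    pre_transform = X.clone()           # kept purely so I could eyeball it while debugging
    X = X + torch.randn_like(X) * std   # one big tensor op for the whole batch
    X = torch.clip(X, 0, 255)           # PyTorch's clip, same behaviour as numpy.clip
    post_transform = X.clone()          # the "after" picture for debugging
    return X / 255                      # cheap normalization to [0, 1]
```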

Honestly, this one is the solution I understand the least, but it’s fast and it does the thing so I don’t care.

Note the clip step, a convenient little function in PyTorch which does the exact same thing as numpy.clip. It's just nice to have.

The pre and post transform variables are there so I could debug, which was useful.

X/255 was just a normalization step. My previous, more legitimate normalization step was causing issues when I tried to pass 4D tensors into the data transformation method, so I binned it and did it this way. It adds about 5 seconds onto the 40-second mega_batch time, which I can cope with.

So that was fun.

The next thing I wanted to do was add the option to balance the data.

Admittedly, my implementation here isn't very modular, but whatever: it works, it's fast, and it's easy to read.

I wanted the net to be trained on perfectly balanced data — just to see what would happen. I’ve been working with batches of 10 — there are 9 classes so training on a batch of 9 with one example of each class in the batch seemed perfectly reasonable. SO WHAT DO???

Well, the obvious goal was to take a very large vector of mixed classes of arbitrary distribution and turn it into N arrays of equal distribution.

The first step here was to find the class with the fewest instances. This would of course be our limiting factor when it came time to make our batches. We can only have N batches where N is the number of instances in our least frequent class.

To achieve that, we need to separate the big list into 9 small lists. I wrote the following function to do that:
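Something in this shape (a sketch; the signature is illustrative, since the point is just the split-by-label idea):

```python
def separator(samples, labels, n_classes=9):
    # Build one empty list per class (there may well be a tidier way to do this bit)
    class_lists = []
    for _ in range(n_classes):
        class_lists.append([])
    # Drop each sample into the list that matches its label
    for sample, label in zip(samples, labels):
        class_lists[int(label)].append(sample)
    return class_lists
```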

I’m actually quite happy with this function. There’s probably a tidier way to make a list of empty lists, but it does get the process done in a fairly succinct way.

Anyway, separator splits the data into 9 lists according to the Y values.

Now that we have a list of lists according to class, we can find the shortest list:
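Along these lines (again a sketch, with an illustrative name):

```python
def short_list_finder(list_of_lists):
    # Return the index of the shortest inner list
    shortest_index = 0
    for i, inner_list in enumerate(list_of_lists):
        if len(inner_list) < len(list_of_lists[shortest_index]):
            shortest_index = i
    return shortest_index
```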

Short list finder does what it says on the tin — it finds the index of the shortest list in a list of lists. I couldn’t find a more succinct way of doing that.

Finally — here’s how the two functions play nicely together with another for loop to produce a nice batch:

The first for loop there just concatenates three large files into one, so that we collate 15,000 samples rather than 5,000. This is to reduce the number of wasted images: one might expect larger imbalances in smaller files, so I just used more data for the mega_batch_array.

We are seeing pretty serious data losses, although conveniently, while we always show the same samples from the smallest class, we randomly shuffle the remaining classes each epoch, so there should be some benefit there.

The thing is training as we speak; it's not looking amazing tbh, but we'll see. This is a much more difficult problem than the previous one, and ShuffleNet has been pretty good up to now.

