Computer Vision Application
You can find the code here on GitHub.
This last month I started interviewing with the computer vision company Alegion and felt a little out of my depth. My experience with computer vision is next to nil; I'd only really been exposed to it in a Udemy course that walked me through classifying images as containing either cats or dogs. I ended up with some code that did an okay job, but I didn't really understand how to translate that to video data.
So I went looking for other resources to learn more. I started with Analytics Vidhya, and it was a perfect stepping stone and introduction to the topic. The tutorial covered reading (small) video files, extracting the frames, loading the pretrained VGG16 model, feeding our data into it, and creating predictions for a separate video file.
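For context, the core of that approach looks roughly like the sketch below, as I remember it. The file name and the one-frame-per-second sampling are stand-ins rather than the tutorial's exact values:

```python
import cv2
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

# "sample_clip.mp4" is a placeholder; the tutorial works with short clips.
cap = cv2.VideoCapture("sample_clip.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)

frames, count = [], 0
while True:
    ret, frame = cap.read()
    if not ret:
        break
    if count % int(fps) == 0:                           # keep roughly one frame per second
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # cv2 reads BGR
        frames.append(cv2.resize(frame, (224, 224)))    # VGG16 input size
    count += 1
cap.release()

X = preprocess_input(np.array(frames, dtype="float32"))

# Pretrained convolutional base; its features feed a small classifier on top.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
features = base.predict(X).reshape(len(frames), -1)
```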
At this point I focused in on what I wanted to do with computer vision. My brother and I had been following Overwatch League for the last few years and had recently started getting together and watching the games. A constant source of annoyance for us has been how terrible the casters are at filling time. (This was especially insulting because the Starcraft casters were soooo much better.) I figured, what the hell, let’s try to apply computer vision to this problem. Let’s skip the casters filling time and any down time between matches.
I followed Vidhya's tutorial pretty closely and wanted to move on to the next logical stage of my project: writing out a video file. I got a little stuck here, not realizing that cv2 couldn't write the audio and I needed a different package, but after some false starts I found moviepy. Moviepy seemed like the perfect tool: not as complicated as ffmpeg, but still able to handle the tasks I was looking for. The library home page includes a gallery of example scripts, and I found this example, which was exactly what I needed. I followed along with their example and then moved on to write my own. I ended up with a result that cut out about 20 minutes of video. I was pleased with my results, but there were still some serious issues and misclassifications.
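My script isn't identical to the gallery example, but the basic shape of the moviepy step is this. The filename and the spans below are placeholders, and keep_spans stands in for whatever stretches the predictions say are gameplay:

```python
from moviepy.editor import VideoFileClip, concatenate_videoclips

full = VideoFileClip("owl_broadcast.mp4")   # placeholder filename

# keep_spans is built from the per-frame predictions: a list of
# (start_seconds, end_seconds) stretches the model calls gameplay.
keep_spans = [(310, 845), (900, 1730)]      # illustrative values only

clips = [full.subclip(start, end) for start, end in keep_spans]
final = concatenate_videoclips(clips)

# Unlike cv2's VideoWriter, this carries the audio track along.
final.write_videofile("gameplay_only.mp4")
full.close()
```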
Above is a short video of the intro. Why did it keep all that? Why does it randomly skip, only to land back on the casters talking? Overall, I was pleased I made something with computer vision, but it wasn't very good. I went back to the drawing board and decided to drop the pretrained VGG16 model for a self-built model following the default approach set out in the Udemy course linked above. The model improved pretty drastically. In the video below, I show a clip where it is working pretty well, but we are skipping some of the start of the gameplay chunks (and the end of some gameplay chunks).
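The self-built model is nothing fancy; it's roughly the small convnet pattern that the cats-vs-dogs course teaches. The layer sizes below are illustrative rather than my exact architecture:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential([
    Conv2D(32, (3, 3), activation="relu", input_shape=(224, 224, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation="relu"),
    Dropout(0.5),
    Dense(1, activation="sigmoid"),   # gameplay vs. not-gameplay
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```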
So, I added a window of 30 seconds around each clip (sketched below). My next goal was to improve the amount of time it took to make predictions. My current code was based on the Analytics Vidhya example, which used 5-minute clips. All told, that ended up with 300 frames held in memory, while my average clip length is somewhere around 2 hours with something like 7,000 frames, which is back-breakingly slow. I knew I could do something with a data generator and flow_from_directory, but I didn't have any concrete examples to walk through.
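Before getting to the generator, the 30-second window itself is nothing fancy: it just pads each kept span before it goes to moviepy and merges any spans that end up overlapping. The helper below is my own sketch of that step; the function name and the (start, end) span format are how I'm illustrating it, not code from the tutorial:

```python
def pad_spans(spans, pad=30, video_length=None):
    """Widen each (start, end) span by `pad` seconds and merge any overlaps."""
    padded = []
    for start, end in sorted(spans):
        start = max(0, start - pad)
        end = end + pad if video_length is None else min(video_length, end + pad)
        if padded and start <= padded[-1][1]:
            # This span now overlaps the previous one, so merge them.
            padded[-1] = (padded[-1][0], max(padded[-1][1], end))
        else:
            padded.append((start, end))
    return padded
```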
I followed the Keras documentation for flow_from_directory but ended up tripping over myself in a couple of different places. Firstly, I couldn't get my generator to find any pictures, because I hadn't read enough of the documentation to know that image_generator.flow_from_directory still expects the images to be in a subdirectory even when they have no assigned class. It was an easy enough fix once I figured it out, but all of a sudden I was getting garbage predictions that didn't make any sense. I kept comparing it to my prior memory-intensive method and couldn't understand what was happening. After some hair pulling and a lot of wasted backtracking, I ran test_generator.filenames, and uh-oh, I saw the problem (and also: duh!). Of course it reads them in the order 0, 1, 10, 100, 1000, 2, 20… I needed to pad my numbers. I followed a Stack Exchange article to fix that problem and updated the code that cuts up the frames accordingly, so I won't have to do it again.
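Putting those two fixes together, the prediction-side generator ends up looking something like this (the directory and file names are placeholders):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

image_generator = ImageDataGenerator(rescale=1.0 / 255)

# Even with no classes, flow_from_directory expects a parent folder that
# contains a subdirectory of images, so the frames live in frames/all/ here.
test_generator = image_generator.flow_from_directory(
    "frames",               # the parent directory, not the image folder itself
    target_size=(224, 224),
    class_mode=None,        # no labels, we only want predictions
    shuffle=False,          # keep the frames in filename order
    batch_size=32,
)

# Sanity check the ordering: this is where 0, 1, 10, 100, 1000, 2, ... shows
# up if the frame numbers aren't zero-padded when they're written out,
# e.g. cv2.imwrite(f"frames/all/frame_{i:06d}.jpg", frame).
print(test_generator.filenames[:5])

predictions = model.predict(test_generator)
```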
Success! My output video had maintained its quality, and the whole process now ran much faster. In the above clip you can really see that 30-second window before and after gameplay. The two major speedbumps left are training the model, something I have to redo each time since I keep adding to the training and validation data, and writing out the final video. I think I can get faster at writing out the final video using a callout to the command-line ffmpeg, but I have set that aside for the time being.
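For what it's worth, that ffmpeg callout would probably look something like the sketch below, cutting each kept span with a stream copy (the concatenation step would follow). I haven't actually wired this in yet, and the filename and spans are placeholders:

```python
import subprocess

# Stream-copy each kept span to its own file; -c copy skips re-encoding,
# which is the part that should make this faster than moviepy's writer.
keep_spans = [(310, 845), (900, 1730)]   # illustrative values only
for i, (start, end) in enumerate(keep_spans):
    subprocess.run([
        "ffmpeg", "-y",
        "-ss", str(start),           # seek to the start of the span
        "-i", "broadcast.mp4",
        "-t", str(end - start),      # duration of the span
        "-c", "copy",
        f"segment_{i:03d}.mp4",
    ], check=True)
```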
Instead, my next steps for this project are to:
1) Make a callout to youtube-dl for the video file I want to predict on, so I can keep it all in this one notebook.
2) Continue to add more training data. All of this data has been from week 15 of the regular season, and there could very well be different indicators of “not gameplay” in earlier weeks or years of the league.
3) Add in some sort of transition between cuts. As it stands right now, all of the cuts are hard and don’t look the best. I looked around and it seems that moviepy should be able to do it, but my one attempt thus far didn’t knit together all the way (there’s a rough sketch of what I’m aiming for after this list). This is probably my top priority even though it’s number three on the list.
4) Add weights to the data. Gameplay tends to happen a little less than twice as often as “not gameplay,” so the model should weight correct predictions of “not gameplay” higher (also sketched after this list).
5) Play around with both my smoothing function and the window of time I keep on either side of a clip. Just aesthetically, I bet I can make the clips feel smoother with a little more time spent tuning those values.
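For the transitions in item 3, the pattern I plan to try next is moviepy's crossfade, roughly as below. This is what the moviepy 1.x docs describe rather than something I've gotten working yet, and the filenames are placeholders:

```python
from moviepy.editor import VideoFileClip, concatenate_videoclips

fade = 1  # crossfade duration in seconds
clips = [VideoFileClip(name) for name in ["seg_000.mp4", "seg_001.mp4", "seg_002.mp4"]]

# Fade each clip after the first in over `fade` seconds, and overlap the
# clips by the same amount so the fades blend instead of hard-cutting.
smoothed = [clips[0]] + [clip.crossfadein(fade) for clip in clips[1:]]
final = concatenate_videoclips(smoothed, method="compose", padding=-fade)
final.write_videofile("smoothed.mp4")
```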
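For the class weights in item 4, Keras makes the weighting itself straightforward once the ratio is measured. The sketch below assumes train_generator and validation_generator already exist, and the class indices and the 1.8 figure are placeholders until I actually check the balance:

```python
# Class indices depend on how flow_from_directory orders the folder names;
# here I'm assuming 0 = gameplay and 1 = not-gameplay.
class_weight = {
    0: 1.0,   # gameplay (the more common class)
    1: 1.8,   # not-gameplay, up-weighted because it shows up less often
}

model.fit(
    train_generator,
    validation_data=validation_generator,
    epochs=10,
    class_weight=class_weight,
)
```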