(UBC ENPH 353 Course Report)
Keywords: Robot Operating System (ROS), Keras, OpenCV
Hello all. Welcome. This is half a log book for the ENPH 353 course. It will focus on my contribution to the project (the computer vision portion). Shoutout to Gosha Maruzhenko for providing navigation and making sure we don’t hit any pedestrians.
This is a brand-new course at UBC! So firstly I would like to give special thanks to Miti and Griffin at UBC for setting everything up. I have learned a lot in this course.
The goal of ENPH 353 is to design a robot to navigate virtual environments and read license plates using machine learning. Fancy. Stay on the road, don’t hit the pedestrians, you know the drill.
Apparently UBC Engineering Physics and MIT share a few course designs (ENPH 253, for example). For those who are familiar with MIT’s “Duckietown”, our task is quite similar to that. We even use the same framework to control our robots (ROS). The difference is that our track is not physical, and is instead modelled in a simulator called Gazebo, which integrates nicely with ROS.
This is the course we must navigate. We are restricted to only two methods of interfacing with the robot:
- The camera feed
- Twist commands (move forward, move backwards, turn left, turn right)
With these two I/Os, we must design a robot which will accurately report license plates and their location through a ROS message.
I know what you may be thinking: since this is in virtual space, can’t you just hardcode motion into the robot, select only a certain area of the screen after a timer goes off, etc., etc.?
The answer: yes, I guess you could do that. But why would we? We are here to learn, dammit. So if you ask yourself “why didn’t they just use this simpler exploit given to them by the nature of the simulation?”, the answer is most likely “for the sake of knowledge and art”.
With that said, like literally every other team I’ve talked to, we did end up resorting to a few cheap tricks. The plates are found with a somewhat selective color mask that is restricted to a certain field of view. I’m not proud of it, but sometimes you just need to plop in the ugly solution and get er done.
Yes, one could set up a full CNN plate reader which scans the whole image for characters, such as this beauty. It would indeed be dope; however, in terms of training time such a model is far from ideal, and considering our short time frame it’s a high-risk strategy to rely on one single complex method to do everything for us. Furthermore, since the plates are all the same size and color, it is extremely attractive to break this into two separate systems: one that finds the license plate and another that tells us what is on it.
We had originally planned for the plate detector to be another object detection neural net. Due to time constraints we moved to an OpenCV color mask instead. It’s really nothing impressive so I’d say you can just skip this section.
Once the color mask was chosen, we passed it through an opening in OpenCV (that is, erosion and then dilation) to remove the random white pixels here and there. We then passed that through findContours and filtered the results so that only rectangular shapes with a minimum bounding height and width were included.
The problem here was that, for some reason, OpenCV also counted the road features as white things with four edges (see figure 2). This was a simple fix: filter for the purple in the image, once again find contours, and then scale the filled bounding box of the purple so that the white license plate is always covered. Using that bounding box as another bitwise-and mask, we have the final product, which is effective and fast:
Observe that it is not perfect. Because the license plate is not included in our color mask (only the “true” white is), the bounding box has simply been stretched a little. This works for head-on angles, but it leads to imperfection when the plate is read at an angle.
Lab 3 of this course was building a CNN to read the characters of a virtual license plate. However, since the images were perfect (no skew or color deformation), and we can generate a lot of them, it achieved 100% accuracy within the first epoch.
In the real competition, running through Gazebo, where there is skew and color deformation due to lighting and camera effects, the task is not as simple.
To generate data, we employed two methods:
- Create a generator python script, which would generate plates and artificially skew them in front of collected backgrounds. This had the advantage of knowing the corner locations and text for an arbitrarily huge number of images, but does it map to the real virtual world?
- Collect real-camera data through the use of a bash script. Said script spawned the robot in a specific location, turned it ever so slightly, killed the Gazebo model, and then did the whole thing over again. This has the advantage that it is real-world data, but is it comprehensive? Unfortunately it also neglected to kill the xterm keyboard controller, so after running it overnight my computer looked like this:
- Of course, there was also the option of manually driving around, gathering and labelling data. Sounds like a bad idea, but once you realize that you could have over 1000 images of juicy REAL data in under 3 hours, it becomes pretty appealing.
We elected to use methods 1 and 3. At the end of this process, we had over 700 simulated license plates and over 1000 real license plates. Example Data is shown below:
Both types of data were unskewed and separated into individual characters (see the python notebook below) so they could be fed into the neural network. Each 40×80 character image looked like this:
Keras Neural Network
This is the python notebook housing the neural network. Most of it is just piping the data into the correct form (unskewing the simulated data and cropping, the result being what you saw above). But in the end we were left with this model and an accuracy of over 95% on real data:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d_4 (Conv2D)            (None, 76, 36, 32)        832
max_pooling2d_3 (MaxPooling2 (None, 38, 18, 32)        0
conv2d_5 (Conv2D)            (None, 36, 16, 32)        9248
max_pooling2d_4 (MaxPooling2 (None, 18, 8, 32)         0
conv2d_6 (Conv2D)            (None, 16, 6, 32)         9248
max_pooling2d_5 (MaxPooling2 (None, 8, 3, 32)          0
conv2d_7 (Conv2D)            (None, 6, 1, 32)          9248
flatten_1 (Flatten)          (None, 192)               0
dense_2 (Dense)              (None, 60)                11580
dense_3 (Dense)              (None, 36)                2196
=================================================================
Total params: 42,352
Trainable params: 42,352
Non-trainable params: 0
_________________________________________________________________
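For reference, that summary corresponds to a model along these lines. Only the layer shapes and parameter counts come from the table; the activations, optimizer, and loss are my assumptions about a typical setup:

```python
# Reconstruction of the CNN from the summary: 80x40 grayscale character in,
# 36 classes (A-Z, 0-9) out. Activations/optimizer are assumed, not stated.
from tensorflow.keras import layers, models


def build_model(num_classes=36):
    model = models.Sequential([
        layers.Conv2D(32, (5, 5), activation="relu", input_shape=(80, 40, 1)),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.Flatten(),
        layers.Dense(60, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

A sanity check: the parameter counts in the summary (832 for the first conv, 9248 for each 3×3 conv, 42,352 total) all fall out of these filter sizes on a single-channel input.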
That’s all, Folks
There you are! It works. It is accurate enough.
I realize this report is quite surface-level, so if you have any questions, please reach out! Though I truthfully don’t think there is anything too new to share here. But, here is the takeaway: this is a course for engineers. The main point here wasn’t creating a fantastic neural network and showing off technical prowess, it was creating something that works well in a short amount of time. This is NOT a research course. It is a project course, and therein lies the difference (in contrast to my opening statements about the art and beauty of our creation).
Also, remember it really doesn’t matter how fantastic your model is if: 1. You have shitty data and 2. your model is not a realistic depiction of the real world. I suppose this is the truth for any model you create, not just ML models (a la Nassim Taleb). I’ve talked to students who have run over 200 epochs on their model and generated 5-10x the images I have, for perhaps only a marginal increase in accuracy. That, I would say, is the power of having real data. You just can’t beat it.
These are some of the cool things I have found while doing research, hopefully they will aid in your next computer vision project: