Bugs Faster than the Speed of Thought

I got access to OpenAI’s GPT-3 last year, and one of the first things I did was prompt it with a C++ interface struct and have it write the implementation. I was genuinely surprised by the results. Some of the completions were even code that was clearly lifted from GitHub projects, complete with valid GitHub links. My thought was, “Wow, this would make an impressive auto-complete.” Today, GitHub released Copilot, a GPT-3-powered auto-complete feature. It’s very impressive.

Anybody who has built a production AI system knows that only about 20% of the work goes into creating the models; the scaffolding around them is the remaining 80%. I’m sure it took a lot of work to go from the GPT-3 playground to something as well integrated into an IDE as Copilot.

Being well integrated is key to Copilot’s success, and it’s going to be used by hundreds of thousands, if not millions, of programmers very quickly. Which is precisely what makes it so dangerous.

In Code Complete, Steve McConnell wrote extensively about defects in production systems. The industry-average defect rate is about 15–50 bugs per 1000 lines of code. Some techniques used by NASA can get the bug count to almost zero. Open source software likely has MORE bugs per 1000 lines of code, because most open source projects have a single developer and few extra eyeballs.
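To put those numbers in perspective, here is a quick back-of-the-envelope calculation (mine, not McConnell’s) of what that industry-average rate implies at typical project sizes:

```python
# Back-of-the-envelope arithmetic: what the industry-average rate of
# 15-50 defects per 1000 lines of code implies at different project sizes.
def expected_defects(lines_of_code: int, rate_per_kloc: float) -> float:
    return lines_of_code / 1000 * rate_per_kloc

for loc in (10_000, 100_000, 1_000_000):
    low, high = expected_defects(loc, 15), expected_defects(loc, 50)
    print(f"{loc:>9,} lines of code: roughly {low:,.0f} to {high:,.0f} defects")
```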

Copilot isn’t magic and will perform worse than a human coder on average. If it’s trained on GitHub’s gigantic corpus of 100 million projects, it will almost certainly produce more than 50 bugs per 1000 lines of code. And it is faster than copy-pasting code snippets, because Copilot auto-completes code that will likely compile and requires less human correction. All programmers understand why copy-pasting code is bad: it tends to introduce bugs. With Copilot, bugs will be transmitted faster than the speed of thought.

What could the consequences of buggy software written at a breakneck pace be? The fatal Boeing 737 MAX 8 crash involving Ethiopian Airlines in 2019 was the result of AI gone wrong: Boeing took a safety system (MCAS) that was supposed to engage only in critical situations and expanded it to noncritical ones. Black box systems kill. Imagine this for a second: building AI systems is the future of software. You will no longer write algorithms but the scaffolding of learning systems. Now imagine that scaffolding itself is written mostly by Copilot. Bugs will propagate in new ways, via systems that build systems.

Building software is building a small world. It’s about meaning, and we know GPT-3 doesn’t understand meaning. It won’t understand your problem either. When programmers get used to auto-completed code that compiles, how deeply will they dig into it? Will they review it carefully? Building human-machine interaction is hard, and you don’t want humans writing software asleep at the wheel.


Tools Were Made & Born Were Hands

If you haven’t read Richard Gabriel’s famous old-school rant Worse is Better, you should. Richard never really decided whether he was right or wrong, and even wrote a rebuttal, Worse is Better is Worse, under a pseudonym. After more than ten years of thinking about it, he finally wrote in 2000: “risk-taking, and a willingness to open one’s eyes to new possibilities and a rejection of worse-is-better make an environment where excellence is possible.” Just like Richard, I have muddled through this idea my whole career. I’ve worked on ideas with careful precision that were never fully baked and never released. I’ve come to the conclusion that ultimately what we build is sculpted by human hands, and only when it touches our hands does it take form.

This is why it’s important for a software system to touch the real world, to be delivered and placed in human hands. It turns out software isn’t valuable until it’s running. ALL software is a service, even when it doesn’t have a network connection; all software exists to serve our needs, and therefore when it isn’t running, it serves nobody. The corollary is that software can also harm, and harmful software that isn’t running is better than harmful software that runs. Our role as professionals and human beings is to build software that serves and to prevent software that harms.

Harm can be categorized in various ways, and one category is software faults. I believe we can have fault-free software by following a straightforward process: fix every bug immediately when it rears its head, ideally before the software touches the intended user’s hands. So it’s essential to have techniques that prevent bugs at the beginning, like good design, and techniques that catch them early, like Design by Contract. It’s simple but requires discipline, and it’s possible, because I’ve seen it first-hand.
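As a minimal sketch of what I mean by catching bugs early with Design by Contract (the account-transfer example here is hypothetical, purely for illustration), preconditions and postconditions can be as plain as assertions guarding a function:

```python
def transfer(source: dict, dest: dict, amount_cents: int) -> None:
    """Move money between two hypothetical account records."""
    # Preconditions: the caller must hand us a sane request.
    assert amount_cents > 0, "amount must be positive"
    assert source["balance"] >= amount_cents, "insufficient funds"

    total_before = source["balance"] + dest["balance"]

    source["balance"] -= amount_cents
    dest["balance"] += amount_cents

    # Postcondition: the transfer must conserve the total balance.
    assert source["balance"] + dest["balance"] == total_before, "money created or destroyed"
```

A violated contract fails loudly the moment the bug is introduced, long before the software reaches the user’s hands.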

Maybe this is where Richard is both right and wrong about worse is better. It’s better to deliver software because it can be formed by the world and human hands. It can start serving people and providing benefits. At the same time, you must not deliver software that has bugs because software has a tendency to spread like a virus, and small problems quickly become big ones at scale.

I believe you can have your cake and eat it too. You can release software frequently and keep it fault-free. Where I land is not Worse is Better, or Right is Better, but Serving is Better. Don’t give up producing, serving, and fixing. Don’t give up sculpting.

Deep Latent Space Maps

As I watch my children grow up, I am always amazed at their pace of learning. It turns out that by age 18 the average person knows about 60,000 words, which works out to roughly 9 new words a day. Typically these new words are learned from as little as one example of the thing or concept. Anyone working in the AI field understands that one huge drawback of the latest neural-network-based learning systems is the number of examples the system needs to learn a label: typically thousands to millions, which is often out of reach for most researchers in the field. This is why public datasets (TIMIT, MNIST, ImageNet, ActivityNet, etc.) are used for training and testing new models.

I think it’s clear to most researchers that Deep Learning has a deep problem, because these systems learn in clearly inferior ways. As I watch my children learn, I am deeply unsatisfied with current techniques and know there must be a better way. I am not the only one who feels this way, and many have started looking at older ideas in ML to combine with Deep Learning (Deep Learning itself is an old idea).

There is some exciting research coming out suggesting that grid cells represent objects across many domains: they take input from the senses and vote on the best model of the world being perceived. This research seems to partly validate Geoffrey Hinton’s idea of capsule networks, in which many capsules vote on predictions that are passed up to higher-level features. In Hinton’s case, the capsules vote on manually selected features like the position and orientation of an object. It seems we could learn the features instead, because it’s possible the brain uses grid cells to map onto features in an abstract latent space.

One promising approach to learning this mapping is to use deep learning to learn embeddings from some medium: take input data and map it to a latent space. This is called Deep Feature Extraction. Such a trained network can convert an image, sound, or text into a feature vector. The technique is useful because it can be done unsupervised, requiring only many examples of data (images, sound, text, etc.) rather than labeled ground truth. The learned feature vectors are then typically used downstream by some other technique (xgboost, SVM) to learn labels.
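As a rough sketch of what I mean (my own, using a pretrained torchvision ResNet purely as a stand-in; any embedding network would do), chopping the classifier off a pretrained model leaves a reusable feature extractor:

```python
import torch
import torch.nn as nn
from torchvision import models

# Take a pretrained ResNet and drop its final classification layer,
# leaving a network that maps an image to a 512-dimensional feature vector.
# (Newer torchvision versions prefer the weights= argument over pretrained=True.)
resnet = models.resnet18(pretrained=True)
feature_extractor = nn.Sequential(*list(resnet.children())[:-1])
feature_extractor.eval()

def embed(image_batch: torch.Tensor) -> torch.Tensor:
    """Map a batch of preprocessed images into the latent space."""
    with torch.no_grad():
        return feature_extractor(image_batch).flatten(1)  # shape: (batch, 512)

# Stand-in for a real batch of normalized 224x224 images.
features = embed(torch.randn(4, 3, 224, 224))
```

Those vectors are what a downstream SVM or gradient-boosted trees would consume, or, as sketched below, what a latent-space map would partition.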

One area I want to explore is taking Peter Gärdenfors’ idea of a Conceptual Space and combining it with deep feature extraction and Geoffrey Hinton’s idea of capsule networks. What I want to do is use deep feature extraction to produce feature vectors in latent space. Labeled data can then be used to map out the latent space over several domains using Voronoi space partitioning. You can then train many of these mappings onto the domains of your choosing and use a voting mechanism to extract the probable labels. I call this latent-space partitioning using labeled domains Deep Latent Space Maps.
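As a back-of-the-envelope sketch of the idea (my own illustration, not a finished design), a single map could be as simple as one prototype vector per label, where the implicit Voronoi cell of a prototype is the set of latent points closest to it:

```python
import numpy as np

class LatentSpaceMap:
    """One labeled partition of the latent space: each label owns a prototype,
    and the prototype's implicit Voronoi cell is every point nearest to it."""

    def __init__(self):
        self.prototypes = {}  # label -> (prototype vector, example count)

    def add_example(self, label: str, feature: np.ndarray) -> None:
        # One example is enough to place a prototype; further examples
        # re-adjust it as a running mean of the features seen so far.
        if label not in self.prototypes:
            self.prototypes[label] = (feature.astype(float).copy(), 1)
        else:
            proto, n = self.prototypes[label]
            self.prototypes[label] = (proto + (feature - proto) / (n + 1), n + 1)

    def vote(self, feature: np.ndarray) -> str:
        # The query lands in the Voronoi cell of the nearest prototype.
        return min(self.prototypes,
                   key=lambda label: np.linalg.norm(feature - self.prototypes[label][0]))
```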

The devil is in the details, but in principle, a learning system becomes the creation of these Deep Latent Space Maps. Classification is then a matter of taking the inputs through all the mappers and using the voting mechanism to extract the probable labels. In other words, you can train it with as little as one example of an object, word, or thing. As more examples are provided, the map can be re-adjusted to take in the new information. The interesting bit here is exploring Peter Gärdenfors’ idea of how learning works in our brain, which new research is increasingly validating. It just feels right.
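Continuing the sketch above (still only an illustration of the idea), training several maps, one per domain, and tallying their votes gives classification from a single example, and new examples simply re-adjust the prototypes:

```python
from collections import Counter
import numpy as np

def classify(feature: np.ndarray, maps: list) -> str:
    """Run one feature vector through every Deep Latent Space Map and
    return the label that gathers the most votes."""
    votes = Counter(m.vote(feature) for m in maps)
    return votes.most_common(1)[0][0]

# Stand-in feature vectors; in practice they come from the feature extractor above.
cat_features = np.random.randn(512)
dog_features = np.random.randn(512)

maps = [LatentSpaceMap() for _ in range(3)]  # e.g. one map per domain
for m in maps:
    m.add_example("cat", cat_features)   # a single example per concept
    m.add_example("dog", dog_features)

print(classify(cat_features + 0.1 * np.random.randn(512), maps))  # almost certainly "cat"
```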