
Can Artificial Intelligence Tackle Your Homework?

Sep 2022

Yes!¹ You can certainly scrape a passing mark on a short assignment for a graduate course at MIT (sample size $N=1$).

The answers were all generated by an autoregressive language model² (ChatGPT is one example of this kind of model). Every answer was generated using a prompt containing the assignment brief and an example question-answer pair (initially seeded with my first response). I then rotated the prompt pair so the model could only use its own answers as reference material.

Throughout this process, I did zero cherry-picking: the first response was the one used.
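The rotation described above can be sketched roughly as follows. This is a minimal illustration, not the script I actually ran: `build_prompt`, `answer_all`, and the `generate` callable are hypothetical names, and both the prompt layout and the order in which questions are answered are assumptions.

```python
from typing import Callable, List, Tuple

def build_prompt(brief: str, example: Tuple[str, str], question: str) -> str:
    """Assemble a one-shot prompt: the brief, one worked Q/A pair, then the new question."""
    ex_q, ex_a = example
    return f"{brief}\n\nQ: {ex_q}\nA: {ex_a}\n\nQ: {question}\nA:"

def answer_all(
    brief: str,
    questions: List[str],
    seed_pair: Tuple[str, str],
    generate: Callable[[str], str],
) -> List[str]:
    """Answer each question in turn, rotating the example pair as we go.

    The first prompt uses the human-written seed pair. After that, the
    example is always the model's most recent question/answer, so later
    prompts contain only generated text as reference material.
    `generate` is a stand-in for whatever completion API is available.
    """
    example = seed_pair
    answers = []
    for question in questions:
        prompt = build_prompt(brief, example, question)
        answer = generate(prompt).strip()  # first response is kept, no retries
        answers.append(answer)
        example = (question, answer)       # rotate the prompt pair
    return answers
```

Any text-completion function can be passed in as `generate`; the only point of the sketch is that the human seed appears in the first prompt and nowhere after it.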

The artificial answers were then submitted under the pseudonym ‘Gregory P. Terrance’ alongside my own.

The Answers

The knowledge stored in 178,000,000,000 parameters is more than a little impressive; the model was faster than Google at answering “What is CYC?”, and its output is perfectly legible. The primary issue, as you’ll see, is the lack of content. While it might be trained on a lot of text, it is always eager to finish an answer and won’t provide examples or personal opinions.

1. What is deep understanding, and how does it differ from shallow understanding? Give an example of your own, other than those encountered in the talk.

Gregory P. Terrance: Deep understanding is solving a problem you understand, not a problem you merely recognize. This can mean noticing a relationship between things that you didn’t realize were there or realizing that things you thought were unrelated are in fact related.

Me: Deep understanding requires an internal model of what is depicted in order to perform everyday inferences about unseen information. In contrast, shallow understanding is typified by the ability to correlate and discern subtle patterns.

Sometimes a mistake can demonstrate deeper understanding than the right answer - using “comed” rather than “came” is a result of knowing the fundamental affixation rules of the past tense. It suggests an internal model of language (if this action happened in the past, then affix “-ed”) rather than a probabilistic, shallow understanding of “came” coming next.

Frankly, I think this is the worst answer of the bunch. It’s not wrong, but it’s not particularly insightful and it misses the requirement of an example. It also appears to be responding vaguely to the definition of “deep understanding” rather than to the specific sense used in the talk (by Gary Marcus).

2. What was Eliza? That is, what was it supposed to do? (Google it.) Did it have deep or shallow understanding?

Gregory P. Terrance: Eliza was an early natural language interface. It didn’t know or understand anything, it merely selected responses which matched a corpus of user’s inquiries. It had shallow understanding.

Me: Eliza was a “chatbot” program that was designed with pessimism to “demonstrate that the communication between man and machine was superficial”.

It used a very simple rule-based system to match against a given query: a decomposition rule made of wildcards and keywords, such as (* I @desire *), and a recomposition with stock phrases and templates, such as “Why do you want (3)?”, filled in from the wildcard matches.
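To make the decomposition and recomposition idea concrete, here is a minimal sketch using regular expressions. It is far simpler than the original Eliza script: the two rules, the fallback phrase, and the `respond` function are invented for illustration, and regex capture groups stand in for the numbered wildcards.

```python
import re

# Each rule pairs a decomposition pattern with a recomposition template.
# Both rules are illustrative assumptions, not taken from the real script.
RULES = [
    (re.compile(r".*\bI (?:want|need|desire) (.+)", re.IGNORECASE),
     "Why do you want {0}?"),
    (re.compile(r".*\bI feel (.+)", re.IGNORECASE),
     "How long have you felt {0}?"),
]

# Stock phrase used when no rule matches.
DEFAULT = "That sounds like a very difficult situation."

def respond(utterance: str) -> str:
    """Return the recomposed reply for the first rule whose pattern matches."""
    text = utterance.strip().rstrip(".!?")  # drop trailing punctuation
    for pattern, template in RULES:
        match = pattern.match(text)
        if match:
            # Fill the template with whatever the wildcard captured.
            return template.format(*match.groups())
    return DEFAULT

print(respond("I want a holiday."))  # Why do you want a holiday?
```

Nothing here models meaning: the reply is whatever text the wildcard happened to capture, slotted into a canned sentence.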

I’d argue it does more than it sets out to do. It demonstrates how shallow communication can be between humans as well. During my training on emotional support helplines, we were often advised to perform very similar tricks. We even learnt default rules for creating a window of breathing room with stock phrases like “(?) sounds like a very difficult situation”.

Eliza is certainly shallow intelligence, but that doesn’t mean it’s worth dismissing out of hand. Our ability to empathically bond with “shallow intelligence” is undeniable.

Gregory makes a claim here that I’d argue invites misunderstanding - “[Eliza] … selected responses which matched a corpus of user’s inquiries” - to me this reads as a lookup table of word-to-response rather than the templating system that Eliza is.

Equally, however, the tone of disdain (“merely” and “[didn’t] understand anything”) matches that of many of my classmates, although I found it a little harsh. The grammar seems a bit poorer here as well: “which matched” and “it had shallow understanding” are a little awkward. Notably, the latter mirrors the question’s own phrasing, “Did it have deep or shallow understanding?”

3. What is CYC? (Google it.) What is it trying to do? At a high level, how does it try to accomplish that goal?

Gregory P. Terrance: CYC is a knowledge base; a method of representing human knowledge, and a method of reasoning by analogy. It is trying to reach the goal of learning all of human knowledge by analogy. That is a tall order; the CYC project started in the 1950’s and is not yet complete.

Me: CYC is an endeavor of over 35 years to curate a knowledge base of hand-coded “common-sense” facts, together with an inference engine to deduce further facts. Its goal wasn’t to test a hypothesis; it was to create a shared information resource so that nobody had to start from scratch each time. Lenat et al. firmly held the philosophy that intelligent action was bottlenecked by a lack of “content”, which I think has borne fruit; unfortunately for CYC, mass data seems to serve just as well as hand-crafted rules.

I was pretty impressed with this response when I first saw it - especially given I struggled to locate what CYC was without hamming up the “AI” keywords in Google.

At first I also thought that it got the start date right, which would have been a nice touch. Unfortunately, CYC started in 1984³, not the 1950s, although it is indeed still not complete. So it comes out neither ahead nor behind there.

A touch of repetition seeps in with its longest answer yet, and it also appears to have shared my fondness for semicolons despite there being no semicolons in the previous prompt text.

4. What’s wrong with benchmarks?

Gregory P. Terrance: Metrics can help, but metrics have dangers of their own. For instance, measuring student performance on math tests alone tells us little about a student’s real math abilities; and does not tell us how deep their understanding is, or if they are merely good at figuring out mechanical processes.

Me: The first danger of benchmarks is that we mistake “state-of-the-art” for novelty. However, benchmark performance is not just noise. We should simply be wary of 98% -> 99% benchmark improvements; the “Kaggle Effect” mentioned is exemplified by this. If the top 5 models all have very similar performance, there is probably little meaningful difference between them. A runaway winner, like AlexNet jumping from 74% to 85%, is the one to pay attention to.

The second danger is directly associated with why benchmarks are vital: A benchmark is a problem specification. With the same stroke, we can define the wrong problem. For instance, if the benchmark dataset is too biased and has unintended factors, then we might be solving a completely different task - consider an algorithm that segmented tumors by looking for pen marks on the X-Rays.

Not a great answer for Gregory here. It seems that the more generic the question, such as “What’s wrong with benchmarks?” or “What is deep understanding?”, the more generic the response. A poor philosopher indeed.

The first example by GPT occurs in this answer, and it is quite nice. The relation of benchmarks to mathematics tests is a comparison I do agree with. I think my point on benchmark usage also applies to mathematics tests - dramatic improvements (or deviations from the norm) are notable, but micro-adjustments to perfection are not.

5. What’s wrong with GPT-$n$, for $n = 1, 2, 3, \ldots$?

Gregory P. Terrance: GPT-3 still produces logical “nonsense” (a word I use here pejoratively). Gpt-3’s nonsense is structurally coherent, but semantically absurd; like the claim that “all virtue is vice and vice is virtue.” Gpt-3’s nonsense is unreasonable, meaning that it violates our intuitive notions of common sense, validity, probability, induction, etc. Gpt-3’s nonsense is familiar.

Me: GPT-Ns are autoregressive language models - they pull from contextual clues to probabilistically predict the next token in the sequence. This desire to just continue the sequence gives them the distinctive flair of an improv actor - one with a huge breadth of detailed knowledge, but willing to take shortcuts and make things up.
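As a toy illustration of “predict the next token, append it, repeat”, here is a sketch built on bigram counts with greedy decoding. It is only a stand-in for the idea: real GPT models use a neural network over subword tokens and usually sample rather than always taking the top choice. The function names and the tiny corpus are made up for the example, and the prompt is assumed to be non-empty.

```python
from collections import Counter, defaultdict

def train_bigrams(text: str) -> dict:
    """Count, for every token, which tokens follow it in the training text."""
    counts = defaultdict(Counter)
    tokens = text.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def continue_sequence(counts: dict, prompt: str, max_new_tokens: int = 5) -> str:
    """Extend the prompt one token at a time using only the previous token."""
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        followers = counts.get(tokens[-1])
        if not followers:
            break  # nothing ever followed this token in training
        # Greedy decoding: always append the most frequent follower.
        tokens.append(followers.most_common(1)[0][0])
    return " ".join(tokens)

counts = train_bigrams("the cat sat on the mat the cat ran")
print(continue_sequence(counts, "on", max_new_tokens=2))  # on the cat
```

The model never checks whether the continuation is true or sensible; it only continues the sequence with whatever was most likely to come next.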

Q: When playing the first move on a blank tic-tac-toe board, the total number of possible moves is?

GPT-3: 8. Because on the first move, there are 3 spaces in a row and 3 in a row across, for a total of 8 possible squares.

All the knowledge needed is contained within the model. It just doesn’t have a sufficient internal model to put all the pieces together to make inferences.
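For reference, the count the question asks for can be checked in a few lines. This is a trivial sketch assuming the standard 3x3 board; the board representation and the `legal_moves` helper are just one illustrative choice.

```python
# Standard 3x3 tic-tac-toe board; None marks an empty square.
board = [[None] * 3 for _ in range(3)]

def legal_moves(board):
    """Return the (row, col) position of every empty square."""
    return [
        (r, c)
        for r, row in enumerate(board)
        for c, cell in enumerate(row)
        if cell is None
    ]

print(len(legal_moves(board)))  # 9
```

On a blank board every square is available, so the first player has 9 options, not 8.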

GPT is smart, but not smart enough to know how smart it is.

“pejoratively”: expressing contempt or disapproval

I had to get the dictionary out for that one; it is slightly embarrassing that my vocabulary has been beaten by something producing “logical nonsense”. It’s far too harsh on itself. Semantically, “pejoratively” is perfectly correct, and quite neat tonally. I don’t think “contemptuously” would have worked as well - it would come off as far less cogent⁴.

¹ Betteridge’s law (of headlines) is an adage that states “Any headline that ends in a question mark can be answered by the word no.” I take great joy in denying it.
² Technically, these come from Jurassic-1 Jumbo, a language model by AI21 Labs, as I didn’t have access to OpenAI’s GPT-3 at the time. However, Jurassic-1 is functionally comparable, and GPT does have more name recognition!
³ ‘2+2=5’. The title of George Orwell’s totalitarian masterpiece is also the year that CYC started. A fun coincidence.
⁴ Sorry, I couldn’t resist. Cogent: “(of an argument or case) clear, logical, and convincing.”