Sitemap

Can an AI be a data scientist?

8 min readDec 6, 2022

OpenAI’s ChatGPT is blowing data scientists' minds. Could it steal their job?

OpenAI ChatGPT

Recently OpenAI released its latest model the chatbot: ChatGPT.

In this article, we discuss this model, its limitations and merits, and do a little experiment.

ChatGPT in a Nutshell

ChatGPT is basically a language model capable of dialogue. As its name implies it is based on the GPT 3.5 family (an evolution of GPT3 that was trained on both text and code). ChatGPT is derived from InstructGPT, another OpenAI work to create a model capable of following user instructions using reinforcement learning. One of the strengths is that it has been trained on an impressive amount of text, allowing the model to be flexible for many use cases.

ChatGPT has excited the community because it is currently the best chatbot out there and you can experiment with it.

What can it do?

  • Interacting with ChatGPT seems like a much smoother and more human discussion than other chatbots so far.
  • GPT3 can write poetry, and explanations, and provide code. Moreover, you can decide the style in which you want the explanation (for example, described in the Victorian style, explain as if you were a child, etc…)
  • ChatGPT can write an essay in less than ten seconds (although this is a capability already seen with GPT3)
  • It could be used as a search engine to answer specific questions, as some suggest.
  • It can admit and fix its mistake which is actually impressive.

what are the limitations?

  • The model can hallucinate, like other models presented so far, when ChatGPT does not know the answer it still tries to answer (e.g. by providing wrong or nonsensical information).
  • However, the essays it produces are often repetitive, sometimes inaccurate, and still require correction.
  • The model often writes nonsense but is able to make it plausible, so one must still always maintain a critical attitude

Indeed, similar patterns open up some interesting perspectives but also some disturbing questions. For example, the more the models improve the more they could be used by, for example, students to cheat in their assignments. Moreover, the model is capable of generating nonsense (while also making its errors plausible), risking misleading its users. Therefore, we discuss finding ways to identify what is written by artificial intelligence models (a kind of fingerprint that can be recognized).

Can ChatGPT be a data scientist?

I took a cue from this Tweet for this experiment, why not conduct a short interview with ChatGPT on a data scientist position?

The plan is some general questions, then use questions that have actually been used in interviews with DeepMind (or Google, after all, it is the parent company).

Let’s try.

Welcome to this interview. why did you apply for the position of data scientist at DeepMind?

OpenAI new model ChatGPT

What do you know about the company and the company’s value?

OpenAI new model ChatGPT

Which project presented by the company excited you the most and why?

What is your experience as a data scientist? Tell me about the project you are most proud of

OpenAI ChatGPT

What are the technical difficulties you encountered and how did you solve them?

OpenAI new chatbot ChatGPT

I would like to ask you some technical questions. Imagine you have a dataset, where there are a number of missing values. What would you do to resolve this issue?

OpenAI ChatGPT
OpenAI ChatGPT

write the python code for a geometric algorithm for calculating the value of Pi

(The question is taken by glassdoor, I have then tested the results, you can find them in this Google Colab)

OpenAI ChatGPT
OpenAI ChatGPT

Interestingly I asked a similar question: “Write a code calculating a pi number using RNG” but the answer was the same

Definition of positive definite matrix

OpenAI ChatGPT

Write a function to generate N sample from a normal distribution and plot the histogram

OpenAI ChatGPT
OpenAI ChatGPT
OpenAI ChatGPT

How do you invert a binary tree?

OpenAI ChatGPT
OpenAI ChatGPT

Let’s see how it does with something quite difficult. For example, those questions that LeetCode considers difficult. I looked for questions that have been used in FAANG interviews.

The question is not always clear to the chatbot, in some cases like this one after several seconds of waiting you get an error.

OpenAI ChatGPT

Given two sorted arrays nums1 and nums2 of size m and n respectively, return the median of the two sorted arrays.

OpenAI ChatGPT
function median(nums1, nums2):
# Merge the two arrays into one sorted array.
merged = []
i = 0 # Index for nums1
j = 0 # Index for nums2
while i < len(nums1) and j < len(nums2):
if nums1[i] <= nums2[j]:
merged.append(nums1[i])
i += 1
else:
merged.append(nums2[j])
j += 1

# Add the remaining elements of nums1 (if any) to the end of merged.
while i < len(nums1):
merged.append(nums1[i])
i += 1

# Add the remaining elements of nums2 (if any) to the end of merged.
while j < len(nums2):
merged.append(nums2[j])
j += 1

# Find the median of the merged array.
if len(merged) % 2 == 1:
# Odd number of elements.
median = merged[len(merged) // 2]
else:
# Even number of elements.
median = (merged[len(merged) // 2 - 1] + merged[len(merged) // 2]) / 2

return median

I tested here the code and it seems to work fine.

Given an unsorted integer array nums, return the smallest missing positive integer.

OpenAI ChatGPT
OpenAI ChatGPT
OpenAI ChatGPT
OpenAI ChatGPT

We can note:

  • first, it was not necessary to provide pseudocode. After all, we wanted the code in Python and wrote it right away.
  • Second, the answer is not complete; we probably ran out of tokens per answer.
  • Also, I tested the LeetCode example (as input it uses an array [3, 4, -1,1]) so probably the LeetCode questions were used to train the algorithm.
    Despite this, it seems plausible but if you test the result is wrong
OpenAI ChatGPT

Given n non-negative integers representing an elevation map where the width of each bar is 1, compute how much water it can trap after rain. if the heights are: height = [0,1,0,2,1,0,1,3,2,1,2,1], how much water can be trapped?

image from here

ChatGPT provides the algorithm code (for brevity you can find it in Colab). It adds:

OpenAI ChatGPT

The result is still wrong.

In this Google Colab, you can find the code generated by ChatGPT and the testing:

Conclusions

The result is impressive, but at the moment we will not have an AI data scientist (and if I were you I would not use him as a secret interview aid).

As we have seen it depends on how one formats the question the model may not be able to understand the question, could be inconsistent, or add superfluous information (sometimes the answer seems stereotypical).

and show interesting examples of how things can go wrong (here and here).

ChatGPT is impressive because of the amount of text and code it has been trained with but is incapable of reasoning. On the other hand, the answers even when they are wrong seem plausible so they could be dangerous if used without caution. What do you think? let me know in the comments.

If you have found it interesting:

You can look for my other articles, you can also subscribe to get notified when I publish articles, and you can also connect or reach me on LinkedIn. Thanks for your support!

Here is the link to my GitHub repository, where I am planning to collect code and many resources related to machine learning, artificial intelligence, and more.

Or feel free to check out some of my other articles on Medium:

--

--

Salvatore Raieli
Salvatore Raieli

Written by Salvatore Raieli

Senior data scientist | about science, machine learning, and AI. Top writer in Artificial Intelligence

Responses (31)