
What's Going on with LLMs?


Happy birthday to… ChatGPT! A year has flown by since the chatbot was released, and its uses have proven to be, uh, interesting? For example, a few months ago two lawyers were fined for submitting fake court citations from ChatGPT: the chatbot invented six cases that were then used in an injury claim - whoops! Interestingly, the judge presiding over the case wrote that there was nothing ‘inherently improper’ about using AI in legal work, but that any derived findings should be checked for accuracy. OK, so maybe its use in legal work is limited, but what about other fields? Well, the education sector has also been preparing itself for unprecedented challenges since ChatGPT’s release. Schools and universities have voiced concerns over the ability of the AI tool to produce high-quality essays, particularly with COVID having introduced a virtual exam era. And yet, its release has also prompted experts to reconsider what it means to examine youngsters in arbitrary fields - the Tony Blair Institute even suggested that GCSEs and A-Levels should be scrapped and replaced with assessments that better prepare people for working life. Fine, so ChatGPT is disrupting multiple sectors, so things at OpenAI’s office must be in good shape, right? Well, it seemed that way until OpenAI’s board decided to fire CEO Sam Altman before playing the Uno reverse card and taking him back. In summary, ChatGPT has had a drastic impact over its short life and there’s probably a whole lot more to come.

Ok, that’s all well and good, but what does any of it have to do with large language models (LLMs)? Well, AI-powered chatbots such as ChatGPT and Google Bard are built on LLMs, which are systems that aim to predict and generate plausible language. So what is a language model? A widely used example of a language model is ‘autocomplete’. These models work by estimating the probability of a ‘token’ within a longer sequence of ‘tokens’. For example, a string of ‘tokens’ may read as:

When a patient has collapsed, the best investigation is ____ .

In this case, our ‘tokens’ are words and the language model estimates how likely each candidate token is to fill the gap. For example, our option ‘tokens’ and their likelihoods may be: ECG (10.2%), echocardiogram (8.0%), lying-standing blood pressure (7.2%), etc. For this to work, we have to train the model by feeding it lots of example sentences, and we need to be careful about which sentences we’re selecting, but more on that later. Models based on algorithms like this clearly have broad potential uses, for example in generating text, answering questions and translation. The obvious issue we have here is that there are looooaaaads of words, and these are all used in different contexts depending on the field we are exploring. Therefore the language models we need are increasingly complex and large. In fact, they can be so complex that they can predict the probability of paragraphs and even entire documents. As a little thought experiment, do you think ChatGPT could have completed my previous paragraphs? Probably, and in a much quicker and more eloquent fashion than I have - this again exemplifies the huge complexity of LLMs. These models also need to be trained on large amounts of data, and the sourcing and validity of this data have been brought into question.
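To make the ‘guess the next token’ idea a bit more concrete, here is a minimal sketch in Python. It is nowhere near how ChatGPT actually works (real LLMs use neural networks trained on enormous datasets), and the training sentences, the function name and the resulting probabilities are all made up for illustration. It simply counts which word follows a given context in a handful of example sentences and turns those counts into probabilities.

```python
from collections import Counter

# Hypothetical toy training sentences, invented purely for illustration.
TRAINING_SENTENCES = [
    "the best investigation is ECG",
    "the best investigation is ECG",
    "the best investigation is echocardiogram",
    "the best treatment is rest",
]


def next_token_probabilities(sentences, context):
    """Estimate P(next token | context) by counting what follows the context."""
    context_tokens = context.split()
    n = len(context_tokens)
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()  # here a 'token' is simply a word
        for i in range(len(tokens) - n):
            # Record the token that comes straight after a matching context.
            if tokens[i:i + n] == context_tokens:
                counts[tokens[i + n]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}  # context never seen in the training sentences
    return {token: count / total for token, count in counts.items()}


probabilities = next_token_probabilities(TRAINING_SENTENCES, "investigation is")
for token, probability in sorted(probabilities.items(), key=lambda item: -item[1]):
    print(f"{token}: {probability:.1%}")
```

Notice that the sketch can only ever suggest words it has already seen after that exact context, which is one way of seeing why the choice of training sentences matters so much.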

Well, there must be some negatives, right? The big LLMs are expensive - they can take months to train and are therefore energy burners. There is also huge scope for data bias, especially as the training data is based on human language - this can contain bias around race, gender, religion, etc. that chatbots can then unfortunately reproduce, prompting many to fight for the use of ‘responsible AI’.

Ok ok, so you’re hyped about LLMs, but it’s all about ChatGPT and that basically covers general domain knowledge. What about healthcare? GatorTron is an example of a healthcare-based LLM which was trained using over 90 billion words of text from anonymised clinical notes from the University of Florida Health team, PubMed articles and Wikipedia. This has allowed GatorTron to outperform other LLMs in medical question answering (eg ‘what lab results does a patient have that are pertinent to diabetes diagnosis?’), and the developers hope that models like this can support physicians in making data-informed decisions and help identify adverse clinical events.

So where do we go from here? Well, it’s important to think about the steps we take in generating LLMs and how we can tailor these to benefit both physicians and patients. For example, an LLM trained through MCQs will likely only be able to offer academic answers to questions, but in reality we need answers that are standardised to current patient data - as such, we’re probably not very close to seeing chatbots implemented readily into clinical practice. Indeed, despite the huge amount of information available to the model, ChatGPT still does not score close to 100% on the USMLE, and it has also been shown to provide inaccurate information in response to patient queries. But maybe we’ll see some other applications in the medical field first, as outlined by a review of large language models in medicine:

  • Research applications eg critical appraisal

  • Educational applications eg material production for teaching

  • Clinical applications eg administrative tasks (letters, discharge summaries)

So, in summary, LLMs could probably reshape the medical field, but we’re a long way from that point, and from ChatGPT replacing doctors. LLMs have been shown to reproduce biases from their training data and are susceptible to spreading misinformation, which could be dangerous in clinical contexts. There are currently no mechanisms built into these models to ensure that their output is correct, and as such clinical use is again limited. It’s clear that there are a fair few issues to be sorted before LLMs are readily used in medicine.


Author: George Nishimura, MTF Content Writer

Editor: Ramat Abdulkadir, MTF National Technology Director

