With the arrival of large language models (LLMs), it has become commonplace to ask questions of unstructured long-form content found in documents of many formats. This can be done with commercial LLMs served via API endpoints (OpenAI's GPT being the best known) or, more recently, by deploying open-source LLMs (Meta AI's Llama 2 is gaining popularity). Document Q&A on PDF files with LLMs produces robust outputs, especially with the models at the top of benchmark leaderboards. LLMs are certainly known to hallucinate and produce incorrect or incoherent answers, but that comes with their generative nature. That said, a number of strategies exist to reduce hallucinations effectively (more on this in another post).
However, a lot of information in today's world lives in structured datasets, be it CSV files stored locally, Parquet files in enterprise cloud storage, or data held in relational or graph databases on premises or in the cloud. A significant amount of time and effort goes into creating these datasets with well-thought-out schemas or graphs with internal and external dependencies. Even more effort goes into maintaining these datasets and their quality, and into querying them to extract business value when it is needed. Usually this is done by a team of data engineers and analysts with deep knowledge of the data and the domain. But over time the information value of these datasets erodes because of the high cost of maintenance and information extraction. We think of the inaccessible information and insights locked in these datasets as "dark data".
Another exciting development with the arrival of LLMs is that these models have opened up structured data to queries in natural language – essentially talking to a structured dataset the same way one would talk to an analyst. The implications are profound, and one can immediately see the possibilities. It democratizes data access and brings leaders closer to data-driven decisions. It promises to free the "dark data" and insights locked inside datasets that are currently accessible only to technical and domain experts. There are also implications for data QA and data management practices once specialized skills are no longer needed to query structured data.
The first experiment in this series uses the Python package Sketch to query a structured dataset inside a Jupyter Notebook. This is the simplest setup we have come across so far that lets one query a structured dataset in natural language. The Sketch library can be installed locally or in a cloud-based Jupyter Notebook such as Google Colab. It integrates well with pandas: once the data is read into a pandas dataframe, Sketch methods such as .ask and .howto can be used to query the data in natural language.
We downloaded the therapeutic antibody dataset from The Structural Antibody Database. This dataset has 885 rows and 23 columns, covering therapeutic antibodies at all stages of clinical development through Jan 2023. The dataset was read from a CSV file into a pandas dataframe. Here is a snapshot of part of the dataframe.
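For anyone following along, a minimal setup sketch looks like the following; the filename is illustrative and the install line assumes a notebook environment:

# In a notebook cell: !pip install sketch pandas
import pandas as pd
import sketch  # importing sketch registers the .sketch accessor on pandas dataframes

# Load the therapeutic antibody dataset; the filename here is ours, not prescribed
df = pd.read_csv('thera_sabdab.csv')
print(df.shape)  # expect (885, 23) for the Jan 2023 snapshot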
The Sketch method .ask is good for describing the data and producing counts of columns, rows, and other descriptive statistics. The following natural language query lists the column names in the dataset:
df.sketch.ask('List all the columns separated by commas')
Therapeutic, Ab Type, CH1 Isotype, VD LC, Highest_Clin_Trial (Jan ’23), Est. Status, Heavy Sequence, Light Sequence, Heavy Sequence (if bispec), Light Sequence (if bispec), 100% SI Structure, 99% SI Structure, 95-98% SI Structure, Year Proposed, Year Recommended, Target, Companies, Conditions Approved, Conditions Active, Conditions Discontinued, Development Tech, Notes
Sketch does well with simple natural language queries such as the one above, and the outputs are correct. However, as questions become somewhat more complex, the .ask method starts to produce unreliable answers. These answers appear plausible, but upon validating against the real data, errors turn up more often than not. Consider the following example:
df.sketch.ask('How many therapeutic Antibodies are in Phase III?')
There are a total of 885 therapeutic Antibodies in the dataframe df. Of these, there are a total of 154 therapeutic Antibodies in Phase III
On the face of it the answer seems reasonable, but after validation it turns out that 125 therapeutic antibodies, not 154, are in Phase III in this dataset. Sketch did not get the count right. Similarly, consider the following query:
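For reference, this is the kind of pandas cross-check we used to validate the count. It is a minimal sketch: the column header and the exact phase label (e.g. 'Phase-III') are written as they appear in our copy of the dataset and may need adjusting for yours:

# Cross-check the .ask answer with plain pandas
phase_col = "Highest_Clin_Trial (Jan '23)"  # adjust if your header differs
phase3_count = (df[phase_col] == 'Phase-III').sum()  # exact label is an assumption
print(phase3_count)  # the dataset gives 125, not the 154 reported by .ask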
df.sketch.ask('List all the conditions approved for Guselkumab from Janssen?')
Conditions approved for Guselkumab include Plaque Psoriasis, Psoriatic Arthritis, and Juvenile Idiopathic Arthritis.
The answer above is incorrect. The Conditions Approved for Guselkumab listed in the table are: Erythrodermic psoriasis; Palmoplantar pustulosis; Plaque psoriasis; Psoriatic arthritis; Pustular psoriasis. Here the LLM misses some of the approved conditions that are listed and appears to hallucinate Juvenile Idiopathic Arthritis, which is not present at all. The takeaway is that answers from the natural language query cannot be relied upon.
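For transparency, here is a minimal pandas cross-check of the kind we used for validation (this snippet is ours, not generated by Sketch); it filters on both the Therapeutic and Companies columns to mirror the question:

# Validate the .ask answer directly against the dataframe
mask = (df['Therapeutic'] == 'Guselkumab') & df['Companies'].str.contains('Janssen', na=False)
print(df.loc[mask, 'Conditions Approved'].unique())
# Returns the five psoriasis-related conditions listed above;
# Juvenile Idiopathic Arthritis does not appear anywhere in that column for this row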
While answers from the .ask method cannot be relied upon, which is disappointing, the .howto method seems to redeem the situation. The .howto method is essentially a coding assistant that writes Python code for a query posed in natural language. Running this code produces an exact, deterministic result without the problems seen above. In our experiments, this approach still lets the dataset be queried in natural language while eliminating errors and hallucinations, since the output comes from executing the code. Here is how it works for the same question:
df.sketch.howto('Extract conditions approved for Guselkumab from Janssen?')
# Code to extract conditions approved for Guselkumab
guselkumab_conditions = df[df['Therapeutic'] == 'Guselkumab']['Conditions Approved'].unique()
print(guselkumab_conditions)
Answer: ['Erythrodermic psoriasis; Palmoplantar pustulosis; Plaque psoriasis; Psoriatic arthritis; Pustular psoriasis']
The question is translated into code which, when executed, gives the correct answer. One more example:
df.sketch.howto('List all the therapeutics from Janssen in Companies column?')
# Code to print all the therapeutics from Janssen in Companies column
janssen_therapeutics = df[df['Companies'].str.contains('Janssen')]['Therapeutic'].unique()
print(janssen_therapeutics)
Answer: ['Abciximab' 'Amivantamab' 'Carlumab' 'Cetrelimab' 'Daratumumab' 'Duvortuxizumab' 'Fulranumab' 'Golimumab' 'Guselkumab' 'Infliximab' 'Intetumumab' 'Mitazalimab' 'Nivolumab' 'Onvatilimab' 'Posdinemab' 'Rituximab' 'Siltuximab' 'Sirukumab' 'Talacotuzumab' 'Talquetamab' 'Teclistamab' 'Tepoditamab' 'Tesnatilimab' 'Ustekinumab' 'Voxalatamab']
This strategy of using the .howto method to generate the code produces the correct answer and is therefore reliable for interrogating the dataset. In our experiments so far, the answers generated with this approach have been correct. However, it is quite possible that as queries become highly complex the generated code will no longer produce correct answers; this remains to be investigated. The ease of setup via cloud-based notebooks makes quick iterative experimentation possible. If you are interested in asking questions of this dataset with the notebook, please contact us.
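One way to probe more complex queries, which we have not yet done systematically, is to write the pandas answer by hand and compare it against whatever .howto generates. A rough sketch of such a hand-written baseline, assuming the same column header as above:

# Hand-written baseline for a slightly more complex question:
# "How many therapeutics are at each highest clinical trial stage?"
# Compare this output against the code that df.sketch.howto(...) generates.
phase_col = "Highest_Clin_Trial (Jan '23)"  # adjust if your header differs
print(df[phase_col].value_counts())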
Final Verdict
The Sketch library in Python is fast, lightweight, and easy to set up, and it allows natural language Q&A on structured datasets. Asking questions directly in natural language is unreliable and produces incorrect answers. As a coding assistant, however, it does a good job of translating natural language into code that produces correct answers. Hence, it can potentially prove to be a useful tool for Q&A on structured datasets.
If you liked this post, please sign up for Parts II & III of this series.