In a previous post we explored the idea of using large language models (LLMs) to query structured datasets such as data files and databases. We introduced the Sketch library, which integrates with Python and can be used in a notebook to query large datasets. Sketch methods can be used either to ask questions of a dataset directly in natural language, or as a coding assistant that translates natural language questions into code. We asked simple questions and evaluated the responses. The coding assistant performed well, translating natural language questions into code that produced correct answers.
In this post we evaluate the ability of the coding assistant to translate complex questions into code and subsequently into answers. We use the same therapeutic antibody database from the previous post. We ask questions that would be of natural interest, such as "what is the distribution of the biological targets for these therapeutic antibodies?" Interestingly, we can ask in natural language for the code that will plot the distribution. Here is what it looks like:
df.sketch.howto("Plot a bar chart for distribution of top 10 targets for therapeutic antibodies in this database")
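For reference, the code the assistant generates for this prompt looks roughly like the following sketch; the column name Target is an assumption about this dataset's schema, not a confirmed field name.

# A minimal sketch of the kind of code generated for this prompt.
# "Target" is an assumed column name in the therapeutic antibody dataset.
import matplotlib.pyplot as plt

top_targets = df["Target"].value_counts().head(10)
top_targets.plot(kind="bar")
plt.title("Top 10 targets for therapeutic antibodies")
plt.xlabel("Target")
plt.ylabel("Number of antibodies")
plt.tight_layout()
plt.show()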
Here the distribution of targets is as expected, with PDCD1 (PD-1) having the largest number of antibody therapeutic agents. PD-1 has been one of the success stories of the age of biologics discovery. However, one might notice that some other important and well-known targets are missing. For example, PD-L1 is a ligand for PD-1 and also a target for antibody therapeutics. It turns out that PD-L1 appears in the distribution under its alias CD274. Someone consuming this analysis without subject matter expertise would miss this, and we found other such examples. This is a problem of data normalization: the dataset is not normalized, and no data dictionary is used to map aliases to a common biological name. This is a very common problem with complex datasets, and normalization is always part of a typical data engineering pipeline. However, data normalization is plagued by a dimensionality problem: it is often hard to anticipate how many levels of normalization are required for the analysis to be robust enough for insight generation.
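As a sketch of what one normalization step could look like, a small alias map applied before aggregation would merge entries such as CD274 and PD-L1. The mapping and column name below are illustrative only; a real pipeline would rely on a curated data dictionary of gene symbols and aliases.

# A sketch of one normalization step: map known aliases to a canonical
# target name before aggregating. The mapping is illustrative and far
# from complete; "Target" is an assumed column name.
alias_to_canonical = {
    "CD274": "PD-L1",   # PD-L1 is also known by its gene symbol CD274
    "PDCD1": "PD-1",    # PD-1's gene symbol
}

df["Target_normalized"] = df["Target"].replace(alias_to_canonical)
df["Target_normalized"].value_counts().head(10)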
Here is one more example of a complex question that we asked of the dataset in natural language. We wanted to plot the yearly distribution of whole monoclonal antibodies, antibody-drug conjugates (ADCs), and bispecific antibodies. Here are the question and the plot that was generated:
df.sketch.howto("Plot a bar chart for yearly distribution of three series - Whole mAb, ADC-Whole, and Bispecific mAb")
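A plausible version of the generated code is sketched below; the Year and Format column names and the three format labels are assumptions about the schema, not confirmed field names.

# A sketch of the grouped bar chart, assuming "Year" and "Format" columns
# with the three format labels used in the prompt.
import matplotlib.pyplot as plt

formats = ["Whole mAb", "ADC-Whole", "Bispecific mAb"]
subset = df[df["Format"].isin(formats)]
counts = subset.groupby(["Year", "Format"]).size().unstack(fill_value=0)
counts.plot(kind="bar", figsize=(10, 5))
plt.title("Yearly distribution of antibody formats")
plt.xlabel("Year")
plt.ylabel("Number of entries")
plt.tight_layout()
plt.show()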
In this example the natural language query generates and executes the code to create the plot above, which gives a temporal view of monoclonal antibody development and the emergence of antibody-drug conjugates (ADCs) and bispecific antibodies as new classes of therapeutic agents. While this analysis gives a good sense of emerging trends, it also suffers from the data quality and normalization issues discussed previously. Not all ADCs are listed in this dataset, which is skewed toward whole mAbs. For example, while datopotamab is listed as a whole monoclonal antibody, the ADC datopotamab deruxtecan is not present. One would only be able to catch this if one were highly knowledgeable about therapeutic mAbs and ADCs. Admittedly, this database is mostly focused on the structures of therapeutic antibodies and cannot be used as a source of truth for all the metadata. It can be augmented and normalized for more robust insights and analytics. If you are interested in an improved version of this dataset, please contact us.
So far we have demonstrated the utility of a generative AI tool that allows Q&A on structured datasets. This dataset is quite small, and the analysis can be carried out very easily in a notebook using the pandas library in Python. But how well would this approach perform on larger datasets? Pandas can typically handle datasets of around 2-3 GB, but beyond that it becomes intractable due to memory issues, since pandas loads all the data into memory. In the next and last post, we will evaluate the performance of natural language queries on a much larger dataset: we will load clinical trials datasets that are 1-3 GB in size and report on the performance and quality of the results. Stay tuned.
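One common workaround, sketched below, is to process such files in chunks so that only a slice of the data is in memory at a time; the file name and column name here are placeholders, not the actual clinical trials schema.

# A sketch of chunked processing for a CSV that does not fit in memory.
# "clinical_trials.csv" and the "phase" column are hypothetical placeholders.
import pandas as pd

chunk_counts = []
for chunk in pd.read_csv("clinical_trials.csv", chunksize=100_000):
    # Aggregate each chunk separately, keeping only small summaries in memory.
    chunk_counts.append(chunk["phase"].value_counts())

# Combine the per-chunk summaries into one overall count.
totals = pd.concat(chunk_counts).groupby(level=0).sum()
print(totals)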
Final Verdict
Generative AI-based models are able to translate natural language questions into code that can be executed on a structured dataset to generate complex analyses and insights. These code generation methods are quite robust and, in our experiments, did not hallucinate; if the question is framed properly (prompt engineering), they produce correct outputs. Validating the output still depends on knowledge of the subject and the domain, which matters more than having the technical skills to query the data. Are we ready to leave data engineering out of the workflow? Not yet. But the insight generation cycle can be accelerated significantly.