Welcome back to the second installment of this three-part series, which outlines a pragmatic approach to selecting LLM-based tools for data-driven enterprise applications.
In this blog series, I break down the tooling ecosystem for LLMs, starting with open-source tools and public cloud-based managed services, to help organizations select the right tool or suite of tools for their use case and understand how to best approach decision-making. (We are two parts in, and I still don't mention a single tool until the end of this post.)
In the last installment, we looked at the starting point for evaluation, the most viable use cases, and the non-negotiable elements on which to build.
In this installment, we do two things: we define the Task Components that serve as the building blocks of LLM-based workflows, and we show how to decompose those workflows into pieces that can be evaluated and improved independently.
Along the way, we will begin to introduce some of the criteria we recommend you consider for evaluating and selecting the software and infrastructure tools once you have a firm handle on your use case and workflow.
A Task Component is any node you would find in an LLM-based workflow, including the jobs you would ask an LLM-based tool or system to do. Task Components can be combined to accomplish meaningful goals, but not all tools and frameworks can connect each of these Task Components together.
Therefore, the selection of the tool or framework depends on the Task Components that need to be combined, the complexity of the workflow graph (diagram) to be constructed, and your willingness to take a service-based approach to LLM-based workflow development.
Some frameworks support only a limited set of Task Components. Others support only linear workflows. Still other frameworks outsource the construction of the workflow graph to the LLM itself, or decompose the graph into agents that operate independently.
Below, I provide a few examples of workflows and corresponding task components. These examples are illustrative rather than exhaustive, but they are representative of the types of components typically incorporated into an LLM-based system.
We recommend constructing the workflow graph yourself, with the possibility of conditional nodes using LLM-based or directly coded “reasoning” to lead to other segments of the graph. Repeated segments of the graph can appear as loops.
If you have an LLM-based use case in mind, find a piece of paper, a Lucidchart diagram, or a Miro board, and start diagramming the workflow of your use case using the task components described below. Understanding which task components you will need, and where each will integrate with other systems and other task components, helps determine which tools you should use to construct the workflow. This diagram will likely resemble a directed acyclic graph (DAG)*.
Note: A directed acyclic graph (DAG) is a conceptual model used to represent a sequence of operations or tasks where the vertices (nodes) represent the tasks, and the directed edges (arrows) indicate the dependencies [or information passed] between these tasks. The "acyclic" part of the name implies that the graph does not contain any cycles; this means there is no way to start at one task and traverse through the dependencies to arrive back at the original task. This property is crucial because it ensures that there are no infinite loops or deadlocks within the workflow. Source: ChatGPT
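If a diagram on paper feels too informal, the same idea takes only a few lines of code. Below is a minimal sketch (the node names are hypothetical, chosen purely for illustration) that represents a workflow as a mapping from each task to its dependencies and uses Python's standard-library graphlib to confirm the graph is acyclic and to produce a valid execution order:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical task components for a research-report workflow.
# Keys are nodes; values are the nodes they depend on (incoming edges).
workflow = {
    "retrieve_documents": {"parse_user_query"},
    "rerank_results": {"retrieve_documents"},
    "generate_draft": {"rerank_results", "parse_user_query"},
    "evaluate_draft": {"generate_draft"},
    "revise_and_combine": {"evaluate_draft"},
}

# static_order() raises CycleError if the graph contains a cycle,
# which is exactly the "acyclic" property described in the note above.
execution_order = list(TopologicalSorter(workflow).static_order())
print(execution_order)  # one valid order in which to run the task components
```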
Below, we list the basic building blocks of LLM-based workflows.
Please note that not all of these components need to be completed by an LLM. Although an LLM could handle every one of them, any individual component could instead be handled by a human in the loop, another model in the loop, or an API in the loop. The beauty of identifying the task components is that you can understand the nature of the workflow to be completed before you selectively integrate the LLM to complete aspects of that workflow.
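As a concrete illustration of that point, here is a minimal sketch of a single task component, a summarization step, defined as an interface that an LLM, a human, or any other service could fulfill. The client object and its complete method are hypothetical placeholders, not a specific vendor API:

```python
from typing import Protocol

class Summarizer(Protocol):
    """One task component: turn a long document into a short summary."""
    def run(self, document: str) -> str: ...

class LLMSummarizer:
    """LLM in the loop. The client and its complete() method are hypothetical."""
    def __init__(self, client, model: str = "some-model"):
        self.client, self.model = client, model

    def run(self, document: str) -> str:
        return self.client.complete(model=self.model, prompt=f"Summarize:\n{document}")

class HumanSummarizer:
    """Human in the loop: the workflow pauses and asks a person instead."""
    def run(self, document: str) -> str:
        print(document)
        return input("Enter a summary: ")

def build_report(documents: list[str], summarizer: Summarizer) -> str:
    # The surrounding workflow does not care which implementation fulfills the component.
    return "\n\n".join(summarizer.run(doc) for doc in documents)
```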
Identifying task components also encourages you to be aware when adding components to your system. These systems are so flexible and human-workflow-like that I often find myself adding components without even realizing it. Yet each new component adds another layer of complexity, another thing that must be evaluated, quality controlled, debugged, and optimized.
As you can see, LLM-based systems can exist at varying levels of complexity. A simple retrieval-augmented generation (RAG) system might only have a few components:
The RAG example above uses the input as a query, injects the retrieval results into the generator, and receives the output. This architecture is used in many proofs of concept, and it can achieve many valuable goals.
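In code, that simple pipeline is only a handful of lines. The sketch below assumes a hypothetical vector_store with a search method and a hypothetical llm client with a complete method; the point is the shape of the workflow, not any particular library:

```python
def simple_rag(question: str, vector_store, llm, k: int = 4) -> str:
    # 1. Use the input directly as the retrieval query.
    documents = vector_store.search(question, top_k=k)  # hypothetical API

    # 2. Inject the retrieval results into the generator's prompt.
    context = "\n\n".join(doc.text for doc in documents)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

    # 3. Receive the output.
    return llm.complete(prompt)  # hypothetical API
```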
A more complex RAG system might have more components:
Please note that this diagram is not as complicated as it could be; it would grow further if, for example, you utilized a router to query specific types of documents or data depending on the nature of the query, as in so-called “Agentic RAG.” Nevertheless, this workflow diagram is directionally correct for companies that are using LLMs to compile large research reports that synthesize information from available data.
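For completeness, here is a rough sketch of what such a routing node might look like. The conditional “reasoning” can be directly coded, delegated to the LLM as a constrained classification task, or both; the llm client and the store objects are hypothetical placeholders:

```python
def route_query(question: str, llm) -> str:
    """Decide which corpus to search: a conditional node in the workflow graph."""
    # The "reasoning" can be directly coded...
    if "revenue" in question.lower() or "quarter" in question.lower():
        return "financial_reports"
    # ...or delegated to the LLM as a constrained classification task.
    label = llm.complete(  # hypothetical client
        "Classify this question as one of: financial_reports, "
        f"engineering_docs, support_tickets.\nQuestion: {question}\nLabel:"
    )
    return label.strip()

def routed_rag(question: str, stores: dict, llm) -> str:
    corpus = route_query(question, llm)
    documents = stores[corpus].search(question, top_k=4)  # hypothetical store API
    context = "\n\n".join(doc.text for doc in documents)
    return llm.complete(f"Context:\n{context}\n\nQuestion: {question}")
```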
I find most of the online discussions of these workflows focus on the idea that “good systems always incorporate more (complex) components,” such as more refined chunking, more elaborate search systems, additional layers of reranking, etc. I respectfully disagree. You should use the absolute minimum number of components necessary to accomplish your goal. Even if you choose to utilize more components, the best approach is to break down the workflow into as many independent workflows as possible, each of which can be evaluated on its own merits.
By converting a monolithic LLM system into a series of microservices, each of which can be benchmarked and evaluated in isolation, the software system as a whole can be optimized by evaluating each individual component without having to diagnose problems with complicated end-to-end scenarios.
For example, here is how one could break down the complicated RAG workflow into something more manageable:
Phase 1: How good is our system at returning the documents we need to return? Tweak the queries, the query optimization, the document chunking, and the retrieval parameters until the initial ranked set of documents places the most relevant information for our specific use case toward the top of the results. (A minimal sketch of this kind of isolated retrieval evaluation follows the list of phases below.)
Phase 2: How well is our reranking model working? Tweak the reranking model parameters (or just remove the reranking component entirely!) to ensure that the secondary output is reliably the most relevant document(s) to the input queries.
Phase 3: Is the initial model generating the output we expect from the augmentation and conditioning we apply to it? Optimize the augmentation and conditioning (prompt optimization, fine-tuning, preference optimization) until the output matches your expectations. Or return to phase 2 to improve the outputs from the retrieval pipeline.
Phase 4: Are our revisions to and validations of intermediate outputs improving the product? Iterate on the revision and evaluation process until the output meets your expectations. Establish better criteria for evaluating the intermediate draft to improve the back-prompting likely happening within the Model / Generation.
Phase 5: Are our revisions of the final combined output leading to improvement in the quality of the output? Are we improving against our final evaluation benchmarks? If the individual outputs from Phase 4 are good, then any remaining issues have to be with the combination and revision of the final output to be produced. Once again, this would lead us to diagnose an issue with our revision approach, which may have to do with the conditioning we apply to the model, the information we incorporate into the Augmentation or Prompt, or the criteria we utilize for intermediate evaluation while refining the final Output.
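Here is the Phase 1 sketch promised above: a minimal, isolated retrieval evaluation. It assumes you have a small labeled set of queries with known-relevant document IDs and a retriever with a hypothetical search method; no generation is involved at all:

```python
def recall_at_k(retriever, labeled_queries: list[dict], k: int = 5) -> float:
    """Phase 1 in isolation: what fraction of known-relevant documents
    appear in the top-k results? No reranking or generation involved."""
    hits, total = 0, 0
    for example in labeled_queries:
        results = retriever.search(example["query"], top_k=k)  # hypothetical API
        retrieved_ids = {doc.id for doc in results}
        relevant_ids = set(example["relevant_doc_ids"])
        hits += len(retrieved_ids & relevant_ids)
        total += len(relevant_ids)
    return hits / total if total else 0.0

# labeled_queries = [{"query": "...", "relevant_doc_ids": ["doc-12", "doc-98"]}, ...]
# Rerun this single number after every tweak to chunking or retrieval parameters.
```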
Does this seem more complicated? From an infrastructure perspective, it likely is. Each of these phases needs to be hosted independently, and interfaces need to be created between them.
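As one example of what “hosted independently, with interfaces between them” might look like, here is a sketch of the retrieval phase exposed as its own small HTTP service using FastAPI. The endpoint name, the payload shapes, and the in-memory store are all hypothetical stand-ins:

```python
from dataclasses import dataclass
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

@dataclass
class Doc:
    id: str
    text: str

class InMemoryStore:
    """Stand-in for a real vector store client, for illustration only."""
    def __init__(self, docs: list[Doc]):
        self.docs = docs

    def search(self, query: str, top_k: int) -> list[Doc]:
        # Trivial keyword match in place of real vector search.
        matches = [d for d in self.docs if query.lower() in d.text.lower()]
        return matches[:top_k]

vector_store = InMemoryStore([Doc("doc-1", "Example document text.")])

class RetrievalRequest(BaseModel):
    query: str
    top_k: int = 5

class RetrievalResponse(BaseModel):
    document_ids: list[str]
    snippets: list[str]

@app.post("/retrieve", response_model=RetrievalResponse)
def retrieve(request: RetrievalRequest) -> RetrievalResponse:
    # Downstream phases (reranking, generation) only ever see this contract,
    # so the retrieval internals can change without touching them.
    docs = vector_store.search(request.query, top_k=request.top_k)
    return RetrievalResponse(
        document_ids=[d.id for d in docs],
        snippets=[d.text[:500] for d in docs],
    )
```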
Yet many benefits come with decomposing your workflow into task components and phased workflows:
A microservice approach to LLM-based workflow development gives you the confidence to experiment with and integrate new tools.
Do you want to try a different vector database with joint keyword and vector querying? Great! Swap out your existing database to see how it impacts your preliminary evaluations.
Do you want to change your chunking strategy? No problem! Just alter your component and see how it impacts preliminary evaluations.
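Here is a sketch of why those swaps stay cheap: when each component sits behind a small interface, changing the chunking strategy (or the retrieval backend) is a change at a single composition point, and the preliminary evaluation harness stays exactly the same. The chunkers and the retriever factory below are hypothetical illustrations:

```python
from typing import Callable, Protocol

# Any chunking strategy is just a function from a document to a list of chunks.
Chunker = Callable[[str], list[str]]

def fixed_window_chunker(document: str, size: int = 500) -> list[str]:
    return [document[i:i + size] for i in range(0, len(document), size)]

def paragraph_chunker(document: str) -> list[str]:
    return [p.strip() for p in document.split("\n\n") if p.strip()]

class Retriever(Protocol):
    def search(self, query: str, top_k: int) -> list[str]: ...

def build_index(documents: list[str], chunker: Chunker, retriever_factory) -> Retriever:
    # Swapping the chunker (or the retriever backend) only changes what is
    # passed in here; the evaluation harness downstream does not change.
    chunks = [chunk for doc in documents for chunk in chunker(doc)]
    return retriever_factory(chunks)  # hypothetical factory for a vector DB client
```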
Thinking this way assists both with improving your system as a whole and with selecting tools to assist with that improvement.
As a teaser for Part 3 of this series on how to evaluate tools and frameworks for working with LLMs, we believe you should choose tools and frameworks that align with the following principles embodied in the discussion above:
With the principles above, we believe that you can find the LLM tools that set your workflow up for long-term success and maintainability.
Harrison Chase, founder of LangChain, recently proposed these levels of LLM automation in a blog post (shown in the image below).
My view is that the further you move down this hierarchy, the more likely you are to fail. I suggest you architect your LLM-based system as close to the top of the hierarchy as possible, given your use case, and work your way down. Fully autonomous LLM-based systems are still a ways off (or a large investment of time and money off), so I would avoid attempting this for now. Building a viable LLM-based state machine is a worthy goal, but not a trivial exercise.
Several of the agentic frameworks, in particular CrewAI, focus on assigning roles to different “agents” so that messages can pass between them and teams of agents can accomplish a particular goal. So, in the case of the complex RAG framework demonstrated above, with multiple iterations of generation, revision, and evaluation, each component could be assigned to a separate “agent” with a separate role. In many ways, this framework is easier to understand because it maps to our experience of working in teams of people who have different roles to accomplish a given outcome.
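To give a flavor of the paradigm, here is roughly what a two-agent crew looks like in CrewAI at the time of writing. The roles, goals, and task descriptions are hypothetical, the exact argument names may differ across versions, and an underlying LLM configuration (typically supplied via environment variables) is assumed:

```python
from crewai import Agent, Task, Crew

researcher = Agent(
    role="Research Analyst",
    goal="Gather and summarize the relevant documents for the report",
    backstory="You dig through retrieved sources and extract the key facts.",
)
writer = Agent(
    role="Report Writer",
    goal="Turn the research notes into a coherent draft report",
    backstory="You write clear, well-structured prose from bullet points.",
)

research_task = Task(
    description="Summarize the key findings from the retrieved documents.",
    expected_output="A bullet-point list of findings with citations.",
    agent=researcher,
)
writing_task = Task(
    description="Write a one-page report based on the research notes.",
    expected_output="A one-page draft report.",
    agent=writer,
)

crew = Crew(agents=[researcher, writer], tasks=[research_task, writing_task])
result = crew.kickoff()  # messages pass between the agents to produce the report
```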
Yet this framework requires you to give the LLM, as an “agent,” more control than I would recommend giving. Yes, there are methods of incorporating human feedback into the agent’s workflow, but one of the downsides of this paradigm is that the automation is completely AI-driven rather than controlled by software you own and architect directly. This creates many more opportunities for errors and leaves far fewer levers with which to optimize the system.
Ultimately, I think these systems can work well for research, because research is an intermediate output along a larger workflow with a human in the loop. But these systems can consume a lot of API tokens, and when you hit the ceiling of their capabilities, there is no recourse but to replatform onto a microservices-based system with more control.
Several agentic frameworks, such as LangGraph, Burr, and the newly released ControlFlow, focus on constructing the DAG / state machine for the LLM-based workflow. I love that this organizational principle (heavily utilized in this blog post) is explicitly embodied in these frameworks, and that the DAG can be visualized as you construct the graph's nodes (tasks or states) and edges (dependencies) along the pathway to achieving the outcome.
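To illustrate the pattern, here is a rough sketch in the style of LangGraph's StateGraph API at the time of writing: nodes are functions that update a shared state, edges define the dependencies, and a conditional edge closes the revision loop. The node bodies are trivial placeholders, and the exact API surface may differ across versions:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class ReportState(TypedDict):
    question: str
    documents: list[str]
    draft: str
    approved: bool

def retrieve(state: ReportState) -> dict:
    # Placeholder; a real node would call your retrieval service.
    return {"documents": [f"Document about {state['question']}"]}

def generate(state: ReportState) -> dict:
    # Placeholder; a real node would call an LLM with the retrieved context.
    return {"draft": " ".join(state["documents"])}

def evaluate(state: ReportState) -> dict:
    # Placeholder; a real node would apply your intermediate evaluation criteria.
    return {"approved": len(state["draft"]) > 0}

graph = StateGraph(ReportState)
graph.add_node("retrieve", retrieve)
graph.add_node("generate", generate)
graph.add_node("evaluate", evaluate)
graph.set_entry_point("retrieve")
graph.add_edge("retrieve", "generate")
graph.add_edge("generate", "evaluate")
# Conditional edge: loop back for another revision, or finish.
graph.add_conditional_edges(
    "evaluate",
    lambda state: "done" if state["approved"] else "revise",
    {"revise": "generate", "done": END},
)
app = graph.compile()
# result = app.invoke({"question": "..."})
```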
And yet, embodying the full DAG within a single framework makes it more challenging to decompose that DAG into its component parts and achieve the benefits discussed above. It's not impossible, just more difficult and subject to the levers and enhancements exposed by the framework's developers.
The reason for this will be the subject of Part 3 of our discussion of LLM-based tools for data-driven applications: the lightness or heaviness of the abstractions provided by LLM tools and frameworks, and how that relates to your ability to confidently build maintainable and enhanceable applications on top of them.
If you made it this far, I thank you for reading and look forward to your feedback!