Building Agentic workflows for code generation: Sharing the journey

Reading Time: 5 minutes

Agentic workflows were just introduced, but they seem to evolve on a daily basis with new tools and architectures every day, as we learn to leverage them in real world applications. These are my insights from Anima’s ongoing journey to automate Front-end engineering with AI design to code solutions.

At Anima, we’ve been working on various AI flows during the last few years. Our workflow evolved in a fairly typical manner, but it might be worthwhile to follow it, particularly if you’re concerned with going with agentic workflows or if you are unclear about the implications of it.

I have to admit that I started with a strong notion that agentic workflows were being overhyped. Too many companies rely on this type of hype, and it didn’t “click” that this is a fundamental change to how we communicate with LLMs or why the results were so different from just prompt engineering away a problem…

When we started, GPT3.5/4 was the only reasonably intelligent LLM, so it made sense to tailor the workflow to these two LLMs. Like everyone else, we struggled with short context windows, costs, and limited output token limitations. We provided insufficient context, then too much context, experimented with RAG and LangChain. Ultimately, what worked for us was a custom framework (internally codenamed “Hera”) that utilized a mixture of our 7years of heuristics algorithms paired with the LLM, which ultimately generated a 1+1=3 result. This produced the best code results from a figma design, balancing code quality with Fidelity (how accurate the result corresponded to the original design) – according to X and our 1.5 million Figma plugin installations. This also allowed us to partner with the top coding agents: Bolt (Stackblitz) and Replit, which currently use our Code engine.

Our next offering presented a much more sophisticated challenge and workflow and required a more flexible architectural approach. This API enables vibe coding companies and users to modernize and convert both public-facing as well as internal webpages into modern, high-fidelity React code. This allows users to modernize, ideate and prototype over existing designs instead of starting from a prompt or an image. When approaching this challenge, we needed a much more flexible approach, mixing various LLMs with some of our existing algorithms (Some required adjustments), letting them iterate on particular sub-problems, and giving them smarter, more relevant context. This was even more evident when working on our Prompt to code API: Flexibility and variability were the key decision factors here.

For that purpose, we built the new flow based on LangGraph, OpenRouter, LangSmith, and a host of additional tools that let us forgo developing the Developer Experience while retaining observability and increasing our flexibility.

The idea is to break down the general problem into subtasks, carefully identifying which ones should or could run in parallel (some heuristic-based, some LLM-based), and what subtasks have different requirements, needs, and specific data or tools to run. Different jobs need different prompts and different levels of creativity or expertise – which often means different models. Some LLMs are faster, some are more creative, and some are better at particular tasks like coding. If we have a creative task, we don’t want to mix it with logical, actionable tasks, so these need different steps, different contexts, and different models. Quick tasks can run using a mini/nano model, and review tasks need less intelligence than creative tasks, so they can use models that specialize in that. We were surprised, but not only does this approach produce better overall results, but it often cuts costs and runs faster!

By splitting the work, we not only gain speed and reduce costs, but we can also identify wrong turns more easily, then intercept and correct intermediate steps by introducing heuristic (or LLM-based audits) for validations. These ensure things are running smoothly, can score, critique, and even augment the results. For example, steps that generate a particular spec could take a wrong turn or forget something, so we can add a validation step right after them using a fast LLM that identifies if something was missed or forgotten. The role of the first step is an “expert spec writer”, whereas the role of the second step is to be a “quality control expert”, “Accessibility expert”, or an “auditor”. This role change lets it detect and augment any missing aspects that a single step can’t do effectively.

This is similar to how we work in real life. When we write a specification or a PRD, we’ll often ask someone else to review, or switch hats and re-read our document, but this time act as a “reviewer” of sorts and correct and add things we missed on the first round.

The results of the various steps are essentially entered into the Graph’s state, which means the spec and its auditing can both be sent to the next step. This observability lets us identify when and where the flow took a wrong turn, correct, debug, and retry until the results are improved. This agentic approach also lets us give more energy to the planning stages – we don’t let the agents plan something and other agents just implement them, we add review and heuristic checks along the way to ensure what we’re getting is both correct and reasonable.

Most agentic graphs are top-down, which typically means you have planning and thinking stages first, then implementation stages (where an agent or agents take in the results of the previous planning stages and actually try to implement them using the tools provided by the system), validation, correction, and evaluation steps. Each step should be implemented by a model best suited for the job (OpenRouter is an amazing way to switch between LLMs during the evaluation & scoring phases of the project). We want to refrain from overwhelming the models, so use context wisely – each step should include just the necessary context it needs from previous steps in order to function. The same goes for tools – if we overwhelm the model with too many tool options, it becomes dumber and might start acting out. Tools must validate that the LLM is not overreaching and be secured. If we give a model a tool capable of deleting a file or folder, it could very well not just delete the project, but the whole VM it’s running on. By planning out the tool’s ability and identifying “catastrophic” worst-case scenarios, we can limit the tools with the security rights or even design the tools around these limitations.

From a security perspective, having the load-bearing LLMs lower in the Graph also isolates them better and lets us protect the system against prompt injection, poison pills, and other jailbreaking attacks as more layers are found between the user’s data and the “smart LLMs” that are doing the heavy lifting. Those intermediate steps should be aware that they are acting as buffers so that they can identify bad actors and bad intentions as early as possible in the flows.

The most complex use-cases are flows where we have iteration steps that need to be looped, or even when we need to drill in which might mean that agents would need to recursively iterate to solve the particular problem. This is where observability becomes harder, as agents start to need a shared memory space to share and cooperate; it’s also hard to ascertain when or how the iteration/recursion should stop. How “deep” is “deep enough” is a hard question to answer, and we definitely can’t trust the LLMs to make that decision for now – they care less about runtimes or runaway LLM costs. For those, we currently implement hard limits and heuristic guardrails that prevent the models from overanalyzing and overworking a problem. Sub-agents help us reduce complexity by adding alternate paths for particular coding tasks that are similar, but not identical.

To be honest, the results are mind-blowing. We’re mostly constrained right now with costs and runtimes, but as models become smarter and faster, I can definitely see where we can continually improve and add more steps, more decision-making on iterations, and fewer constraints. My original concerns about unwieldy, hard to trace, results were incorrect. Not only can we understand better what the models are doing, we can intervene and intercept wrong turns faster, and costs and runtimes are down from the days when a huge prompt resulted in a huge output result, with no visibility and traceability.

It is a brave new world, called “software 3.0”, where prompts are an integral part of the code. Given how consequential these interchangeable parts are, new design patterns open up the need for new agentic workflows and new tools for building, monitoring and tracing these architectures.

agentic

Building Agentic workflows for code generation: Sharing the journey7 min read

Ofer LaOr

VP Engineering

Cancel reply

Building Agentic workflows for code generation: Sharing the journey7 min read