We are living in the era of AI agents. LLMs are no longer just tools. They’ve become active participants in our lives, boosting productivity and transforming the way we live and work.
- OpenAI recently launched Operator, an AI agent that can autonomously perform various tasks, from browsing the web to filling out forms and scheduling appointments.
- Anthropic introduced MCP (Model Context Protocol), a new standard for how AI assistants interact with the outside world. With over five thousand active MCP servers already, adoption is growing rapidly.
- AI agents are also changing the landscape of software engineering. Tools like GitHub Copilot’s agentic mode, Claude Code, OpenAI Codex, and others are not only improving developer productivity and code quality but also democratising the field, making software development accessible to people without a technical background.
We’ve previously looked at different AI Agent frameworks, such as LangGraph or CrewAI. In this article, I would like to discuss a new one I’ve been exploring recently — HuggingFace smolagents. It’s an interesting framework since it implements the concept of code agents.
In this article, we will explore several topics:
- What code agents are (teaser: it’s not related to vibe coding).
- How to use the HuggingFace smolagents framework in practice.
- Whether it’s secure to give LLMs so much agency.
- The real difference in performance between code agents and traditional tool-calling agents.
AI Agents recap
Let’s start with a quick refresher: what exactly are AI agents? HuggingFace provides a clear and concise definition of what they mean by agents.
AI Agents are programs where LLM outputs control the workflow.
So, we need an agentic flow when we want a system to reason and act based on observations. Actually, agency is not a binary variable (yes or no), but a spectrum.
- At one end, we can have systems without agency at all, for example, a simple process where an LLM defines the sentiment of a text, translates it or summarises it.
- The next level is routing, where an LLM can classify an incoming question and decide which path to take — for example, calling a tool if a customer is asking about the status of their current order, and transferring the conversation to a human CS agent otherwise.
- More advanced systems can exhibit higher degrees of agency. These might include the ability to execute other LLMs (multi-agent setup) or even create new tools on the fly.
Code agents fall into this more advanced category. They are multi-step agents that execute tool calls in the form of code, in contrast to the more traditional approach using a JSON format with the tool name and arguments.
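To make the contrast concrete, here is a toy illustration; the get_order_status and notify_support tools are hypothetical, and the code action is shown as a string, the way an LLM would emit it.
# Traditional tool calling: the LLM emits structured JSON naming one tool and its arguments
json_action = {"name": "get_order_status", "arguments": {"order_id": "A123"}}

# Code agent: the LLM emits a Python snippet that the framework executes, so a single
# action can combine several tools with variables, conditions and loops
code_action = """
delayed_orders = []
for order_id in ["A123", "B456", "C789"]:
    if get_order_status(order_id) == "delayed":
        delayed_orders.append(order_id)
notify_support(delayed_orders)
"""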
Several recent papers have shown that using code in agentic flows leads to better results.
It makes sense when you think about it. We’ve been developing programming languages for decades to solve complex problems, so it’s natural that they are better suited to an LLM’s tasks than simple JSON configs. An additional benefit is that LLMs are already quite good at writing code in common programming languages, thanks to the vast amount of training data available.
This approach comes with several other benefits as well:
- By generating code, an LLM is not limited to a predefined set of tools and can create its own functions.
- It can combine multiple tools within a single action using conditions and loops, which helps reduce the number of steps required to complete a task.
- It also enables the model to work with a wider variety of outputs, such as generating charts, images, or other complex objects.
These benefits aren’t just theoretical; we can observe them in practice. In “Executable Code Actions Elicit Better LLM Agents”, the authors show that code agents outperform traditional methods, achieving a higher success rate and completing a task in fewer steps, which in turn reduces costs.

Code agents look promising, which inspired me to try this approach in practice.
HuggingFace smolagents framework
First try
Luckily, we don’t need to build code agents from scratch, as HuggingFace has released a handy library called smolagents that implements this approach.
Let’s start by installing the library.
pip install "smolagents[litellm]"
# I've used the litellm extra, since I'm planning to use an OpenAI model
Next, let’s build a basic example. To initialise the agent, we need just two parameters: model and tools.
I plan to use an OpenAI model, which is accessible via LiteLLM. However, the framework supports other options as well: you can run a local model via Ollama or TransformersModel, use public models via Inference Providers, or choose other options (you can find more details in the documentation).
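For reference, here’s roughly what a local set-up could look like (a sketch that assumes you have Ollama running locally with the llama3.1 model pulled).
from smolagents import LiteLLMModel

# Any LiteLLM-supported provider works, e.g. a local Ollama model
local_model = LiteLLMModel(
    model_id="ollama_chat/llama3.1",    # LiteLLM's naming convention for Ollama models
    api_base="http://localhost:11434",  # default Ollama endpoint
)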
I didn’t specify any tools, but used add_base_tools = True, so my agent has a default set of tools, such as a Python interpreter or DuckDuckGo search. Let’s try it out with a simple question.
from smolagents import CodeAgent, LiteLLMModel

# config is assumed to be a dict with my API keys (e.g. loaded from a .env file)
model = LiteLLMModel(
    model_id="openai/gpt-4o-mini",
    api_key=config['OPENAI_API_KEY']
)

agent = CodeAgent(tools=[], model=model, add_base_tools=True)
agent.run(
    """I have 5 different balls and I randomly select 2.
    How many possible combinations of the balls can I get?""",
)
As a result, we see a really nicely formatted execution flow that makes it easy to follow exactly what the agent did.

So, the agent found an answer in one step and wrote Python code to calculate the number of combinations.
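I won’t paste the whole log here, but the generated code boiled down to something like this (my reconstruction, not a verbatim copy of the agent’s output).
import math

n_combinations = math.comb(5, 2)  # 5 choose 2 = 10
print(n_combinations)
# ...which the agent then returned via its built-in final_answer tool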
The output is quite helpful, but we can go even deeper and look at the complete information related to the execution (including prompts) via agent.memory.steps. Let’s look at the system prompt used by the agent.
You are an expert assistant who can solve any task using code blobs.
You will be given a task to solve as best you can.
To do so, you have been given access to a list of tools: these tools
are basically Python functions which you can call with code.
To solve the task, you must plan forward to proceed in a series of
steps, in a cycle of 'Thought:', 'Code:',
and 'Observation:' sequences.
At each step, in the 'Thought:' sequence, you should first explain
your reasoning towards solving the task and the tools that you want
to use.
Then in the 'Code:' sequence, you should write the code in simple
Python. The code sequence must end with '<end_code>' sequence.
During each intermediate step, you can use 'print()' to save
whatever important information you will then need.
These print outputs will then appear in the 'Observation:' field,
which will be available as input for the next step.
In the end you have to return a final answer using
the final_answer tool.
Here are a few examples using notional tools: <...>
It’s pretty clear that smolagents implements the ReAct approach (introduced in the paper by Yao et al. “ReAct: Synergizing Reasoning and Acting in Language Models”) and uses a few-shot prompting technique.
The smolagents library handles all behind-the-scenes work involved in the agent workflow: assembling the system prompt with all necessary information for the LLM (i.e. available tools), parsing the output and executing the generated code. It also provides comprehensive logging and a retry mechanism to help correct mistakes.
Additionally, the library offers memory management features. By default, all execution results are saved to memory, but you can customise this behaviour. For example, you can remove some intermediary results from the memory to reduce the number of tokens or execute the agent step-by-step. While we won’t dive deep into memory management here, you can find handy code examples in the documentation.
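As a quick illustration, here’s a minimal sketch of inspecting and trimming the memory after a run. It relies only on agent.memory.steps (which we used above); treat the trimming line as a rough hack rather than the official API.
# Inspect what the agent stored during the run
for i, step in enumerate(agent.memory.steps):
    print(i, type(step).__name__)

# A crude way to shrink the context before the next run: keep only the latest steps
agent.memory.steps = agent.memory.steps[-2:]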
Security
Now, it’s time to discuss the drawbacks of the code agent approach. Giving an LLM more agency by allowing it to execute arbitrary code introduces higher risks: an LLM can run harmful code either by mistake (since LLMs are still far from perfect) or as a result of targeted attacks such as prompt injections or compromised models.
To mitigate these risks, the local Python executor implemented in the smolagents library has a bunch of safety checks:
- By default, imports are not allowed unless the package has been explicitly added to the additional_authorized_imports list.
- Moreover, submodules are blocked by default, so you must authorise them specifically (e.g. numpy.*). This is because some packages can expose potentially harmful submodules, e.g. random._os.
- The total number of executed operations is capped, preventing infinite loops and resource bloating.
- Any operation not explicitly defined in the interpreter will raise an error.
Let’s test whether these safety measures actually work.
from smolagents.local_python_executor import LocalPythonExecutor

custom_executor = LocalPythonExecutor(["numpy.*", "random"])

# helper function to print exceptions nicely
def run_capture_exception(command: str):
    try:
        custom_executor(command)
    except Exception as e:
        print("ERROR:\n", e)
# Unauthorised imports are blocked
harmful_command="import os; exit_code = os.system('')"
run_capture_exception(harmful_command)
# ERROR: Code execution failed at line 'import os' due to:
# InterpreterError: Import of os is not allowed. Authorized imports
# are: ['datetime', 'itertools', 're', 'math', 'statistics', 'time', 'queue',
# 'numpy.*', 'random', 'collections', 'unicodedata', 'stat']
# Submodules are also blocked unless stated specifically
harmful_command="from random import _os; exit_code = _os.system('')"
run_capture_exception(harmful_command)
# ERROR: Code execution failed at line 'exit_code = _os.system('')'
# due to: InterpreterError: Forbidden access to module: os
# The cap on the number of iterations prevents infinite loops
harmful_command = '''
while True:
    pass
'''
run_capture_exception(harmful_command)
# ERROR: Code execution failed at line 'while True: pass' due to:
# InterpreterError: Maximum number of 1000000 iterations in While loop
# exceeded
# Undefined operations don't work
harmful_command = "!echo "
run_capture_exception(harmful_command)
# ERROR: Code parsing failed on line 1 due to: SyntaxError
It seems we have some safety nets with code agents. However, despite these safeguards, risks persist when you’re executing code locally. For example, an LLM can recursively create threads on your computer or create too many files, leading to resource bloating. A possible solution is to execute code in a sandboxed environment, such as using Docker or solutions like E2B. I’m willing to be adventurous and run my code locally, but if you prefer a more risk-averse approach, you can follow the sandbox set-up guidance in the documentation.
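For completeness, here’s a minimal sketch of switching to a sandbox. The executor_type argument comes from the smolagents secure-execution docs; E2B additionally requires an E2B_API_KEY and the corresponding extra installed, so treat the details as version-dependent.
# Execute the generated code in a remote E2B sandbox (or a local Docker container)
sandboxed_agent = CodeAgent(
    tools=[],
    model=model,
    add_base_tools=True,
    executor_type="e2b",  # or "docker"
)
sandboxed_agent.run("What is the 100th Fibonacci number?")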
Code agent vs traditional Tool-Calling agent
It’s claimed that code agents perform better than the traditional JSON-based approach. Let’s put this claim to the test.
I will use the task of metrics change analysis that I described in my previous article, “Making sense of KPI changes”. We will start with a straightforward case: analysing a simple metric (revenue) split by one dimension (country).
import pandas as pd

raw_df = pd.read_csv('absolute_metrics_example.csv', sep='\t')
df = raw_df.groupby('country')[['revenue_before', 'revenue_after_scenario_2']].sum()\
    .sort_values('revenue_before', ascending=False)\
    .rename(columns={'revenue_after_scenario_2': 'after',
                     'revenue_before': 'before'})

The smolagents library supports two classes, which we can use to compare two approaches:
- CodeAgent — an agent that acts by generating and executing code,
- ToolCallingAgent — a traditional JSON-based agent.
Our agents will need some tools, so let’s implement them. There are multiple options to create tools in smolagents: we can re-use LangChain tools, download them from HuggingFace Hub or simply create Python functions. We will take the most straightforward approach by writing a couple of Python functions and annotating them with @tool.
I will create two tools: one to estimate the relative difference between metrics and another to calculate the sum of a list. Since the LLM will be using these tools, providing detailed descriptions is crucial.
@tool
def calculate_metric_increase(before: float, after: float) -> float:
    """
    Calculate the percentage change of the metric between before and after

    Args:
        before: value before
        after: value after
    """
    # note: with this formula, a metric decrease yields a positive number
    return (before - after) * 100 / before

@tool
def calculate_sum(values: list) -> float:
    """
    Calculate the sum of list

    Args:
        values: list of numbers
    """
    return sum(values)
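Before wiring these tools into an agent, a quick sanity check helps, and it also highlights the sign convention of calculate_metric_increase. (Objects created with @tool stay callable as far as I can tell, so treat this as an assumption about your smolagents version.)
print(calculate_metric_increase(before=100.0, after=90.0))  # 10.0, i.e. positive for a 10% drop
print(calculate_sum([1, 2, 3]))                             # 6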
Teaser: I will later realise that I should have provided more tools to the agent, but I genuinely overlooked them.
CodeAgent
Let’s start with a CodeAgent. I’ve initialised the agent with the tools we defined earlier and authorised the usage of some Python packages that might be helpful.
agent = CodeAgent(
    model=model,
    tools=[calculate_metric_increase, calculate_sum],
    max_steps=10,
    additional_authorized_imports=["pandas", "numpy", "matplotlib.*", "plotly.*"],
    verbosity_level=1
)
task = """
Here is a dataframe showing revenue by segment, comparing values
before and after.
Could you please help me understand the changes? Specifically:
1. Estimate how the total revenue and the revenue for each segment
have changed, both in absolute terms and as a percentage.
2. Calculate the contribution of each segment to the total
change in revenue.
Please round all floating-point numbers in the output
to two decimal places.
"""
agent.run(
    task,
    additional_args={"df": df},
)
Overall, the code agent completed the task in just two steps, using only 5,451 input and 669 output tokens. The result also looks quite plausible.
{'total_before': 1731985.21,
 'total_after': 1599065.55,
 'total_change': -132919.66,
 'segment_changes': {
   'absolute_change': {'other': 4233.09, 'UK': -4376.25, 'France': -132847.57,
                       'Germany': -690.99, 'Italy': 979.15, 'Spain': -217.09},
   'percentage_change': {'other': 0.67, 'UK': -0.91, 'France': -55.19,
                         'Germany': -0.43, 'Italy': 0.81, 'Spain': -0.23},
   'contribution_to_change': {'other': -3.18, 'UK': 3.29, 'France': 99.95,
                              'Germany': 0.52, 'Italy': -0.74, 'Spain': 0.16}}}
Let’s take a look at the execution flow. The LLM received the following prompt.
╭─────────────────────────── New run ────────────────────────────╮
│ │
│ Here is a pandas dataframe showing revenue by segment, │
│ comparing values before and after. │
│ Could you please help me understand the changes? │
│ Specifically: │
│ 1. Estimate how the total revenue and the revenue for each │
│ segment have changed, both in absolute terms and as a │
│ percentage. │
│ 2. Calculate the contribution of each segment to the total │
│ change in revenue. │
│ │
│ Please round all floating-point numbers in the output to two │
│ decimal places. │
│ │
│ You have been provided with these additional arguments, that │
│ you can access using the keys as variables in your python │
│ code: │
│ {'df': before after │
│ country │
│ other 632767.39 637000.48 │
│ UK 481409.27 477033.02 │
│ France 240704.63 107857.06 │
│ Germany 160469.75 159778.76 │
│ Italy 120352.31 121331.46 │
│ Spain 96281.86 96064.77}. │
│ │
╰─ LiteLLMModel - openai/gpt-4o-mini ────────────────────────────╯
In the first step, the LLM generated a dataframe and performed all calculations. Interestingly, it chose to write all the code independently rather than using the provided tools.
Even more surprisingly, the LLM recreated the dataframe from the input values instead of referencing it directly. This approach is not ideal (especially when working with massive datasets), as it can lead to errors and higher token usage. This behaviour could potentially be improved by using a more explicit system prompt or task instructions (I’ll sketch the latter at the end of this example). Here’s the code the agent executed in the first step.
import pandas as pd

# Creating the DataFrame from the provided data
data = {
    'before': [632767.39, 481409.27, 240704.63, 160469.75, 120352.31, 96281.86],
    'after': [637000.48, 477033.02, 107857.06, 159778.76, 121331.46, 96064.77]
}
index = ['other', 'UK', 'France', 'Germany', 'Italy', 'Spain']
df = pd.DataFrame(data, index=index)

# Calculating total revenue before and after
total_before = df['before'].sum()
total_after = df['after'].sum()

# Calculating absolute and percentage change for each segment
df['absolute_change'] = df['after'] - df['before']
df['percentage_change'] = (df['absolute_change'] / df['before']) * 100

# Calculating total revenue change
total_change = total_after - total_before

# Calculating contribution of each segment to the total change
df['contribution_to_change'] = (df['absolute_change'] / total_change) * 100

# Rounding results
df = df.round(2)

# Printing the calculated results
print("Total revenue before:", total_before)
print("Total revenue after:", total_after)
print("Total change in revenue:", total_change)
print(df)
In the second step, the LLM simply constructed the final answer by referring to the variables calculated at the previous step (which is really neat).
final_answer({
    "total_before": round(total_before, 2),
    "total_after": round(total_after, 2),
    "total_change": round(total_change, 2),
    "segment_changes": df[['absolute_change', 'percentage_change',
                           'contribution_to_change']].to_dict()
})
It worked pretty well.
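One small aside before moving on: the cheapest way to discourage the dataframe re-creation noted earlier is to state the requirement explicitly in the task. A minimal sketch (I haven’t verified that this reliably changes the behaviour):
explicit_task = task + """
Use the provided dataframe `df` directly; do not re-create it from the printed values."""

agent.run(
    explicit_task,
    additional_args={"df": df},
)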
ToolCallingAgent
Now, it’s time to see how a traditional tool-calling agent tackles this problem. I initialised it in a similar manner and ran the same task.
from smolagents import ToolCallingAgent

traditional_agent = ToolCallingAgent(
    model=model,
    tools=[calculate_metric_increase, calculate_sum],
    max_steps=30,
)

# reusing the same task definition as for the CodeAgent
traditional_agent.run(
    task,
    additional_args={"df": df},
)
The results are far from ideal: only the relative changes are correct, while the rest of the numbers are pure hallucinations. I have to admit, the core issue was the lack of appropriate tools (specifically, tools to calculate differences and to estimate shares). However, the agent should have flagged missing tools rather than generating random numbers.
Total revenue change: -7319.66 (-7.67%). Revenue Changes by Segment:
- Other: +232.09 (-0.67%)
- UK: -4376.25 (0.91%)
- France: -132847.57 (55.19%)
- Germany: -690.99 (0.43%)
- Italy: +979.15 (-0.81%)
- Spain: -217.09 (0.23%)
Contribution to total change:
- Other: 0.03%
- UK: -59.88%
- France: -181.77%
- Germany: -9.43%
- Italy: +13.38%
- Spain: -0.03%
In terms of resource usage, the tool-calling agent performed significantly worse: 12 steps, with 29,201 input and 1,695 output tokens. So, code agents clearly offer cost savings when it comes to agent execution.
Let’s dig a bit deeper to understand what the agent actually did. First, it took four iterations just to figure out how to call calculate_sum correctly and estimate the total revenue before and after.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Step 1 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╭──────────────────────────────────────────────────────────────╮
│ Calling tool: 'calculate_sum' with arguments: │
│ {'values': {'before': [632767.39, 481409.27, 240704.63, │
│ 160469.75, 120352.31, 96281.86], │
│ 'after': [637000.48, 477033.02, 107857.06, 159778.76, │
│ 121331.46, 96064.77]}} │
╰──────────────────────────────────────────────────────────────╯
Invalid call to tool 'calculate_sum' with arguments
{"values": {"before": [632767.39, 481409.27, 240704.63,
160469.75, 120352.31, 96281.86],
"after": [637000.48, 477033.02, 107857.06, 159778.76,
121331.46, 96064.77]}}:
unsupported operand type(s) for +: 'int' and 'str'
You should call this tool with correct input arguments.
Expected inputs: {"values": {"type": "object",
"description": "list of values"}}
Returns output type: number
Tool description: 'Calculate the sum of list'
[Step 1: Duration 4.06 sec | Input tokens: 1,190 | Output: 250]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Step 2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╭──────────────────────────────────────────────────────────────╮
│ Calling tool: 'calculate_sum' with arguments: │
│ {'values': {'before': [632767.39, 481409.27, 240704.63, │
│ 160469.75, 120352.31, 96281.86], │
│ 'after': [637000.48, 477033.02, 107857.06, 159778.76, │
│ 121331.46, 96064.77]}} │
╰──────────────────────────────────────────────────────────────╯
Invalid call to tool 'calculate_sum' with arguments
{"values": {"before": [632767.39, 481409.27, 240704.63,
160469.75, 120352.31, 96281.86],
"after": [637000.48, 477033.02, 107857.06, 159778.76,
121331.46, 96064.77]}}:
unsupported operand type(s) for +: 'int' and 'str'
You should call this tool with correct input arguments.
Expected inputs: {"values": {"type": "object",
"description": "list of values"}}
Returns output type: number
Tool description: 'Calculate the sum of list'
[Step 2: Duration 8.60 sec | Input tokens: 2,815 | Output: 386]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Step 3 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╭──────────────────────────────────────────────────────────────╮
│ Calling tool: 'calculate_sum' with arguments: │
│ {'values': [632767.39, 481409.27, 240704.63, 160469.75, │
│ 120352.31, 96281.86]} │
╰──────────────────────────────────────────────────────────────╯
Observations: 1731985.2100000002
[Step 3: Duration 1.23 sec | Input tokens: 4,871 | Output: 488]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Step 4 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╭──────────────────────────────────────────────────────────────╮
│ Calling tool: 'calculate_sum' with arguments: │
│ {'values': [637000.48, 477033.02, 107857.06, 159778.76, │
│ 121331.46, 96064.77]} │
╰──────────────────────────────────────────────────────────────╯
Observations: 1599065.55
The next seven steps were spent calculating the relative metric changes using the calculate_metric_increase tool.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Step 5 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╭──────────────────────────────────────────────────────────────╮
│ Calling tool: 'calculate_metric_increase' with │
│ arguments: {'before': 1731985.21, 'after': 1599065.55} │
╰──────────────────────────────────────────────────────────────╯
Observations: 7.674410799385517
In the end, the agent put together the final answer.
So, even if the LLM had been given tools to calculate the absolute difference and the share of the sum, it would have needed roughly 14 additional iterations and even more tokens. Of course, we can prevent such inefficiencies by carefully designing the tools we provide:
- We could modify our functions to work with lists of values instead of single items, which would significantly reduce the number of steps (see the sketch right after this list).
- Additionally, we could create more complex functions that calculate all the necessary metrics at once (similar to what the code agent did). This way, the LLM wouldn’t need to perform calculations step-by-step. However, this approach might reduce the flexibility of the system.
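For illustration, here’s a minimal sketch of the first option: a vectorised version of the change tool (a hypothetical tool, not one used in the experiment above).
from smolagents import tool

@tool
def calculate_metric_changes(before: list, after: list) -> dict:
    """
    Calculate absolute and percentage changes for paired lists of metric values

    Args:
        before: list of values before the change
        after: list of values after the change, in the same order
    """
    absolute = [round(a - b, 2) for b, a in zip(before, after)]
    percentage = [round((a - b) * 100 / b, 2) for b, a in zip(before, after)]
    return {"absolute_change": absolute, "percentage_change": percentage}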
Even though the results weren’t ideal due to a poor choice of tools, I still find this example quite insightful. It’s clear that code agents are more powerful, cost-efficient and flexible as they can invent their own comprehensive tools and perform multiple actions in a single step.
You can find the complete code and execution logs on GitHub.
Summary
We’ve learned a lot about code agents, so it’s time to wrap things up with a quick summary.
Code agents are LLM agents that “think” and act using Python code. Instead of calling tools via JSON, they generate and execute actual code. This makes them more flexible and cost-efficient, as they can invent their own comprehensive tools and perform multiple actions in a single step.
HuggingFace has brought this approach to life in their framework, smolagents. Smolagents makes it easy to build quite complex agents without much hassle, while also providing safety measures during the code execution.
In this article, we’ve explored the basic functionality of the smolagents library. But there’s a lot more to it. In the next article, we will dive into more advanced features (like multi-agent setup and planning steps) to build the agent that can narrate KPI changes. Stay tuned!
Thank you for reading this article. I hope it was insightful for you.
Reference
This article is inspired by the “Building Code Agents with Hugging Face smolagents” short course by DeepLearning.AI.