How to implement your own ChatGPT Code Interpreter

What is Code Interpreter?#

After the introduction of ChatGPT Code Interpreter some time ago, everyone should have some understanding of what it is and what it can do, so we won't repeat these basic questions here. This time, let's look at how to understand Code Interpreter from the perspective of processes and elements.


In a typical ChatGPT interaction, the process and elements are: Prompt => Output Text. This is why Prompt Engineering, along with embedding work for constructing prompts, emerged immediately after ChatGPT launched: the process is short and the elements are simple, so constructing a good prompt is the key step. For Code Interpreter, the process and elements look like this: Prompt => Code ==Interpreter==> Output (Description, Image, File...). This brings some changes:

  1. Prompt construction is no longer directly aimed at output but at generating intermediate code.
  2. A code interpreter is needed to distinguish sessions, execute code, save intermediate variables, etc.
  3. Output becomes more diverse, which can include images, files, etc.

Why Implement Code Interpreter#

ChatGPT has already implemented Code Interpreter on top of OpenAI's GPT models, and its capabilities are quite powerful. So why implement our own? Apart from keeping pace with industry leaders and integrating internal model capabilities, it is worth asking what incremental benefits a self-hosted implementation can bring. Some typical ones include:

  1. Ability to interact with real-time data: ChatGPT's Code Interpreter has no network access and cannot be combined with Plugins (once Code Interpreter is enabled, plugins cannot be selected), so it lacks real-time data and cannot handle requests like "plot the stock performance of Apple in 2023."
  2. Ability to interact with more environments: After local or cloud deployment, a more flexible environment is available, whether it's operating the file system, calling APIs, or installing packages that ChatGPT Code Interpreter does not support, all become possible.

Thoughts on Implementing Code Interpreter#

To implement Code Interpreter, there are two core pieces: one is leveraging model capabilities, for example using the OpenAI API's Function Calling to generate the code to run; the other is an environment that can execute Python code. For example, if the user asks for a plot of a sine function, we need to obtain the code that plots it, send that code to a Python interpreter for execution, produce the image, and display it to the user. Along the way, the LLM agent may also need to provide some explanation and details about the result. In addition, file I/O and preserving variables within a session need to be considered.
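To make this loop concrete, here is a minimal, self-contained sketch using the OpenAI function calling API in the legacy openai 0.x SDK style. The `run_python` function definition is purely illustrative (not part of any project mentioned here), and in a real interpreter the generated code would go to a sandboxed Jupyter kernel rather than `exec()`:

import json
import openai  # assumes OPENAI_API_KEY is set in the environment

# Describe a single "function" whose only argument is the code to run.
functions = [
    {
        "name": "run_python",
        "description": "Execute Python code and return its output.",
        "parameters": {
            "type": "object",
            "properties": {
                "code": {"type": "string", "description": "Python code to execute"}
            },
            "required": ["code"],
        },
    }
]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Plot a sine function and save it as sine.png"}],
    functions=functions,
    function_call={"name": "run_python"},  # force the model to return code
)

arguments = response["choices"][0]["message"]["function_call"]["arguments"]
code = json.loads(arguments)["code"]
exec(code)  # illustrative only; a real implementation sends this to a Jupyter kernel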

Using LangChain makes this more convenient. The Python interpreter, file I/O, and variable persistence can all be viewed as a Tool in LangChain, which the LangChain Agent Executor can invoke. Following this idea, the community already has an open-source implementation: codebox-api, which can be registered as a LangChain Tool. Beyond the core capability of executing code, some surrounding pieces are still needed: session management, kernel invocation, and file I/O. That is, every time a new session is created, a new Jupyter kernel channel is created to execute the code, and the execution results are then categorized by output type and fed back to the user. The author of codebox-api has also packaged this part into a solution: codeinterpreter-api.
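As a point of reference, codebox-api can also be used on its own, before any LangChain wiring. A minimal sketch based on its README (method names may differ between versions, so treat this as illustrative):

from codeboxapi import CodeBox

# The context manager starts a local code execution box and stops it on exit.
with CodeBox() as codebox:
    output = codebox.run("a = 2 + 2; print(a)")
    print(output.type, output.content)  # e.g. "text" and "4"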

Design of Implementing Code Interpreter#

Next is the breakdown of the codeinterpreter-api project, looking at how to design and implement the above ideas. Since the project mainly uses LangChain to orchestrate the entire process, here are some basic concepts used in the LangChain part of the project:

Some Basic Concepts in LangChain:#

  • LangChain Agents: A foundational module in LangChain, the core idea is to use LLM to select a series of actions to take. Unlike hard-coded action sequences in chains, agents use language models as reasoning engines to decide which actions to take and in what order.
  • LangChain Tools: Tools are the capabilities called by Agents. Two main aspects need to be considered: providing the correct tools to the Agent and describing these tools in a way that is most helpful to the Agent. When creating a custom StructuredTool for Code Interpreter, it is necessary to define: name, description, func (synchronous function), coroutine (asynchronous function), args_schema (input schema).
  • LangChain Agent Executor: The Agent Executor is the runtime for Agents. It actually calls the Agent and executes the actions it selects. The executor also does some extra work to reduce complexity, such as handling cases where the Agent selects a non-existent Tool, a Tool raises an error, or the Agent produces output that cannot be parsed as a Tool call. (A minimal example wiring these three concepts together follows this list.)
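As a rough illustration of how these three pieces fit together, independent of the project discussed below, consider the following sketch (LangChain 0.0.x-era APIs; newer versions may differ). Here `run_python` is a stand-in for a real interpreter backend:

from langchain.agents import AgentType, initialize_agent
from langchain.chat_models import ChatOpenAI
from langchain.tools import StructuredTool

def run_python(code: str) -> str:
    """Stand-in tool: the real project backs this with a Jupyter kernel."""
    return f"(pretended to execute)\n{code}"

# Tool: name, description, and callable exposed to the agent.
python_tool = StructuredTool.from_function(
    func=run_python,
    name="python",
    description="Execute Python code and return its output.",
)

# Agent + Agent Executor: the LLM decides when to call the tool,
# the executor runs the chosen actions.
agent_executor = initialize_agent(
    tools=[python_tool],
    llm=ChatOpenAI(model="gpt-3.5-turbo"),
    agent=AgentType.OPENAI_FUNCTIONS,
    verbose=True,
)

agent_executor.run("Use the python tool to compute 2 ** 10.")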

Process Design and Implementation#

With the above basic concepts in place, we can look at how to implement Code Interpreter based on LangChain Agent. Let's examine the specific execution process through the following code:

from codeinterpreterapi import CodeInterpreterSession, File

async def main():
    # context manager for start/stop of the session
    async with CodeInterpreterSession(model="gpt-3.5-turbo") as session:
        # define the user request
        user_request = "Analyze this dataset and plot something interesting about it."
        files = [
            File.from_path("examples/assets/iris.csv"),
        ]
        # generate the response
        response = await session.generate_response(user_request, files=files)
        # output the response (text + image)
        response.show()

if __name__ == "__main__":
    import asyncio
    # run the async function
    asyncio.run(main())

The effect is as shown in the demo recording:

[Screen recording of the example session]

Execution Environment and Tool Instantiation#

When a session is created with the `async with` statement, the Jupyter kernel and the agent executor are instantiated. Here are the key steps:

  1. Create a service that communicates with the Jupyter kernel through jupyter-kernel-gateway and poll until it has started successfully.
self.jupyter = await asyncio.create_subprocess_exec(
	python,
	"-m",
	"jupyter",
	"kernelgateway",
	"--KernelGatewayApp.ip='0.0.0.0'",
	f"--KernelGatewayApp.port={self.port}",
	stdout=out,
	stderr=out,
	cwd=".codebox",
)
self._jupyter_pids.append(self.jupyter.pid)

# ...
while True:
	try:
		response = await self.aiohttp_session.get(self.kernel_url)
		if response.status == 200:
			break
	except aiohttp.ClientConnectorError:
		pass
	except aiohttp.ServerDisconnectedError:
		pass
	if settings.VERBOSE:
		print("Waiting for kernel to start...")
	await asyncio.sleep(1)
await self._aconnect()

The subprocess's stdout and stderr are redirected, its pid is recorded, and the kernel gateway instance is associated with the session. Once the gateway is reachable, an HTTP request is sent to create a kernel and a websocket connection to that kernel is established.

  2. Create the Agent Executor
def _agent_executor(self) -> AgentExecutor:
	return AgentExecutor.from_agent_and_tools(
		agent=self._choose_agent(),
		max_iterations=9,
		tools=self.tools,
		verbose=self.verbose,
		memory=ConversationBufferMemory(
			memory_key="chat_history",
			return_messages=True,
			chat_memory=self._history_backend(),
		),
	)

def _choose_agent(self) -> BaseSingleActionAgent:
	return (
		OpenAIFunctionsAgent.from_llm_and_tools(
			llm=self.llm,
			tools=self.tools,
			system_message=code_interpreter_system_message,
			extra_prompt_messages=[
				MessagesPlaceholder(variable_name="chat_history")
			],
		)
		# ...
	)

def _tools(self, additional_tools: list[BaseTool]) -> list[BaseTool]:
	return additional_tools + [
		StructuredTool(
			name="python",
			description="Input a string of code to a ipython interpreter. "
			"Write the entire code in a single string. This string can "
			"be really long, so you can use the `;` character to split lines. "
			"Variables are preserved between runs. ",
			func=self._run_handler, # Call CodeBox for synchronous execution
			coroutine=self._arun_handler, # Call CodeBox for asynchronous execution
			args_schema=CodeInput,
		),
	]

Here, OpenAIFunctionsAgent is used, and we could swap in our own Agent if needed; at the moment, however, only OpenAI's API offers convenient and powerful Function Calling, so it serves as the example. This is also where the Tool used by the Agent and Agent Executor is specified: besides the name and description, it carries the parameters for executing Python code. The Jupyter kernel instance created in the previous step is wrapped by CodeBox and passed to the Tool as its synchronous and asynchronous invocation methods.
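The args_schema referenced above, CodeInput, is in the project just a small pydantic model describing the single code argument, roughly along these lines:

from pydantic import BaseModel

class CodeInput(BaseModel):
    # The single field: the Python code string the agent wants to run.
    code: str

This schema is what the OpenAI Functions agent uses to build the JSON schema of the python tool exposed to the model.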

Handling Input Text and Files#

Since Prompt Engineering is an external step, users are expected to construct the Prompt before passing it in. This step therefore does little work: it simply appends the incoming text and file references to the Prompt (for example, noting which files the user uploaded), and uploads the files to the CodeBox instance for later execution.

class UserRequest(HumanMessage):
    files: list[File] = []

    def __str__(self):
        return self.content

    def __repr__(self):
        return f"UserRequest(content={self.content}, files={self.files})"

def _input_handler(self, request: UserRequest) -> None:
	"""Callback function to handle user input."""
	if not request.files:
		return
	if not request.content:
		request.content = (
			"I uploaded, just text me back and confirm that you got the file(s)."
		)
	request.content += "\n**The user uploaded the following files: **\n"
	for file in request.files:
		self.input_files.append(file)
		request.content += f"[Attachment: {file.name}]\n"
		self.codebox.upload(file.name, file.content)
	request.content += "**File(s) are now available in the cwd. **\n"

Execution and Result Handling#

Through the Agent Executor, we can achieve automatic conversion from prompt to code. Let's take a look at how this code executes:

def _connect(self) -> None:
	response = requests.post(
		f"{self.kernel_url}/kernels",
		headers={"Content-Type": "application/json"},
		timeout=90,
	)
	self.kernel_id = response.json()["id"]
	if self.kernel_id is None:
		raise Exception("Could not start kernel")

	self.ws = ws_connect_sync(f"{self.ws_url}/kernels/{self.kernel_id}/channels")

First, a kernel is created over HTTP and a websocket connection to it is opened.

self.ws.send(
	json.dumps(
		{
			"header": {
				"msg_id": (msg_id := uuid4().hex),
				"msg_type": "execute_request",
			},
			"content": {
				"code": code,
				# ...
			},
			# ...
		}
	)
)

Then, the code is sent to the kernel for execution over the websocket.

while True:
    # ...
	if (
		received_msg["header"]["msg_type"] == "stream"
		and received_msg["parent_header"]["msg_id"] == msg_id
	):
		msg = received_msg["content"]["text"].strip()
		if "Requirement already satisfied:" in msg:
			continue
		result += msg + "\n"
		if settings.VERBOSE:
			print("Output:\n", result)

	elif (
		received_msg["header"]["msg_type"] == "execute_result"
		and received_msg["parent_header"]["msg_id"] == msg_id
	):
		result += received_msg["content"]["data"]["text/plain"].strip() + "\n"
		if settings.VERBOSE:
			print("Output:\n", result)

	elif received_msg["header"]["msg_type"] == "display_data":
		if "image/png" in received_msg["content"]["data"]:
			return CodeBoxOutput(
				type="image/png",
				content=received_msg["content"]["data"]["image/png"],
			)
		if "text/plain" in received_msg["content"]["data"]:
			return CodeBoxOutput(
				type="text",
				content=received_msg["content"]["data"]["text/plain"],
			)
		return CodeBoxOutput(
			type="error",
			content="Could not parse output",
		)

Then, the messages returned on the channel are handled according to their type:

  • msg_type: stream: if the message text contains "Requirement already satisfied:" it is skipped; otherwise it is appended to the output, and we keep waiting for further ws messages.
  • msg_type: execute_result, append msg["content"]["data"]["text/plain"] to the output content and continue waiting.
  • msg_type: display_data, get msg["content"]["data"], if there is image/png, wrap it in CodeBoxOutput and return it; if it is text/plain, return it similarly. Otherwise, return an error type output indicating that the output result could not be parsed.
  • msg_type: status with execution_state: idle: the code ran successfully but produced no output.
  • msg_type: error: the error is reported directly. (These two branches are not in the excerpt above; a rough sketch follows below.)
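In the project, those two branches of the same while loop look roughly like this (the field names follow the standard Jupyter messaging protocol; the exact wording of the fallback message is illustrative):

	elif (
		received_msg["header"]["msg_type"] == "status"
		and received_msg["parent_header"]["msg_id"] == msg_id
		and received_msg["content"]["execution_state"] == "idle"
	):
		# Execution finished; return whatever text was accumulated so far.
		return CodeBoxOutput(
			type="text",
			content=result or "code run successfully (no output)",
		)

	elif (
		received_msg["header"]["msg_type"] == "error"
		and received_msg["parent_header"]["msg_id"] == msg_id
	):
		# ename/evalue are the exception class and message per the Jupyter protocol.
		return CodeBoxOutput(
			type="error",
			content=f"{received_msg['content']['ename']}: {received_msg['content']['evalue']}",
		)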

Output Results#

After obtaining the CodeBoxOutput above, we can process it by type. For text output, nothing extra is needed, since it has already been emitted through stdout during execution. For file-system operations, which happen directly inside Jupyter, the framework also does nothing extra; the files simply land on the local filesystem, and additional handling is only required when a description or explanation of the output files is needed. For image output, the base64-encoded image returned during execution is saved in the session's out_files; when the output is processed, it is converted into a standard Python Image and displayed with IPython's display method.

def get_image(self):
    # ...
	img_io = BytesIO(self.content)
	img = Image.open(img_io)

	# Convert image to RGB if it's not
	if img.mode not in ("RGB", "L"):  # L is for greyscale images
		img = img.convert("RGB")

	return img

def show_image(self):
	img = self.get_image()
	# Display the image
	try:
		# Try to get the IPython shell if available.
		shell = get_ipython().__class__.__name__  # type: ignore
		# If the shell is in a Jupyter notebook or similar.
		if shell == "ZMQInteractiveShell" or shell == "Shell":
			from IPython.display import display  # type: ignore

			display(img)
		else:
			img.show()
	except NameError:
		img.show()

Considerations for Internal Implementation#

If we want to implement Code Interpreter internally within the company, here are some points we can focus on:

  1. Service-oriented or solution-oriented
    1. Provide basic execution capabilities or components for existing platform modules.
    2. Provide relevant solutions to teams that need to implement Code Interpreter internally.
  2. Integrate internal models.
  3. Integrate internal systems and environments to achieve automatic API calls.
  4. Support an open technology stack that is not limited to LangChain.

Final Thoughts#

This time we only walked through the implementation of Code Interpreter from a general process and design perspective, and many small but important details were left out: how to render output on the web instead of locally, how to feed execution errors back to the model so it can regenerate code, how to automatically install missing dependencies and re-execute, and so on. These are all things a robust Code Interpreter must handle. The community solution is still an MVP, and its treatment of edge cases falls short of what real production use requires. There is still a long way to go to a complete, production-ready Code Interpreter, and we need to get our hands a bit dirtier.
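As one illustration of the kind of plumbing still missing, here is a rough, hypothetical sketch (not part of codeinterpreter-api) of an error feedback loop that asks the model for corrected code and naively auto-installs missing packages. It assumes codebox-api's async arun method and a generate_code helper that returns code from the LLM:

import re

async def run_with_retries(codebox, generate_code, user_request, max_retries=3):
    code = await generate_code(user_request)
    output = None
    for _ in range(max_retries):
        output = await codebox.arun(code)
        if output.type != "error":
            return output
        # Naive auto-install for "ModuleNotFoundError: No module named 'xyz'"
        missing = re.search(r"No module named '([^']+)'", output.content)
        if missing:
            await codebox.arun(f"%pip install {missing.group(1)}")
            continue
        # Otherwise feed the traceback back to the model and ask for corrected code
        code = await generate_code(
            f"{user_request}\nThe previous code failed with:\n{output.content}\n"
            "Return corrected code."
        )
    return output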
