Multi-Modal Chatbots Using the LangChain Framework

6 min read · Jul 1, 2024

Agents play an important role in Generative AI: they improve Large Language Models (LLMs) by enabling them to access external data and perform complex tasks. As an example, I built a multi-modal chatbot that uses LangChain, ChatGPT, and DALL·E 3, with Streamlit providing the user interface.

What is LangChain?

For an introduction to LangChain, see the article “LangChain: Step-by-Step Guide to Building a Custom-Knowledge Chatbot” on medium.com.

Challenge

The challenge involves integrating real-world information into tasks that exceed the training scope of Large Language Models (LLMs). This includes accessing proprietary APIs, introducing new data formats like files or images, and fostering discussions based on this novel data. The agent aims to break down these complex tasks into smaller, manageable steps, determining which tools to use and in what sequence to achieve the desired outcomes effectively.

The Role of Agents in Addressing This Challenge

In the LangChain framework, agents address this challenge by using tools such as external APIs, Google search, and image generation. Here’s how an agent works: upon receiving a user task or query, the agent asks the LLM to break the task down into smaller steps. It then selects and invokes the necessary tools, and the LLM analyzes their outputs. This iterative cycle of reasoning and tool invocation continues until the problem is resolved and an answer is returned to the user.

The Multi-Modal Chatbot Architecture

The diagram illustrates the structure of the multi-modal chatbot system:

Prompt Refinement. The initial user prompt and context of the conversation history are forwarded to the LLM (in this scenario, ChatGPT) to refine the prompt into a more precise query.
Thought Process. The agent relays the refined prompt, alongside any optional tools, to the LLM for reasoning. Based on this, it decides which tool to employ. If the final answer is determined at this stage, it is directly communicated to the user.
Tool Invocation. The agent executes the chosen tool.
Observation. The output generated by the tool is sent back to the LLM by the agent for further reasoning.
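
The thought, tool invocation, and observation steps above can be sketched as a plain Python loop. This is only an illustration of the control flow, not LangChain's implementation: `fake_llm`, `run_agent`, and the `lookup_capital` tool are hypothetical names, and the "LLM" is a hard-coded stand-in so the example runs without an API key.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in: a "tool" is any function from a string to a string.
Tool = Callable[[str], str]


def fake_llm(prompt: str, observations: List[str]) -> Tuple[str, str]:
    """Stand-in for the LLM's reasoning step (an assumption, not a real model).

    Returns ("tool", tool_name) to request a tool call,
    or ("final", answer) once it can answer the user.
    """
    if not observations:
        return ("tool", "lookup_capital")
    return ("final", f"The capital is {observations[-1]}.")


def run_agent(prompt: str, tools: Dict[str, Tool], max_steps: int = 5) -> str:
    observations: List[str] = []
    for _ in range(max_steps):
        # Thought process: the LLM picks a tool or produces the final answer.
        kind, value = fake_llm(prompt, observations)
        if kind == "final":
            return value
        # Tool invocation: the agent executes the chosen tool.
        result = tools[value](prompt)
        # Observation: the tool output is fed back for further reasoning.
        observations.append(result)
    return "Stopped: step limit reached."


tools = {"lookup_capital": lambda question: "Paris"}
answer = run_agent("What is the capital of France?", tools)
print(answer)
```

The `max_steps` limit is a design choice of this sketch: it keeps the loop from running forever if the model never settles on a final answer.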

Development

Here’s the project structure:

multiple-model-chatbot/
├── src/
│   ├── agents/
│   │   └── conversational_agent.py
│   └── tools/
│       ├── countries_image_generator.py
│       ├── get_countries_by_name.py
│       └── google_search.py
├── .env
├── .gitignore
├── Home.py
├── LICENSE
├── README.md
└── requirements.txt

Agents

Each directory and file in this structure serves a specific purpose:

src/: Source code directory.
agents/: Contains agent-related code.
conversational_agent.py: Python script for the conversational agent.

I developed the entire chatbot in Python; here is a code snippet of the agent creation:

from langchain.agents import AgentExecutor
from langchain.agents.format_scratchpad import format_to_openai_functions
from langchain.agents.output_parsers import OpenAIFunctionsAgentOutputParser
from langchain.memory import ConversationBufferWindowMemory
from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.schema.runnable import RunnablePassthrough
from langchain_core.utils.function_calling import convert_to_openai_function
from langchain_openai.chat_models import ChatOpenAI
from src.tools.countries_image_generator import countries_image_generator
from src.tools.get_countries_by_name import get_countries_by_name
from src.tools.google_search import google_search
import os
from dotenv import load_dotenv
load_dotenv()

def create_agent():
    tools = [countries_image_generator, get_countries_by_name, google_search]
    functions = [convert_to_openai_function(f) for f in tools]
    model = ChatOpenAI(
        model_name=os.getenv("OPENAI_MODEL_NAME"),
        temperature=0
    ).bind(functions=functions)
    prompt = ChatPromptTemplate.from_messages([
        ("system", "You are helpful but sassy assistant"),
        MessagesPlaceholder(variable_name="chat_history"),
        ("user", "{input}"),
        MessagesPlaceholder(variable_name="agent_scratchpad")
    ])
    memory = ConversationBufferWindowMemory(
        return_messages=True, memory_key="chat_history", k=5)
    chain = RunnablePassthrough.assign(
        agent_scratchpad=lambda x: format_to_openai_functions(
            x["intermediate_steps"])
    ) | prompt | model | OpenAIFunctionsAgentOutputParser()
    agent_executor = AgentExecutor(
        agent=chain, tools=tools, memory=memory, verbose=True)
    return agent_executor
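
The agent above is created with a windowed memory (`k=5`), so only the most recent exchanges are passed back to the model as `chat_history`. The idea can be sketched in plain Python as follows. `WindowMemory` is a hypothetical class written for illustration, not LangChain's `ConversationBufferWindowMemory`, and it assumes one "exchange" means one user message plus one assistant reply.

```python
from collections import deque
from typing import Deque, List, Tuple


class WindowMemory:
    """Keeps only the last k (user, assistant) exchanges."""

    def __init__(self, k: int) -> None:
        # A deque with maxlen silently drops the oldest entry when full.
        self.exchanges: Deque[Tuple[str, str]] = deque(maxlen=k)

    def save(self, user_msg: str, ai_msg: str) -> None:
        self.exchanges.append((user_msg, ai_msg))

    def history(self) -> List[Tuple[str, str]]:
        return list(self.exchanges)


memory = WindowMemory(k=5)
for i in range(7):
    memory.save(f"question {i}", f"answer {i}")

window = memory.history()
print(len(window))
print(window[0])
```

After seven exchanges only the last five remain, which bounds the prompt size at the cost of forgetting older context.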

Tools

tools/: Contains utility scripts.
countries_image_generator.py: This AI-powered tool uses DALL·E 3 to generate an image of a country from a textual description.

from langchain.tools import tool
from langchain_community.utilities.dalle_image_generator import DallEAPIWrapper


@tool
def countries_image_generator(country: str):
    """Call this to get an image of a country"""
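    # Build the image prompt without the stray whitespace of a backslash continuation
    prompt = (
        "Generate an image of a country that represents its most typical "
        f"characteristics and incorporates its flag. The country is {country}"
    )
    res = DallEAPIWrapper(model="dall-e-3").run(prompt)

    # Ask the agent to reply with a Markdown link, which Home.py parses for the image URL
    answer_to_agent = (
        f"Use this format- Here is an image of {country}: [{country} Image]({res})")
    return answer_to_agent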

get_countries_by_name.py: This tool allows your chatbot to retrieve structured data about countries from the REST Countries API.

from typing import Optional

import requests
from langchain.tools import tool
from pydantic.v1 import BaseModel, Field, conlist
from requests import PreparedRequest


def prepare_and_log_request(base_url: str, params: Optional[dict] = None) -> PreparedRequest:
    """Prepare the request and log the full URL."""
    req = PreparedRequest()
    req.prepare_url(base_url, params)
    print(f'\033[92mCalling API: {req.url}\033[0m')
    return req


class Params(BaseModel):
    fields: Optional[conlist(str, min_items=1, max_items=27)] = Field( # type: ignore
        default=None, description='Fields to filter the output of the request.', examples=[
            "name", "topLevelDomain", "alpha2Code", "alpha3Code", "currencies", "capital", "callingCodes", "altSpellings", "region",
            "subregion", "population", "latlng", "demonym", "area", "gini", "timezones", "borders", "nativeName", "numericCode",
            "languages", "flag", "regionalBlocs", "cioc"
        ])


class PathParams(BaseModel):
    name: str = Field(..., description='Name of the country')


class RequestModel(BaseModel):
    params: Optional[Params] = None
    path_params: PathParams


@tool(args_schema=RequestModel)
def get_countries_by_name(path_params: PathParams, params: Optional[Params] = None):
    """Useful for when you need to answer questions about countries. Input should be a fully formed question."""
    base_url = f'https://restcountries.com/v3.1/name/{path_params.name}'

    effective_params = {"fields": ",".join(
        params.fields)} if params and params.fields else None

    req = prepare_and_log_request(base_url, effective_params)

    # Make the request
    response = requests.get(req.url)

    # Raise an exception if the request was unsuccessful
    response.raise_for_status()

    return response.json()
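
To make the request format concrete, here is a minimal sketch of how the REST Countries URL is assembled from a country name and an optional `fields` list. It uses only the standard library instead of `requests`, and `build_countries_url` is a hypothetical helper, not part of the project. Percent-encoding the name with `quote` is an assumption of this sketch.

```python
from typing import List, Optional
from urllib.parse import quote, urlencode


def build_countries_url(name: str, fields: Optional[List[str]] = None) -> str:
    """Return the lookup URL for a country, optionally filtered by fields."""
    base_url = f"https://restcountries.com/v3.1/name/{quote(name)}"
    if not fields:
        return base_url
    # The API expects a single comma-separated "fields" query parameter.
    return f"{base_url}?{urlencode({'fields': ','.join(fields)})}"


url = build_countries_url("france", ["name", "capital", "population"])
print(url)
```

Limiting the response with `fields` keeps the tool output small, which matters because the whole JSON payload is handed back to the LLM as an observation.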

google_search.py: Integrating Google search capabilities enables your chatbot to fetch up-to-date and extensive information from the web.

from langchain.tools import tool
from langchain_community.utilities import SerpAPIWrapper

@tool
def google_search(query: str):
    """Performs a Google search using the provided query string. Choose this tool when you need to find current data"""
    return SerpAPIWrapper().run(query)
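
The docstring of each tool matters because it is what the model reads when deciding which tool to call. The sketch below shows, in simplified form, how a function's name, docstring, and parameters could be turned into a function description. `describe_tool` is a hypothetical helper and only approximates the kind of schema a conversion like `convert_to_openai_function` produces; treating every parameter as a string is an assumption made for brevity.

```python
import inspect
from typing import Any, Callable, Dict


def describe_tool(func: Callable[..., Any]) -> Dict[str, Any]:
    """Build a simplified, OpenAI-function-style description of a callable."""
    signature = inspect.signature(func)
    properties = {
        # Simplification: every parameter is described as a string.
        name: {"type": "string"} for name in signature.parameters
    }
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "parameters": {
            "type": "object",
            "properties": properties,
            "required": list(signature.parameters),
        },
    }


def google_search(query: str) -> str:
    """Performs a Google search using the provided query string."""
    return f"results for {query}"  # placeholder body, no network call


schema = describe_tool(google_search)
print(schema["name"], "->", schema["description"])
```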

.env: Environment variables file.

OPENAI_API_KEY="Enter api key"
OPENAI_MODEL_NAME=gpt-3.5-turbo
SERPAPI_API_KEY="Enter api key"
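
Since the agent and its tools depend on these three variables, it can help to fail early when one is missing. The following is a small sketch of such a check. `missing_env_vars` and `REQUIRED_VARS` are hypothetical names that do not appear in the project, and the key value in the example is a placeholder.

```python
import os
from typing import List, Mapping

# Variable names taken from the .env file above.
REQUIRED_VARS = ("OPENAI_API_KEY", "OPENAI_MODEL_NAME", "SERPAPI_API_KEY")


def missing_env_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example_env = {"OPENAI_API_KEY": "sk-placeholder", "OPENAI_MODEL_NAME": "gpt-3.5-turbo"}
print(missing_env_vars(example_env))
```

In the app, calling `missing_env_vars()` right after `load_dotenv()` and stopping with a clear message would be one way to use it.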

Streamlit UI

Home.py: Main entry point for the project.

import re
import streamlit as st
from dotenv import load_dotenv
from PIL import Image
from src.agents.conversational_agent import create_agent

load_dotenv('.env')


def display_header_and_image():
    """
    Displays the header information for the chatbot and an image.
    """

    st.markdown('# Multi-Modal Chatbot')
    st.markdown(
        'Powered by Langchain Agents, OpenAI Function Calling, and Streamlit')
    image = Image.open('images/ref.png')
    width, height = image.size
    image = image.resize((width // 2, height // 2))
    st.sidebar.image(image, caption='Image created by DALL·E 3')


def initialize_session():
    """
    Initializes or resets session variables.
    """
    if 'responses' not in st.session_state:
        st.session_state['responses'] = [
            {'text': 'How can I assist you?', 'image_url': None}]
    if 'requests' not in st.session_state:
        st.session_state['requests'] = []


def display_chat_history():
    """
    Displays the chat history.
    """
    for i, response in enumerate(st.session_state['responses']):
        with st.chat_message('assistant'):
            st.write(response['text'])
            if response['image_url']:
                st.image(response['image_url'], use_column_width=True)

        if i < len(st.session_state['requests']):
            with st.chat_message('user'):
                st.write(st.session_state['requests'][i])


def main():
    display_header_and_image()
    initialize_session()

    if 'agent' not in st.session_state:
        st.session_state.agent = create_agent()
    # container for chat history
    chat_container = st.container()

    # container for user's prompt
    prompt_container = st.container()

    with prompt_container:
        query = st.text_input(
            'Prompt: ', placeholder='Enter your prompt here..')
        if query:
            with st.spinner('Generating Response...'):
                # Use invoke(); calling the executor directly is deprecated in recent LangChain versions
                result = st.session_state.agent.invoke({'input': query})
                st.session_state.requests.append(query)

                # Extract the URL from the result
                pattern = r'(.*:)\s*\[.*?\]\((.*?)\)'
                match = re.search(pattern, result['output'])
                if match:
                    text_before_link = match.group(1)
                    image_url = match.group(2)

                else:
                    text_before_link = result['output']
                    image_url = None

                # Store the response in the session state
                st.session_state.responses.append({
                    'text': text_before_link,
                    'image_url': image_url,
                })

    with chat_container:
        display_chat_history()


if __name__ == '__main__':
    main()
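
The regular expression in `main()` is what separates the text from the image link in the agent's reply. Here is a small standalone sketch of that step, using the same pattern. `split_text_and_image` is a hypothetical helper name, and the URL is a made-up placeholder.

```python
import re
from typing import Optional, Tuple

# Same pattern as in Home.py: text ending in a colon, then a Markdown link.
PATTERN = r'(.*:)\s*\[.*?\]\((.*?)\)'


def split_text_and_image(output: str) -> Tuple[str, Optional[str]]:
    """Return (text, image_url); image_url is None when no link is found."""
    match = re.search(PATTERN, output)
    if match:
        return match.group(1), match.group(2)
    return output, None


with_image = split_text_and_image(
    "Here is an image of France: [France Image](https://example.com/france.png)")
without_image = split_text_and_image("Paris is the capital of France.")
print(with_image)
print(without_image)
```

Replies without a Markdown link fall through unchanged, so ordinary text answers are displayed as-is.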

How to run Streamlit app

To run a Streamlit app, follow these steps:

1. Install Streamlit: Ensure you have Python installed, then install Streamlit using pip:

pip install streamlit

2. Run the Streamlit App: Navigate to the directory containing your script and run the following command in the terminal:

streamlit run Home.py

3. Access the Streamlit App: Once the command is executed, Streamlit will start a local web server, and you can access the app in your web browser at http://localhost:8501.

By following these steps, you can easily run and develop a Streamlit app, creating interactive and visually appealing web applications for your projects.

Conclusion

Integrating agents within the LangChain framework significantly enhances the capabilities of Large Language Models (LLMs) by enabling them to access and utilize external data sources and tools, thereby overcoming the limitations of their training data. This multi-modal chatbot example, leveraging LangChain, ChatGPT, DALL·E 3, and Streamlit, demonstrates the effective use of REST Countries API, image generation, and Google search to provide a comprehensive and interactive user experience. By breaking down complex tasks into manageable steps and iteratively invoking the necessary tools, agents ensure accurate and enriched responses, ultimately offering a robust solution for detailed information retrieval and user interaction.
