August 06, 2025

Build Advanced Customer Support LLM Multi-Agent Workflow

Background: The Scale and Stakes of Customer Support at Socure

At Socure, we verify the identity of millions of individuals every day, providing real-time identity verification and fraud detection for over 3,000 enterprises. Our platform processes enormous volumes of sensitive Personally Identifiable Information (PII) and transaction data. For many of our customers, Socure is a critical part of their business infrastructure. A single disruption can delay customer onboarding, affect risk analysis, or even halt essential services.

As the adoption of our platform grows, so does the demand on our small but mighty customer support team. Each business day, dozens to hundreds of support cases are opened. These range from technical troubleshooting and API errors to questions about product features, integration, configuration, or compliance.

Why GenAI? Why Now?

To address this growing demand, we turned to generative AI and large language models (LLMs). Our goal was not to replace our support team but to empower them. We wanted to:

  • Automate routine or repetitive support tasks
  • Accelerate research and drafting of responses
  • Provide 24/7 support augmentation for quick turnaround
  • Scale support without proportionally growing the team

After evaluating different frameworks, we chose to build our solution using LangGraph, a graph-based multi-agent orchestration library built on LangChain, and host our LLMs on AWS Bedrock, which provides scalable, secure, and fully managed access to top-tier foundation models.

Solution Overview: A Multi-Agent LLM Support Framework

Our architecture is built on the principle of task specialization: different agents are responsible for answering different types of support questions. This allows us to leverage smaller, focused retrieval-augmented generation (RAG) contexts, and ensures responses are accurate and domain-specific.

The Agents

We built six domain-specialized agents:

  1. DevHub Agent – Focuses on public API documentation, SDK integration issues, admin dashboard configuration, and customer-facing product behavior.
  2. Glean Agent – Indexes our internal knowledge base—covering product strategy, SDLC, testing practices, SRE runbooks, internal communications, HR policies, and more.
  3. BI Agent – Handles analytics, customer insights, product usage trends, and sales performance dashboards.
  4. Salesforce Agent – Connects to our CRM to answer questions about specific customer accounts, leads, sales pipelines, and opportunities.
  5. Troubleshooting Agent – Specializes in ID+ transaction errors, timeouts, rate-limiting, and diagnostics.
  6. Legal Agent – Responds to questions related to compliance, data residency, privacy programs, contractual terms, and regulated markets.

Each agent uses a combination of RAG with vector search, metadata filtering, and custom tools to reason about its domain.
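
The metadata-filtering step can be sketched in plain Python. This is an illustrative stand-in, not our production code: the `Doc` structure, domain names, and similarity scores are hypothetical, and a real agent would use a vector store with embeddings rather than an in-memory list.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    text: str
    domain: str   # metadata tag, e.g. "devhub", "legal"
    score: float  # similarity score from vector search

def retrieve(docs, domain, top_k=2, min_score=0.5):
    """Filter by domain metadata first, then rank by similarity score."""
    candidates = [d for d in docs if d.domain == domain and d.score >= min_score]
    return sorted(candidates, key=lambda d: d.score, reverse=True)[:top_k]

corpus = [
    Doc("API rate limits are 100 req/s", "devhub", 0.91),
    Doc("GDPR data residency policy", "legal", 0.88),
    Doc("SDK retry configuration", "devhub", 0.74),
    Doc("Old changelog entry", "devhub", 0.31),
]
print([d.text for d in retrieve(corpus, "devhub")])
```

Filtering on metadata before ranking keeps each agent's context small and domain-specific, which is the point of the specialization.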

LangGraph-Based Architecture

The following diagram illustrates the core flow of our LangGraph-based system:

[Diagram: core flow of the LangGraph-based multi-agent workflow]

First, we define the LangGraph state as follows:

from typing import Annotated, List, Sequence, Tuple, Union
from typing_extensions import TypedDict
import operator
from langchain.schema import BaseMessage
from langchain_core.agents import AgentAction, AgentFinish

class AgentState(TypedDict):
    """
    Represents the shared state passed between nodes in the LLM agent workflow graph.
    """

    # Unique identifier for the current session
    session_id: str
    # User input string
    input: str
    # Rephrased, standalone version of the input
    rephrased_input: str
    # The ID of the message associated with this agent run
    message_id: str
    # Identifier for the event type (e.g., question, feedback)
    event_type: str
    # User metadata
    user_id: str
    user_name: str
    # The designated task explicitly selected or inferred
    designated_task: str
    # Up to 3 preferred task names inferred from classification
    preferred_tasks: List[str]
    # The last task that produced an answer
    previous_task: str
    # Full chat history up to this point
    chat_history: List[BaseMessage]
    # Most recent LLM agent output (can be None at the initial state)
    agent_outcome: Union[AgentAction, AgentFinish, None]
    # Steps taken by the agent: (action, observation) pairs
    intermediate_steps: Annotated[List[Tuple[AgentAction, str]], operator.add]
    # Additional streamed/generated messages to be appended
    messages: Annotated[Sequence[BaseMessage], operator.add]

    # Route decision: which node to invoke next
    next: str

    # Final response data
    final_answer: str
    final_citations: List[str]
    final_confidence: str
    final_latency: float
    final_agent: str

In addition, we designed a LangGraph workflow that behaves like a task router and judge:

  1. Query Ingestion: A customer query comes in through Slack or the support portal. It is parsed, normalized, and sent to the LangGraph workflow.
  2. Receptionist Node: This node rephrases the user’s question into a standalone form based on the chat history and the user’s context.
  3. Supervisor Node (Intent Classifier): This node infers the customer’s intent using a fine-tuned LLM. Based on the topic (e.g., “Why am I getting HTTP 429 errors?”), it selects up to three most relevant agents.
  4. Parallel Agent Execution: The selected agents execute in parallel. Each agent retrieves relevant context (docs, dashboards, tickets) and generates a proposed answer.
  5. Answer Evaluation: An evaluation node compares the outputs and selects the most complete and correct answer. If no answer meets a confidence threshold, the case is escalated to a human agent.
  6. Answer Delivery + Feedback: The selected response is sent back to the support interface, with citations and an inline feedback module (👍 / 👎 / comment). This feedback is stored for future evaluation and fine-tuning.
  7. Chat History + Contextual Memory: We enrich queries with historical context (previous chats, support tickets, account config, usage patterns) to help agents answer better.
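
Steps 3 through 5 can be sketched, independently of LangGraph, as a supervisor that fans the selected agents out in parallel and keeps the most confident answer above a threshold. The agent functions and the 0.6 threshold below are illustrative assumptions, not our production values:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative agent functions; real agents perform RAG retrieval plus LLM calls.
def devhub_agent(q):
    return {"agent": "devhub", "answer": "Check API rate limits.", "confidence": 0.82}

def legal_agent(q):
    return {"agent": "legal", "answer": "See the DPA.", "confidence": 0.40}

AGENTS = {"devhub": devhub_agent, "legal": legal_agent}

def run_workflow(question, selected, threshold=0.6):
    # Step 4: run the selected agents in parallel
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda name: AGENTS[name](question), selected))
    # Step 5: keep the most confident answer, else escalate to a human
    best = max(results, key=lambda r: r["confidence"])
    return best if best["confidence"] >= threshold else {"agent": "human", "answer": None}

print(run_workflow("Why am I getting HTTP 429 errors?", ["devhub", "legal"]))
```

In the real system each node updates the shared `AgentState` shown earlier, and LangGraph handles the fan-out and routing between nodes.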

Technical Challenges and How We Solved Them

1. Building a Chatbot with Contextual Enrichment – Our context builder composes payloads from chat history, account metadata, and prior tickets, and uses that context to rephrase the user’s question into a standalone form (see sample code below).

from loguru import logger
from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.schema.output_parser import StrOutputParser

# `llm` (the chat model) and `analyze_input` (task/question extraction)
# are initialized elsewhere in the application.

def receptionist_node(state: dict) -> dict:
    """
    Rephrases the follow-up question into a standalone question using context
    from chat history.

    Args:
        state (dict): A dictionary containing:
            - 'user_name': str
            - 'chat_history': List of previous messages
            - 'input': str (follow-up question)

    Returns:
        dict: {
            'designated_task': str,
            'input': str (original input),
            'rephrased_input': str (standalone version)
        }
    """
    logger.info("> receptionist_node")
    logger.debug(f"state = {state}")

    # Analyze the user input to extract the task and the input question
    designated_task, designated_input = analyze_input(state["input"])

    # Define the system prompt template for rephrasing
    system_prompt = """
You are SocureBuddy, an expert assistant.

Given the user name, conversation history, and a follow-up question, your task is to rephrase the follow-up question so that it can stand alone without relying on the previous conversation for context.

Your rephrased output MUST:
1. Preserve all characters, punctuation, and formatting in the original follow-up question—including angle brackets, markdown, and URLs.
2. Not paraphrase or modify quoted content, URLs, or identifiers.
3. Follow this format exactly (including punctuation and spacing):
"My name is {user_name}, and my question is as below: {{standalone question}}"

User Name:
{user_name}

Conversation History:
{chat_history}

Follow-Up Question:
{input}

Standalone Question:
"""

    # Build the prompt and rephrasing chain
    prompt = ChatPromptTemplate.from_messages([
        ("system", system_prompt),
        MessagesPlaceholder(variable_name="chat_history"),
        ("human", "{input}"),
    ])

    rephrase_chain = prompt | llm | StrOutputParser()

    # Invoke the chain
    rephrased_input = rephrase_chain.invoke({
        "user_name": state["user_name"],
        "chat_history": state["chat_history"],
        "input": designated_input,
    })

    return {
        "designated_task": designated_task,
        "input": designated_input,
        "rephrased_input": rephrased_input,
    }

2. Inferring Intent to Select Relevant Agents – We trained a classifier on past support cases to map questions to categories, and we form a task force (i.e., a selected set of agents) to run against the question in parallel (see sample code below).

from loguru import logger
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import JsonOutputParser

# LLM-based selection of up to 3 tasks
tasks = [
    "TroubleshootingTask",
    "DevhubTask",
    "RiskosTask",
    "EnterpriseTask",
    "LegalTask",
    "BusinessIntelligenceTask",
    "SalesforceTask",
    "GenericTask",
]

def supervisor_node(state: dict) -> dict:
    """
    Determines which task(s) are most relevant to the user's question.

    Args:
        state (dict): The LangGraph state dictionary. Must contain:
            - 'rephrased_input': the clarified user question.
            - 'previous_task' (optional): a previously chosen task.

    Returns:
        dict: A dictionary with keys:
            - 'agent_outcome': type of the resulting agent action (placeholder).
            - 'preferred_tasks': List of selected task names.
    """
    logger.info("> supervisor_node")
    logger.debug(f"State: {state}")

    # Prompt template for task selection
    task_selection_prompt = ChatPromptTemplate.from_messages([
        ("system", """You are an expert assistant that selects which tasks are relevant to a user's question. You must choose from the following list of tasks:

{tasks}

## Instructions:
- Select the most relevant **1 to 3** tasks.
- If only 1 task is clearly relevant, return just that 1.
- Return 3 only if all are strongly relevant.
- If you are unsure which task applies, **default to ["EnterpriseTask"]**.
- Do NOT choose TroubleshootingTask unless the question includes a **transaction ID**.
- A transaction ID is a UUID of the sample forms below:
  - `39e51429-5b62-41c4-8357-25616b8fa704`
  - `d02bc876-7183-4435-8362-138af9a31483`
- Questions about **reason codes (e.g. R909, R907)** without a transaction ID should be routed to `DevhubTask` or `EnterpriseTask`.
- Questions about **PhoneRisk, EmailRisk, or AddressRisk** should be routed to **["DevhubTask", "EnterpriseTask"]**, unless the question specifically mentions **"riskos"**, in which case it should be routed to **["RiskosTask"]**.
- Never return more than 3 tasks.
- Do NOT include unrelated tasks just to fill space.
- Respond with a valid JSON list and no other text.

## Output Format:
Respond ONLY with a valid JSON list like:
["DevhubTask"]
["MarketingTask", "LegalTask"]
["EnterpriseTask", "SalesforceTask", "BusinessIntelligenceTask"]

Do NOT include any explanation, formatting, or prefixes.
"""),
        # ✳️ Few-shot example 1
        ("human", "What is reason code I998?"),
        ("assistant", '["DevhubTask", "EnterpriseTask"]'),

        # ✳️ Few-shot example 2
        ("human", "How many employees do we have?"),
        ("assistant", '["EnterpriseTask"]'),

        # ✳️ Few-shot example 3
        # ...

        # The actual question to classify
        ("human", "{question}"),
    ])

    # Create the chain with the task list pre-filled and a JSON output parser
    chain = (
        task_selection_prompt.partial(tasks=str(tasks))  # `tasks` is defined in the global context
        | llm
        | JsonOutputParser()
    )

    # Run inference
    question = state["rephrased_input"]
    previous_task = state.get("previous_task", "")
    preferred_tasks = chain.invoke({"question": question})

    # Ensure continuity with the previous task (if it was valid)
    if previous_task and previous_task not in preferred_tasks:
        preferred_tasks.append(previous_task)

    logger.info(f"[Supervisor] Preferred tasks: {preferred_tasks}")

    return {
        "agent_outcome": AgentAction,  # Placeholder; replaced with the actual outcome downstream
        "preferred_tasks": preferred_tasks,
    }
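
Because the classifier’s JSON output is not guaranteed to be well-formed, a defensive post-processing step helps before routing. The `sanitize_tasks` helper below is an illustrative sketch, not part of our production code; it validates task names against the known list, deduplicates, caps the result at three, and falls back to a default:

```python
VALID_TASKS = {
    "TroubleshootingTask", "DevhubTask", "RiskosTask", "EnterpriseTask",
    "LegalTask", "BusinessIntelligenceTask", "SalesforceTask", "GenericTask",
}

def sanitize_tasks(raw, previous_task="", fallback="EnterpriseTask", max_tasks=3):
    """Validate, dedupe, and cap the task list returned by the classifier."""
    seen = []
    for t in (raw or []) + ([previous_task] if previous_task else []):
        if t in VALID_TASKS and t not in seen:
            seen.append(t)
    return seen[:max_tasks] or [fallback]

print(sanitize_tasks(["DevhubTask", "DevhubTask", "BogusTask"], previous_task="LegalTask"))
```

This keeps a single malformed classification from breaking the downstream fan-out.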

3. Selecting the Best Answer – We apply evaluation heuristics and plan to fine-tune reward models for preference learning.

from typing import Dict
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser

def compare_results_with_llm(question: str, results: Dict[str, Dict[str, str]]) -> str:
    """
    Compares the candidate agents' answers and selects the best one based on
    accuracy, relevance, clarity, and confidence using an LLM chain.

    :param question: The original question, to provide context for the LLM.
    :param results: Dictionary mapping each agent name to its answer and confidence score.
    :return: The name of the agent whose answer is judged best.
    """
    agent_blocks = "\n\n".join([
        f"- {agent} Agent Answer: {data['answer']}\n"
        f"- {agent} Agent Confidence: {data['confidence']}"
        for agent, data in results.items()
    ])

    # Construct the system prompt with the question and the candidate answers
    system_prompt = """
You are an advanced AI tasked with evaluating and comparing answers from multiple AI agents to select the best response.
Your goal is to analyze the given question and the answers based on accuracy, relevance, clarity, and confidence scores, then decide which answer is superior.

Question: {question}

{agent_blocks}

# Criteria for Evaluation
- **Accuracy**: Assess the correctness of the information presented in each answer.
- **Relevance**: Evaluate how well each answer addresses the question asked.
- **Clarity**: Consider how clearly and fluently each answer is expressed.
- **Confidence Score**: How confident is the AI agent in its answer?

# Decision Rule
- Exclude any answers that are empty.
- Prioritize the answer with the highest confidence score if it exceeds the others by a significant margin (e.g., at least 0.2 higher).
- Favor answers that demonstrate greater certainty or precision through **quantitative figures**, **specific examples**, or **detailed evidence**.
- Prefer detailed and in-depth responses over vague or general ones.
- Select the most accurate, relevant, and clear answer when one stands out based on these criteria.
- In cases where no answer is clearly superior, choose the longest response.

# Output Format
Respond ONLY with the agent name (e.g., "devhub", "marketing", "glean", "generic", "legal", "troubleshooting", "ironclad").

**Do not include any additional text or explanation in your response.**
"""

    # Add a user message to start the conversation, per Bedrock requirements
    prompt = ChatPromptTemplate.from_messages([
        ("system", system_prompt),
        ("human", "Which agent's answer is best?"),
    ])

    llm_chain = prompt | llm | StrOutputParser()

    # Pass the prompt variables explicitly and return the best agent as a string
    evaluation_result = llm_chain.invoke({
        "question": question,
        "agent_blocks": agent_blocks,
    })

    return evaluation_result

Evaluation, Oversight, and Feedback

A human-in-the-loop workflow is essential in regulated and high-stakes domains like identity verification. We designed our system with three layers of safety:

  1. Confidence Thresholds: Only high-confidence responses are sent directly to users. Others are routed to support reps for review.
  2. Evaluation Service: We log each agent’s answers, reasoning steps, and feedback. This allows our QA team to retrain models and refine agent scopes.
  3. Thumbs-Up / Down and Commentary Feedback: Users can give real-time feedback per response. This is logged and tied to conversation history.
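
The confidence-threshold routing in layer 1 can be sketched as a small routing function. The label ordering and the `threshold` default below are illustrative assumptions (the state carries `final_confidence` as a string):

```python
def route_response(final_confidence: str, threshold: str = "high") -> str:
    """Send high-confidence answers to the user; route the rest for human review."""
    order = {"low": 0, "medium": 1, "high": 2}
    # Unknown labels are treated as lowest confidence, so they are never auto-sent.
    meets = order.get(final_confidence, 0) >= order[threshold]
    return "send_to_user" if meets else "human_review"

print(route_response("high"))
print(route_response("medium"))
```

Treating unrecognized confidence labels as low keeps the failure mode conservative: anything ambiguous goes to a human.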

Here is a sample UI showing the answer along with its citations, confidence score, and feedback options.

[Screenshot: sample answer UI with citations, confidence score, and feedback options]

Results So Far

Since launching this framework in beta for our internal support team, we’ve seen:

  • 50%+ automation of routine support tickets
  • 70% faster response times for high-volume FAQs
  • Improved customer satisfaction scores (via post-case surveys)
  • Increased support team capacity without headcount growth

Our team now focuses on strategic and edge cases instead of repetitive tasks.

What’s Next: From Static Agents to Reasoning Workflows

We’re integrating advanced reasoning models:

ReWOO: Plan First, Then Execute

Some questions can’t be answered in a single step (e.g., “Why did this customer’s transaction fail in the last 24 hours?”). We’re exploring ReWOO—a reasoning model that first plans a multi-step workflow and then executes each step with specialized agents. Each agent contributes a sub-answer, which is later summarized.
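
A minimal sketch of the plan-then-execute pattern, with a static plan and hypothetical tools standing in for the LLM planner and our real agents:

```python
# ReWOO-style sketch: plan every step up front, then execute the steps in
# order, substituting earlier evidence (#E1, #E2, ...) into later steps.
def plan(question):
    # A real planner would be an LLM call; this static plan is illustrative.
    return [
        ("#E1", "salesforce", "Look up the customer's account ID"),
        ("#E2", "troubleshooting", "Fetch failed transactions for #E1 in the last 24h"),
    ]

def execute(plan_steps, tools):
    evidence = {}
    for var, tool, instruction in plan_steps:
        # Substitute previously gathered evidence into the instruction
        for k, v in evidence.items():
            instruction = instruction.replace(k, v)
        evidence[var] = tools[tool](instruction)
    return evidence

tools = {
    "salesforce": lambda q: "acct-42",
    "troubleshooting": lambda q: f"3 timeouts found ({q})",
}
result = execute(plan(None), tools)
print(result["#E2"])
```

Because the plan is fixed before execution, each step's LLM context stays small and the tool calls can be audited ahead of time.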

ReAct: Think Step-by-Step

We also plan to implement ReAct, a framework where the LLM reasons and acts iteratively. This allows it to call tools, perform lookups, reason about outcomes, and repeat until the final solution is reached.
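
A minimal sketch of the ReAct loop, with a hard-coded "thought" function standing in for the LLM and a single hypothetical tool:

```python
def decide_next_step(question, scratchpad):
    # Stand-in for an LLM thought: first look something up, then answer.
    if not scratchpad:
        return {"type": "action", "tool": "docs_lookup", "input": "rate limit"}
    last_observation = scratchpad[-1][1]
    return {"type": "final", "answer": f"Per the docs: {last_observation}"}

def react_loop(question, tools, max_steps=5):
    """Alternate thought -> action -> observation until a final answer emerges."""
    scratchpad = []
    for _ in range(max_steps):
        step = decide_next_step(question, scratchpad)
        if step["type"] == "final":
            return step["answer"]
        observation = tools[step["tool"]](step["input"])
        scratchpad.append((step, observation))
    # Bound the loop so a confused agent cannot spin forever
    return "escalate_to_human"

tools = {"docs_lookup": lambda q: "limit is 100 requests/second"}
print(react_loop("What is the API rate limit?", tools))
```

The `max_steps` bound matters in production: an iterative agent needs a hard stop and an escalation path when it cannot converge.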

Final Thoughts

At Socure, we believe that customer support is a key product surface. With the power of LangGraph, AWS Bedrock, and domain-specific LLM agents, we’re transforming our support model from reactive to proactive, from manual to intelligent.

This approach enables us to scale customer service without compromising quality, even as our platform—and our customer base—continue to grow. We’re just getting started.

Posted by Tao Tao

Tao Tao is a Principal Engineer at Socure and a former Googler and IBM researcher with over 15 years of experience in advanced software research and development. He has driven key initiatives in Generative AI, security engineering, and data privacy. Tao holds over 12 U.S. patents and has authored more than 20 research and technical papers.