Speakeasy is an ecosystem designed for evaluating and comparing conversational agents (bots). It provides a web-based platform that generates chatrooms where humans and conversational agents can interact, enabling a Turing-test-like evaluation. Speakeasy also includes a set of artificial agents, called Assessor Bots, that assist in the development and evaluation of conversational bots. Speakeasy was designed and developed to support AI classes and evaluation campaigns in the context of hackathons and competitions.
Speakeasy has been successfully deployed in educational settings since 2021, supporting a Master's course at the University of Zurich. Conversational agents developed in this context have been evaluated through the platform with comprehensive feedback collection and analysis, and Speakeasy continues to evolve to meet the needs of educational settings and conversational-AI research.
Speakeasy pairs users (humans and bots), creates chatrooms, facilitates the collection of human feedback, and provides tools for analyzing conversation data. The platform is highly configurable, allowing conversations to be created with customizable parameters (e.g., the number of bots a human interacts with in parallel and the duration of an evaluation round). The platform supports multimodal content, including text- and image-based interactions. To preserve privacy, all conversations are anonymized with unique session names.
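As a rough sketch of what such round-level configuration might look like, the snippet below models the parameters mentioned above; the class and field names are illustrative assumptions, not Speakeasy's actual configuration schema.

```python
from dataclasses import dataclass

# Hypothetical model of per-round evaluation settings; all names here are
# illustrative, not the platform's real configuration keys.
@dataclass
class RoundConfig:
    bots_per_human: int = 3          # bots a human interacts with in parallel
    round_minutes: int = 10          # duration of one evaluation round
    allow_images: bool = True        # multimodal (text- and image-based) content
    anonymize_sessions: bool = True  # unique session names hide real identities

# Example: a shorter round with two bots per human.
config = RoundConfig(bots_per_human=2, round_minutes=15)
print(config)
```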
Speakeasy provides comprehensive feedback mechanisms at both the message level (like/dislike/star) and the conversation level, through customizable forms with Likert-scale and open-ended questions.
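To make the two feedback levels concrete, here is a minimal sketch of how the message-level reaction vocabulary and a conversation-level form could be represented; the schema is our assumption, and only the feedback types come from the text.

```python
# Fixed message-level reactions mentioned above.
REACTIONS = {"like", "dislike", "star"}

# Hypothetical conversation-level form combining Likert-scale and
# open-ended questions; the structure is illustrative, not Speakeasy's own.
LIKERT_SCALE = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]

feedback_form = {
    "likert": [
        "The bot's answers were accurate.",
        "The bot's answers were complete.",
        "The bot responded in a timely manner.",
    ],
    "open_ended": [
        "What did the bot handle particularly well or poorly?",
    ],
}

for question in feedback_form["likert"]:
    print(question, "->", " / ".join(LIKERT_SCALE))
```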
Speakeasy automates evaluation through three types of Assessor Bots. The Tester Bot supports students during the development of their bots by suggesting questions and evaluating the bot's answers. The Assistant Bot assists human evaluators with sample questions and expected answers to facilitate the evaluation of conversational agents. Finally, the Evaluator Bot conducts autonomous evaluations of a conversational agent's performance.
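The sketch below illustrates the Tester-Bot idea in its simplest form: probe a candidate bot with reference questions and score its answers. The question set and the substring-based scoring are our assumptions for illustration; the actual Assessor Bots are part of the Speakeasy platform.

```python
# Hypothetical reference question/answer pairs (movie-themed, in line with
# the platform's current datasets).
REFERENCE_QA = [
    ("Who directed The Godfather?", "Francis Ford Coppola"),
    ("When was the movie Titanic released?", "1997"),
]

def test_bot(ask):
    """`ask` is any callable mapping a question string to an answer string."""
    hits = 0
    for question, expected in REFERENCE_QA:
        answer = ask(question)
        ok = expected.lower() in answer.lower()  # naive substring check
        hits += ok
        print(f"{'PASS' if ok else 'FAIL'}: {question!r} -> {answer!r}")
    print(f"score: {hits}/{len(REFERENCE_QA)}")

# Trivial stand-in bot to demonstrate the loop:
test_bot(lambda q: "Francis Ford Coppola" if "Godfather" in q else "I don't know")
```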
Speakeasy features a customizable data-generation pipeline that prepares suitable evaluation datasets from an input data source. Currently, the pipeline focuses on Wikidata and movie-related sources.
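Assuming the pipeline's first step retrieves facts from Wikidata's public SPARQL endpoint, a minimal sketch of turning movie facts into question/answer pairs could look as follows; the query and the QA template are ours, not the pipeline's actual code.

```python
import requests

# Fetch ten films and their directors from Wikidata.
SPARQL = """
SELECT ?filmLabel ?directorLabel WHERE {
  ?film wdt:P31 wd:Q11424;   # instance of: film
        wdt:P57 ?director.   # director
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": SPARQL, "format": "json"},
    headers={"User-Agent": "speakeasy-data-sketch/0.1"},
    timeout=30,
)
for row in resp.json()["results"]["bindings"]:
    film = row["filmLabel"]["value"]
    director = row["directorLabel"]["value"]
    # Turn each fact into a question/answer pair for bot evaluation.
    print(f"Q: Who directed {film}?  A: {director}")
```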
The platform has been successfully used in educational settings over the last four years, helping students develop and evaluate conversational agents. In a class aimed at teaching AI technologies, students were encouraged to implement artificial conversational agents capable of holding conversations with humans and answering natural-language questions. Students and bots were then paired in a large-scale evaluation run in Speakeasy, allowing students to discover, assess, and learn from the bots implemented by their peers.
Speakeasy provides a robust platform for evaluating conversational agents in a controlled environment. Researchers can use Speakeasy to conduct scientific evaluations comparing, for instance, the performance of multiple third-party LLMs on a specific task.
Speakeasy provides an easy-to-set-up GUI where bot developers can test conversations between their bot and human users. Bots are integrated into the platform via a straightforward API.
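As an illustration of what such an integration might look like, the polling loop below sketches a bot that fetches new messages and posts replies; the base URL, endpoint paths, and payloads are placeholder assumptions, so consult the platform's API documentation for the actual routes.

```python
import time
import requests

BASE = "https://speakeasy.example.org/api"  # placeholder URL, not the real host

def answer(text: str) -> str:
    """Your bot's logic goes here; this echo is just a stand-in."""
    return f"You said: {text}"

def run_bot(session_token: str) -> None:
    headers = {"Authorization": f"Bearer {session_token}"}
    while True:
        # Hypothetical endpoint listing the bot's active chatrooms.
        rooms = requests.get(f"{BASE}/rooms", headers=headers, timeout=10).json()
        for room in rooms:
            for msg in room.get("new_messages", []):
                reply = answer(msg["text"])
                # Hypothetical endpoint for posting a message to a room.
                requests.post(
                    f"{BASE}/rooms/{room['id']}/messages",
                    json={"text": reply},
                    headers=headers,
                    timeout=10,
                )
        time.sleep(2)  # poll every two seconds
```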
Human evaluators can participate in conversations with bots, provide feedback, and use the customizable survey system to assess the agents' performance across multiple dimensions; for example, we have evaluated bots in terms of accuracy, completeness, timeliness, and humanness, among other dimensions.
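Once such survey responses are collected, per-dimension aggregation is straightforward; the sketch below uses fabricated placeholder scores purely to demonstrate the computation, not real evaluation results.

```python
from statistics import mean

# Placeholder Likert responses (1-5) per evaluation dimension; these
# numbers are illustrative input, not data from an actual evaluation.
responses = {
    "accuracy":     [4, 5, 3, 4],
    "completeness": [3, 4, 4, 2],
    "timeliness":   [5, 5, 4, 5],
    "humanness":    [2, 3, 3, 2],
}

for dimension, scores in responses.items():
    print(f"{dimension:>12}: mean={mean(scores):.2f} (n={len(scores)})")
```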