Speakeasy is an ecosystem designed for evaluating and comparing conversational agents (bots). It provides a web-based platform that generates chatrooms where humans and conversational agents can interact, enabling a Turing-test-like evaluation. Speakeasy also includes a set of artificial agents, called Assessor Bots, that assist in the development and evaluation of conversational bots. Speakeasy was designed and developed to support AI classes and evaluation campaigns in the context of hackathons and competitions.
Speakeasy has been successfully deployed in educational settings since 2021, supporting a Master's course at the University of Zurich. Conversational agents developed in this context have been evaluated through the platform with comprehensive feedback collection and analysis, and Speakeasy continues to evolve to meet the needs of educational settings and conversational-AI research.
Speakeasy pairs users (humans and bots), creates chatrooms, facilitates the collection of human feedback, and provides tools for analyzing conversation data. The platform is highly configurable, allowing conversations to be created with customizable parameters (e.g., the number of bots a human interacts with in parallel and the duration of an evaluation round). The platform supports multimodal content, including text- and image-based interactions. To preserve privacy, all conversations are anonymized with unique session names.
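As a rough sketch of what such round-level configuration might look like, the snippet below models the parameters mentioned above; the class and field names are illustrative assumptions, not Speakeasy's actual configuration schema.

```python
from dataclasses import dataclass

# Hypothetical model of per-round evaluation settings; all names here are
# illustrative, not the platform's real configuration keys.
@dataclass
class RoundConfig:
    bots_per_human: int = 3          # bots a human interacts with in parallel
    round_minutes: int = 10          # duration of one evaluation round
    allow_images: bool = True        # multimodal (text- and image-based) content
    anonymize_sessions: bool = True  # unique session names hide real identities

# Example: a shorter round with two bots per human.
config = RoundConfig(bots_per_human=2, round_minutes=15)
print(config)
```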
Speakeasy provides comprehensive feedback mechanisms at both the message level (like/dislike/star) and the conversation level, through customizable forms with Likert-scale and open-ended questions.
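To make the two feedback levels concrete, here is a minimal sketch of how the message-level reaction vocabulary and a conversation-level form could be represented; the schema is our assumption, and only the feedback types come from the text.

```python
# Fixed message-level reactions mentioned above.
REACTIONS = {"like", "dislike", "star"}

# Hypothetical conversation-level form combining Likert-scale and
# open-ended questions; the structure is illustrative, not Speakeasy's own.
LIKERT_SCALE = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]

feedback_form = {
    "likert": [
        "The bot's answers were accurate.",
        "The bot's answers were complete.",
        "The bot responded in a timely manner.",
    ],
    "open_ended": [
        "What did the bot handle particularly well or poorly?",
    ],
}

for question in feedback_form["likert"]:
    print(question, "->", " / ".join(LIKERT_SCALE))
```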
Speakeasy automates evaluation through three types of Assessor Bots. The Tester Bot supports students during the development of their bots by suggesting questions and evaluating the bot's answers. The Assistant Bot assists human evaluators with sample questions and expected answers to facilitate the evaluation of conversational agents. Finally, the Evaluator Bot conducts autonomous evaluations of a conversational agent's performance.
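The sketch below illustrates the Tester-Bot idea in its simplest form: probe a candidate bot with reference questions and score its answers. The question set and the substring-based scoring are our assumptions for illustration; the actual Assessor Bots are part of the Speakeasy platform.

```python
# Hypothetical reference question/answer pairs (movie-themed, in line with
# the platform's current datasets).
REFERENCE_QA = [
    ("Who directed The Godfather?", "Francis Ford Coppola"),
    ("When was the movie Titanic released?", "1997"),
]

def test_bot(ask):
    """`ask` is any callable mapping a question string to an answer string."""
    hits = 0
    for question, expected in REFERENCE_QA:
        answer = ask(question)
        ok = expected.lower() in answer.lower()  # naive substring check
        hits += ok
        print(f"{'PASS' if ok else 'FAIL'}: {question!r} -> {answer!r}")
    print(f"score: {hits}/{len(REFERENCE_QA)}")

# Trivial stand-in bot to demonstrate the loop:
test_bot(lambda q: "Francis Ford Coppola" if "Godfather" in q else "I don't know")
```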
Speakeasy features a customizable data-generation pipeline that prepares suitable evaluation datasets from an input data source. Currently, the pipeline focuses on Wikidata and movie-related sources.
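Assuming the pipeline's first step retrieves facts from Wikidata's public SPARQL endpoint, a minimal sketch of turning movie facts into question/answer pairs could look as follows; the query and the QA template are ours, not the pipeline's actual code.

```python
import requests

# Fetch ten films and their directors from Wikidata.
SPARQL = """
SELECT ?filmLabel ?directorLabel WHERE {
  ?film wdt:P31 wd:Q11424;   # instance of: film
        wdt:P57 ?director.   # director
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": SPARQL, "format": "json"},
    headers={"User-Agent": "speakeasy-data-sketch/0.1"},
    timeout=30,
)
for row in resp.json()["results"]["bindings"]:
    film = row["filmLabel"]["value"]
    director = row["directorLabel"]["value"]
    # Turn each fact into a question/answer pair for bot evaluation.
    print(f"Q: Who directed {film}?  A: {director}")
```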
The platform has been successfully used in educational settings over the last four years, helping students develop and evaluate conversational agents. In a class aimed at teaching AI technologies, students were encouraged to implement artificial conversational agents capable of holding conversations with humans and answering natural-language questions. Students and bots were then paired in a large-scale evaluation run in Speakeasy, allowing students to discover, assess, and learn from the bots implemented by their peers.
Speakeasy provides a robust platform for evaluating conversational agents in a controlled environment. Researchers can use Speakeasy to conduct scientific evaluations comparing, for instance, the performance of multiple third-party LLMs on a specific task.
Speakeasy provides an easy-to-set-up GUI where bot developers can test conversations between their bot and human users. Bots are integrated into the platform via a straightforward API.
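As an illustration of what such an integration might look like, the polling loop below sketches a bot that fetches new messages and posts replies; the base URL, endpoint paths, and payloads are placeholder assumptions, so consult the platform's API documentation for the actual routes.

```python
import time
import requests

BASE = "https://speakeasy.example.org/api"  # placeholder URL, not the real host

def answer(text: str) -> str:
    """Your bot's logic goes here; this echo is just a stand-in."""
    return f"You said: {text}"

def run_bot(session_token: str) -> None:
    headers = {"Authorization": f"Bearer {session_token}"}
    while True:
        # Hypothetical endpoint listing the bot's active chatrooms.
        rooms = requests.get(f"{BASE}/rooms", headers=headers, timeout=10).json()
        for room in rooms:
            for msg in room.get("new_messages", []):
                reply = answer(msg["text"])
                # Hypothetical endpoint for posting a message to a room.
                requests.post(
                    f"{BASE}/rooms/{room['id']}/messages",
                    json={"text": reply},
                    headers=headers,
                    timeout=10,
                )
        time.sleep(2)  # poll every two seconds
```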
Human evaluators can participate in conversations with bots, provide feedback, and use the customizable survey system to assess the agents' performance across multiple dimensions; for example, we have evaluated bots in terms of accuracy, completeness, timeliness, and humanness, among other dimensions.
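Once such survey responses are collected, per-dimension aggregation is straightforward; the sketch below uses fabricated placeholder scores purely to demonstrate the computation, not real evaluation results.

```python
from statistics import mean

# Placeholder Likert responses (1-5) per evaluation dimension; these
# numbers are illustrative input, not data from an actual evaluation.
responses = {
    "accuracy":     [4, 5, 3, 4],
    "completeness": [3, 4, 4, 2],
    "timeliness":   [5, 5, 4, 5],
    "humanness":    [2, 3, 3, 2],
}

for dimension, scores in responses.items():
    print(f"{dimension:>12}: mean={mean(scores):.2f} (n={len(scores)})")
```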