Tags: feature-request, ux, integration
Quality Rating: ⭐ 7/10
Reporter: Yutong Zhan
Description
Currently, digital employees (agents) in Clawith primarily interact through text-based input and output. This feature request proposes adding multimodal support for both input and output, enabling richer and more natural interactions between users and digital employees.
Multimodal Input
- Images: Allow users to send images (screenshots, photos, diagrams) directly to digital employees for analysis, understanding, or processing.
- Voice/Audio: Support voice messages or audio file inputs that digital employees can transcribe and understand.
- Files/Documents: Enhanced support for a wider range of file formats as direct conversational input (a possible message shape is sketched after this list).
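As a purely illustrative sketch of what a multimodal input message could look like: the request does not specify a schema, so every type and field name below is an assumption, not part of Clawith's existing API.

```typescript
// Hypothetical content-part union; none of these names come from
// Clawith's actual API, they only illustrate the requested shape.
type ContentPart =
  | { kind: "text"; text: string }
  | { kind: "image"; mimeType: string; data: string }  // base64 payload or URL
  | { kind: "audio"; mimeType: string; data: string }  // voice message or audio file
  | { kind: "file"; fileName: string; mimeType: string; data: string };

interface AgentMessage {
  role: "user" | "agent";
  parts: ContentPart[];  // a single message can mix modalities
}

// Example: a user attaches an error screenshot alongside a text question.
const screenshotReport: AgentMessage = {
  role: "user",
  parts: [
    { kind: "text", text: "This error appears on startup; any idea why?" },
    { kind: "image", mimeType: "image/png", data: "iVBORw0KGgo..." },
  ],
};
```

A flat list of typed parts keeps text-only messages unchanged (a single text part) while letting richer clients attach additional modalities without a separate upload flow.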
Multimodal Output
- Image Generation: Enable digital employees to generate and return images, charts, diagrams, and visualizations as part of their responses.
- Audio/Voice: Support text-to-speech or audio output for accessibility and convenience.
- Rich Media: Allow digital employees to compose and return rich content combining text, images, tables, and other media formats (see the response sketch after this list).
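On the output side, the same hypothetical ContentPart union from the input sketch could let a digital employee compose a mixed-media reply; again, all names here are illustrative assumptions rather than existing Clawith types.

```typescript
// Example: a rich multimodal response reusing the hypothetical
// AgentMessage and ContentPart types from the input sketch above.
const analysisReply: AgentMessage = {
  role: "agent",
  parts: [
    { kind: "text", text: "Revenue grew 12% quarter over quarter; see the chart." },
    { kind: "image", mimeType: "image/svg+xml", data: "<svg>...</svg>" },  // generated chart
    { kind: "audio", mimeType: "audio/mpeg", data: "SUQz..." },            // optional TTS rendering
  ],
};
```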
Use Cases
- A user sends a screenshot of an error to a digital employee for troubleshooting.
- A digital employee generates a chart or diagram to visualize data analysis results.
- Voice-based interaction for hands-free scenarios.
- A digital employee returns annotated images or design mockups as part of its workflow.
Expected Behavior
Digital employees should be able to receive, process, and generate content in multiple modalities (text, image, audio, video, files), providing a more versatile and human-like interaction experience.
Additional Context
This enhancement would significantly improve the usability and capability of Clawith digital employees, bringing them in line with modern AI assistant platforms that already support multimodal interaction.