OpenAI Launches DALL-E 3 API and Text-to-Speech Models

OpenAI, the leading artificial intelligence research lab, has announced the launch of its latest offerings, the DALL-E 3 API and the Text-to-Speech API. These new tools were unveiled during OpenAI’s first-ever developer day, showcasing the company’s commitment to providing cutting-edge AI technologies to developers and users worldwide.

Key Takeaway

OpenAI has launched the DALL-E 3 API, expanding its text-to-image capabilities, and the Text-to-Speech API, providing developers with natural-sounding voices for various applications. The DALL-E 3 API lacks some features of its predecessor, DALL-E 2, such as image editing and variations, while the Text-to-Speech API is aimed at improving the user experience of voice-driven applications. Developers using OpenAI’s APIs must inform users that the content is AI-generated. OpenAI also introduced the latest version of Whisper, its open-source automatic speech recognition model, which the company says performs better across languages.

Introducing the DALL-E 3 API

The DALL-E 3 API, an extension of OpenAI’s text-to-image model, is now available for developers to access. After its initial integration with ChatGPT and Bing Chat, DALL-E 3 can now be utilized as an independent API. OpenAI has also implemented a built-in moderation feature in the API to safeguard against misuse.

The DALL-E 3 API offers a range of format and quality options, including resolutions from 1024×1024 to 1792×1024. However, it currently has some limitations compared with its predecessor, DALL-E 2. For instance, users cannot create edited versions of images or generate variations of existing images with DALL-E 3. Additionally, OpenAI automatically rewrites generation requests, both for safety and to add detail, which may result in slightly less precise outcomes depending on the prompt.
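
For developers who want a feel for what a request might look like, here is a minimal sketch in Python using only the standard library. It is not taken from OpenAI's announcement: the endpoint URL, the JSON field names, the "standard" quality value, the portrait size 1024x1792, and the revised_prompt response field are assumptions based on OpenAI's public API reference and should be checked against the current documentation. The helper names build_image_request and generate_image are hypothetical.

```python
import json
import os
import urllib.request

# Assumed endpoint; verify against OpenAI's current API reference.
IMAGES_URL = "https://api.openai.com/v1/images/generations"

# The article cites 1024x1024 and 1792x1024; the portrait size is an assumption.
ALLOWED_SIZES = {"1024x1024", "1792x1024", "1024x1792"}


def build_image_request(prompt, size="1024x1024", quality="standard", api_key=""):
    """Build (but do not send) a DALL-E 3 image generation request."""
    if size not in ALLOWED_SIZES:
        raise ValueError(f"unsupported size: {size}")
    payload = {
        "model": "dall-e-3",
        "prompt": prompt,
        "size": size,
        "quality": quality,  # assumed field and value
        "n": 1,
    }
    return urllib.request.Request(
        IMAGES_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def generate_image(prompt, **options):
    """Send the request and return (image_url, revised_prompt)."""
    request = build_image_request(
        prompt, api_key=os.environ["OPENAI_API_KEY"], **options
    )
    with urllib.request.urlopen(request) as response:
        item = json.load(response)["data"][0]
    # revised_prompt is assumed to carry the automatically rewritten prompt.
    return item["url"], item.get("revised_prompt")
```

Calling generate_image("a lighthouse at dusk", size="1792x1024") with a valid OPENAI_API_KEY set would send the request. Comparing the returned rewritten prompt with the original is one way to see how much the automatic rewriting changed the request.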

Introducing the Text-to-Speech API

In addition to the DALL-E 3 API, OpenAI has launched the Text-to-Speech API, also known as the Audio API. This API provides developers with six preset voices to choose from: Alloy, Echo, Fable, Onyx, Nova, and Shimmer. OpenAI claims that the generated audio from this API sounds exceptionally natural, improving the user experience of various applications, such as language learning and voice assistance.

However, unlike some other speech synthesis platforms, OpenAI’s Text-to-Speech API does not offer control over the emotional affect of the generated audio. OpenAI acknowledges that certain factors, such as capitalization and grammar in the text being read aloud, may impact the voice’s sound. OpenAI’s internal tests in this regard have yielded mixed results.
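
A speech request can be sketched in Python along the same lines, again with only the standard library. The six voice names come from the announcement. The endpoint URL, the "tts-1" model name, the JSON field names, and the assumption that the response body is MP3 audio are based on OpenAI's public API reference and should be verified before use. The helper names build_speech_request and synthesize are hypothetical.

```python
import json
import os
import urllib.request

# Assumed endpoint; verify against OpenAI's current API reference.
SPEECH_URL = "https://api.openai.com/v1/audio/speech"

# The six preset voices named in the announcement, in lowercase.
VOICES = ("alloy", "echo", "fable", "onyx", "nova", "shimmer")


def build_speech_request(text, voice="alloy", api_key=""):
    """Build (but do not send) a text-to-speech request."""
    if voice not in VOICES:
        raise ValueError(f"unknown voice: {voice}")
    # There is deliberately no emotion or tone field: the voice is the
    # only stylistic control this sketch assumes.
    payload = {"model": "tts-1", "input": text, "voice": voice}
    return urllib.request.Request(
        SPEECH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(text, path="speech.mp3", voice="alloy"):
    """Send the request and write the returned audio bytes to a file."""
    request = build_speech_request(
        text, voice=voice, api_key=os.environ["OPENAI_API_KEY"]
    )
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        out.write(response.read())
    return path
```

Because the only levers are the voice and the input text itself, a developer who wants a different delivery would have to experiment with punctuation or capitalization in the text, with the mixed results OpenAI describes.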

Developer Requirements and Additional Announcements

Developers utilizing OpenAI’s APIs, including the DALL-E 3 API and the Text-to-Speech API, are required to inform users that the content is generated by AI.

In a related announcement, OpenAI has introduced the latest version of its open-source automatic speech recognition model, Whisper large-v3. The company claims that this new model offers improved performance across multiple languages. The Whisper large-v3 model is available on GitHub under a permissive license.