Hi there! In this post, we're going to explore the leading approaches to collecting speech data for training speech recognition systems in 2023. The right data collection strategy is crucial for building accurate models that can handle diverse real-world speech.
Let's look at the five key methods:
- Licensed Pre-packaged Datasets
- Public Open-Source Datasets
- Crowdsourced Custom Collections
- Customer Contributed Data
- In-house Data Collection
Now, let's dive into the details of each approach…
1. Licensed Pre-packaged Speech Datasets
Purchasing a pre-made speech dataset from a vendor is a simple way to get starter data for training speech recognition models. Commercial providers such as Appen license off-the-shelf datasets in many languages, while ready-to-download corpora such as Mozilla Common Voice and VoxForge serve the same starter role but are free and openly licensed rather than sold.
For example, the VoxForge corpus contains over 200,000 recordings annotated with text transcripts. According to its creators, it helps developers "bootstrap" their projects.
Pros:
- Requires minimal effort – data is ready to download and use
- Covers common vocabulary and language variations
- Some datasets are neatly organized by demographic factors
Cons:
- Lacks diversity needed for robust real-world applications
- Not customizable for unique industry/product vocabularies
- Quality and size vary significantly across datasets
For basic prototypes or non-commercial projects, pre-packaged speech data offers an easy way to get started. But real-world systems require more varied, higher-quality data.
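One quick way to judge whether a pre-packaged dataset covers your product's vocabulary is to measure how many of your domain terms actually appear in its transcripts. The following is a minimal sketch, assuming the transcripts are available as plain strings and the domain terms are single words; the function name `vocabulary_coverage` and the sample terms are hypothetical.

```python
import re


def vocabulary_coverage(transcripts, domain_terms):
    """Return (coverage_ratio, missing_terms) for single-word domain terms.

    Assumes `transcripts` is an iterable of strings and `domain_terms`
    is an iterable of single words. Matching is case-insensitive.
    """
    seen = set()
    for text in transcripts:
        # Lowercase, then keep runs of letters, digits, and apostrophes as words.
        seen.update(re.findall(r"[a-z0-9']+", text.lower()))

    terms = {term.lower() for term in domain_terms}
    missing = sorted(terms - seen)
    # An empty term list counts as fully covered.
    coverage = 1.0 if not terms else (len(terms) - len(missing)) / len(terms)
    return coverage, missing


if __name__ == "__main__":
    sample_transcripts = [
        "Please refill my prescription",
        "What is the dosage for ibuprofen",
    ]
    sample_terms = ["prescription", "dosage", "copay", "ibuprofen"]
    ratio, missing = vocabulary_coverage(sample_transcripts, sample_terms)
    print(f"coverage={ratio:.2f}, missing={missing}")
```

A low coverage ratio, or a long list of missing terms, suggests the dataset will need to be supplemented with custom recordings.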
2. Public Open-Source Speech Datasets
Many research labs and tech giants like Google, Facebook, and IBM have released free, openly licensed speech datasets to advance speech recognition capabilities.
Some of these are collected through open contribution platforms. For example, Mozilla's Common Voice project has crowdsourced tens of thousands of hours of voice data in more than 100 languages from volunteers on its website.
Pros:
- Free to access, though license terms vary by dataset
- Useful for pre-training models or proofs of concept
Cons:
- Inconsistent recording quality
- Little control over speaker demographics
- Requires work to clean and prepare data before use
While open datasets have fueled innovations in speech recognition, relying solely on them is risky for production-ready systems, because their speakers and recording conditions rarely match the target deployment.
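The cleaning and preparation work mentioned above usually starts with the dataset manifest. Here is a minimal sketch, assuming each manifest entry is a dict with the hypothetical keys `path`, `duration` (in seconds), and `text`; the duration limits are illustrative defaults, not recommended values.

```python
import re
import unicodedata


def normalize_transcript(text):
    """Lowercase, strip punctuation (keeping apostrophes), collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_manifest(entries, min_seconds=1.0, max_seconds=20.0):
    """Drop unusable clips and return entries with normalized transcripts.

    A clip is dropped if its transcript is empty after normalization,
    its duration is outside [min_seconds, max_seconds], or its path
    has already been seen (duplicate entry).
    """
    cleaned = []
    seen_paths = set()
    for entry in entries:
        text = normalize_transcript(entry.get("text", ""))
        duration = entry.get("duration", 0.0)
        if not text:
            continue
        if not (min_seconds <= duration <= max_seconds):
            continue
        if entry["path"] in seen_paths:
            continue
        seen_paths.add(entry["path"])
        cleaned.append({"path": entry["path"], "duration": duration, "text": text})
    return cleaned
```

Real pipelines typically add further steps, such as resampling audio and filtering by validation votes, but the same filter-then-normalize shape applies.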
3. Crowdsourced Custom Speech Data Collection
For more control over speech data variety and quality, many companies partner with crowdsourcing platforms that recruit and record contributors to create customized datasets.
Appen, for instance, reports crowdsourced speech data covering hundreds of languages and dialects. Its contributors record audio on their own devices in environments selected to match the target use case.
Pros:
- Build datasets aligned to required vocabulary, speaker demographics and languages
- Cost-effective way to gather large, diverse corpora
- Managed end-to-end by an external company with quality review
Cons:
- Less control over recording setups
- More effort to ensure data security and contributor privacy
- Need processes to verify data quality
Overall, crowdsourcing offers a scalable method to get high-quality speech data tailored to your application's needs.
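Because contributors record on their own devices, part of the quality verification can be automated before any human review. The sketch below uses only Python's standard library and assumes submissions arrive as 16-bit PCM WAV files; the function name `inspect_wav`, the 16 kHz expected sample rate, and the clipping threshold are assumptions for illustration.

```python
import array
import sys
import wave


def inspect_wav(path, expected_rate=16000, clip_threshold=0.999):
    """Return basic quality facts for a 16-bit PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()
        n_frames = wav.getnframes()
        raw = wav.readframes(n_frames)

    if sample_width != 2:
        raise ValueError("this sketch only handles 16-bit PCM audio")

    samples = array.array("h")
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    # Samples at or near full scale usually indicate clipping (distortion).
    limit = clip_threshold * 32767
    clipped = sum(1 for s in samples if abs(s) >= limit)
    return {
        "duration_seconds": n_frames / rate,
        "sample_rate_ok": rate == expected_rate,
        "channels": channels,
        "peak": max((abs(s) for s in samples), default=0),
        "clipped_fraction": clipped / len(samples) if samples else 0.0,
    }
```

Clips that fail these checks can be rejected or sent back to the contributor, leaving reviewers to focus on transcript accuracy.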
4. Customer or Product Contributed Speech Data
Companies deploying speech recognition technologies can continuously collect user data as customers naturally interact with the product.
For example, Apple collects Siri voice recordings from opt-in users to train its models on linguistic patterns and accents specific to real-world usage scenarios.
Pros:
- Provides real-world data from target users
- Captures current vocabulary and language trends
- Steady stream of new data to keep improving performance
Cons:
- Must obtain clear user permissions
- Less control over recording environments
- Need large active user base for sufficient data
Customer data offers high-value samples at little extra collection cost. But such programs must respect privacy and give users something of value in return for sharing their personal data.
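In practice, respecting user permissions means filtering recordings against consent records before anything reaches a training set. The following is a minimal sketch under an assumed policy: a recording is usable only if the user has opted in, has not withdrawn consent, and made the recording after opting in. The `Recording` and `Consent` types and the function name are hypothetical, and real policies depend on your legal requirements.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Recording:
    user_id: str
    audio_path: str
    recorded_at: datetime


@dataclass(frozen=True)
class Consent:
    opted_in_at: datetime
    revoked_at: Optional[datetime] = None


def eligible_for_training(recordings, consents):
    """Return recordings covered by an active opt-in.

    `consents` maps user_id to a Consent record (assumed structure).
    """
    eligible = []
    for rec in recordings:
        consent = consents.get(rec.user_id)
        if consent is None or consent.revoked_at is not None:
            continue  # never opted in, or has withdrawn consent
        if rec.recorded_at < consent.opted_in_at:
            continue  # recorded before the user opted in
        eligible.append(rec)
    return eligible


if __name__ == "__main__":
    opted_in = datetime(2023, 1, 5, tzinfo=timezone.utc)
    recs = [Recording("u1", "a.wav", datetime(2023, 1, 10, tzinfo=timezone.utc))]
    print(eligible_for_training(recs, {"u1": Consent(opted_in_at=opted_in)}))
```

Under this policy, withdrawing consent excludes all of a user's recordings, including ones made while the opt-in was active.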
5. In-house Speech Data Collection
For niche use cases, companies may directly manage the speech data collection process in-house. This involves recruiting subjects, developing protocols, and recording audio internally.
Defense contractors often use this approach when developing classified speech recognition capabilities that require high levels of data privacy and control.
Pros:
- Full control over speakers, scripts, equipment, and recording conditions
- Helps protect privacy for sensitive applications
Cons:
- Expensive and time-intensive recruitment and recording effort
- Difficult to scale across demographics and languages
In-house collection is advisable when ultimate control over data is required, despite substantially higher costs.
Recommendations on Selecting the Optimal Approach
So which data collection method should you choose? Here are a few key considerations:
- Budget – Pre-made and crowdsourced data typically cost less than in-house efforts
- Languages – Crowdsourcing can usually cover far more languages than is feasible in-house
- Use Cases – Tailor data to likely speakers, accents and vocabularies
- Data Privacy – More sensitive apps may require in-house data control
The best approach often combines methods. For instance, many companies start with existing datasets for initial model training, then augment with crowdsourced and customer data specific to their system requirements and use cases.
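When combining sources this way, it helps to track how much of the training mix each source contributes, so that a large generic dataset does not drown out the smaller application-specific one. Here is a small sketch, assuming a combined manifest whose entries carry the hypothetical keys `source` and `duration` (in seconds).

```python
from collections import defaultdict


def hours_by_source(entries):
    """Return {source: (hours, share_of_total)} for a combined manifest."""
    seconds = defaultdict(float)
    for entry in entries:
        seconds[entry["source"]] += entry["duration"]
    total = sum(seconds.values())
    return {
        source: (secs / 3600, secs / total if total else 0.0)
        for source, secs in seconds.items()
    }


if __name__ == "__main__":
    manifest = [
        {"source": "open", "duration": 3600.0},
        {"source": "open", "duration": 1800.0},
        {"source": "crowdsourced", "duration": 1800.0},
    ]
    for source, (hours, share) in hours_by_source(manifest).items():
        print(f"{source}: {hours:.1f} h ({share:.0%})")
```

If the application-specific share is very small, one option is to oversample it during training or collect more of it before retraining.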
The Bottom Line
Effective speech data collection is critical yet challenging. By understanding the core methods available and the key selection criteria, you can develop a data strategy suited to your speech recognition application. With a data-driven plan in place, you'll set your project up for better accuracy and real-world performance.