Translator Device

Project Constraints

Primary limitations that will influence the project's design, scope, and execution.

Budget

Component selection (ESP32 model, microphone, display)
Overall quality and feature set
Prototyping costs and potential iterations

Parts & Components

Availability and shipping lead times
Component quality and reliability
Hardware compatibility and pin conflicts

Time

Software development and API integration
Learning curve for new technologies
Assembly, testing, and refinement cycles

Coding Ability

Handling real-time audio streams
Managing multiple cloud APIs and errors
Choice of IDE (Arduino vs. ESP-IDF)

Tool Access

Requires soldering iron, multimeter, etc.
3D printer needed for custom enclosure
Available tools limit physical build quality and ease of assembly

Safety

LiPo battery charging and management
Proper wiring and short-circuit protection
Enclosure design for electronic safety

Investigation & Research

Understanding the current landscape of translation devices and related technologies is crucial for making informed design decisions. This research phase explores existing solutions, technical possibilities, and contextual factors that will shape our approach.

Existing Solutions & Case Studies

Select Earbuds

What it does: Real-time translation through earbuds connected to smartphone

Hardware: Bluetooth earbuds, smartphone dependency

Interface: Voice commands, touch controls

Pros: High accuracy, seamless integration

Cons: Requires smartphone, expensive

Pocketalk Translator

What it does: Dedicated handheld translation device with built-in SIM

Hardware: Custom device, cellular modem, touchscreen

Interface: Physical buttons, 2.4" display

Pros: Standalone, global connectivity

Cons: Subscription required, bulky

Travis Touch Go

What it does: Pocket translator with offline capabilities

Hardware: ARM processor, 3.5" touchscreen, dual mics

Interface: Touch interface, voice activation

Pros: Offline mode, multiple languages

Cons: Limited offline accuracy, expensive

M2 Language Translator Earbuds

What it does: Two-way translation earbuds for conversations

Hardware: Bluetooth 5.0, noise cancellation

Interface: App-based control, touch sensors

Pros: Discreet, conversation-focused

Cons: App dependency, battery life

DIY ESP32 Voice Projects

What it does: Community projects using ESP32 for voice processing

Hardware: ESP32, I2S microphones, MAX98357A amp

Interface: Physical buttons, LED indicators

Pros: Low cost, customisable, educational

Cons: Limited processing, complex setup

Inspiration & Technical Exploration

System Architecture Exploration

Based on the devices researched, several architectural approaches can be considered. The following are the most promising.

Hardware Architecture Concepts

Concept 1: Minimal ESP32 Design

Pros: Low cost (~$30), simple assembly
Cons: No visual feedback, limited user interface
Use case: Proof of concept, basic functionality testing

Concept 2: Enhanced Display Design

Pros: Full UI, language selection, status display
Cons: Higher cost (~$60), more complex assembly
Use case: Production-ready device with full features

Software Architecture Flow

Translation Process Flow

Key Technical Decisions:

• Use FreeRTOS tasks for parallel processing
• Implement circular buffer for audio streaming
• Add retry logic with exponential backoff
• Cache common phrases to reduce API calls
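The phrase-caching decision can be illustrated with a small host-side Python sketch. It is not firmware code and makes several assumptions: the class name PhraseCache, the helper translate_cached, the 64-entry capacity, and the least-recently-used eviction policy are all hypothetical choices, and translate_fn stands in for whichever translation API is eventually used.

```python
from collections import OrderedDict


class PhraseCache:
    """Small LRU cache keyed by (source language, target language, phrase)."""

    def __init__(self, capacity=64):  # capacity is an assumed value
        self.capacity = capacity
        self._items = OrderedDict()

    @staticmethod
    def _key(src, dst, text):
        # Normalise case and whitespace so "Thank you " and "thank you" share an entry.
        return (src, dst, " ".join(text.lower().split()))

    def get(self, src, dst, text):
        key = self._key(src, dst, text)
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, src, dst, text, translation):
        key = self._key(src, dst, text)
        self._items[key] = translation
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry


def translate_cached(cache, src, dst, text, translate_fn):
    """Return a cached translation, calling translate_fn only on a cache miss."""
    hit = cache.get(src, dst, text)
    if hit is not None:
        return hit
    result = translate_fn(src, dst, text)
    cache.put(src, dst, text, result)
    return result
```

On the device the same idea would need a fixed memory budget, so the capacity would have to be chosen against available RAM.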

Visual Design Inspiration & Component Photos

ESP32-S3 Development Board

INMP441 I2S Microphone

MAX98357A I2S Amplifier

2.4" ILI9341 TFT Display

Touch screen

Rectangular, 120 × 60 × 15 mm, touchscreen interface

Compact Pod

Compact, 80 × 80 × 25 mm, button-based interface, smaller screen

Pendant Style

Wearable, 50 × 30 × 20 mm, single-button language switch

The devices shown are third-party products.

Development Platform Options

Arduino IDE

Pros: Simple, familiar, good libraries

Cons: Limited debugging, basic audio support

ESP-IDF

Pros: Full control, advanced audio, debugging

Cons: Steeper learning curve, more complex

PlatformIO

Pros: Best of both, VS Code integration

Cons: Additional setup complexity

Visual Studio Code

Pros: Many extensions to assist with coding

Cons: No native ESP32 support; relies on extensions

API Service Comparison

Google Cloud APIs

Accuracy: 95%+ speech, 90%+ translation

Cost: $1.44/hour speech recognition

Azure Cognitive Services

Accuracy: 93%+ speech, 88%+ translation

Cost: $1.00/hour speech recognition

Amazon Polly/Transcribe

Accuracy: 92%+ speech, 85%+ translation

Cost: $0.96/hour speech recognition

Multiple Providers

Accuracy: Varies

Cost: Typically the most cost-effective option
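To put the hourly speech recognition rates in perspective, the sketch below converts them into a monthly cost for a given usage pattern. The rates are the ones listed above. The usage figures in the example (50 utterances a day, 6 seconds each) are hypothetical, and the calculation assumes billing is proportional to audio duration, which ignores free tiers and minimum billing increments.

```python
# Hourly speech recognition rates in USD, as listed in the comparison above.
HOURLY_RATE_USD = {
    "Google Cloud": 1.44,
    "Azure": 1.00,
    "Amazon": 0.96,
}


def speech_cost_usd(provider, seconds):
    """Cost of recognising `seconds` of audio, assuming pro-rata billing."""
    return HOURLY_RATE_USD[provider] * seconds / 3600.0


def monthly_cost_usd(provider, utterances_per_day, seconds_each, days=30):
    """Cost of a month of use for a simple, assumed usage pattern."""
    total_seconds = utterances_per_day * seconds_each * days
    return speech_cost_usd(provider, total_seconds)


if __name__ == "__main__":
    for name in HOURLY_RATE_USD:
        print(f"{name}: ${monthly_cost_usd(name, 50, 6):.2f} per month")
```

Under these assumptions the example usage comes to 2.5 hours of audio a month, so the difference between providers is a little over one dollar a month.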

Assembly Process

1. Breadboard prototype testing
2. PCB design and ordering
3. Component soldering
4. Enclosure 3D printing
5. Final assembly & testing

Power Consumption Analysis

These figures are taken from the component datasheets and may not reflect all real-world use cases.

Component Power Draw

ESP32-S3 (Active): 240mA @ 3.3V

ESP32-S3 (Sleep): 10μA @ 3.3V

TFT Display (On): 50mA @ 3.3V

Audio Components: 80mA @ 3.3V

Total (Active): 370mA @ 3.3V

Battery Life Estimation

2000mAh LiPo (Active): ~5.4 hours

With 50% sleep time: ~8.5 hours

Standby (sleep mode): >1 month

Strategy: Offer power-saving options the end user can enable when desired: sleep modes between translations, display auto-off after 30 s, and Wi-Fi power saving.
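The battery life estimates can be reproduced with a simple duty-cycle model, sketched below in Python. It uses the current figures from the table above and assumes an ideal battery: no regulator losses, no self-discharge, and full use of the rated capacity. The usable_fraction parameter is a hypothetical knob for derating the capacity.

```python
# Active current draw in mA, taken from the component table above.
ACTIVE_MA = {"ESP32-S3": 240.0, "TFT display": 50.0, "Audio": 80.0}
SLEEP_MA = 0.010        # 10 µA sleep current
BATTERY_MAH = 2000.0    # rated LiPo capacity


def average_current_ma(sleep_fraction):
    """Time-weighted average current for a given fraction of time spent asleep."""
    active_total = sum(ACTIVE_MA.values())
    return (1.0 - sleep_fraction) * active_total + sleep_fraction * SLEEP_MA


def runtime_hours(sleep_fraction, usable_fraction=1.0):
    """Estimated runtime in hours; usable_fraction derates the capacity (assumed)."""
    return BATTERY_MAH * usable_fraction / average_current_ma(sleep_fraction)


if __name__ == "__main__":
    print(f"Always active: {runtime_hours(0.0):.1f} h")
    print(f"50% sleep:     {runtime_hours(0.5):.1f} h")
```

With no sleep the model gives about 5.4 hours, matching the estimate above. With 50% sleep it gives about 10.8 hours, which is higher than the ~8.5 hours quoted. The gap would be consistent with wake-up overhead, peripherals that stay powered during sleep, or a derated capacity, but that is an assumption rather than something the datasheet figures show.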

Analysis of Findings

Key Design Influences from Research

Hardware Design Decisions

• I2S Audio Pipeline: INMP441 → ESP32-S3 → MAX98357A keeps the audio path digital from microphone to amplifier, with the aim of audio quality comparable to commercial devices.
• Display Integration: 2.4" TFT shows transcriptions and translations as text, so users can check what the device understood and judge how far to trust the translation.
• Standalone Design: No smartphone dependency removes phone-pairing issues, reduces privacy concerns, and improves ease of use
• Physical Controls: Dedicated push-to-talk button ensures reliable operation under stress
• Compact Form Factor: Pocket-sized design inspired by successful devices like Pocketalk

Software Architecture Decisions

• Google Cloud APIs: Primary choice for highest accuracy rates (95%+ speech recognition)
• Azure Fallback: A second provider avoids a single point of failure
• ESP-IDF Platform: Advanced audio handling capabilities justify learning curve
• FreeRTOS Tasks: Parallel processing prevents UI blocking during API calls
• Progressive UI: Visual status indicators address user uncertainty during processing
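The primary-plus-fallback arrangement can be sketched as an ordered list of providers that are tried in turn. This is a host-side Python illustration only: ProviderError and call_with_fallback are hypothetical names, and each provider is represented by a plain callable rather than a real Google Cloud or Azure client.

```python
class ProviderError(Exception):
    """Raised by a provider callable when a request fails."""


def call_with_fallback(providers, *args):
    """Try each (name, callable) pair in order.

    Returns (name, result) from the first provider that succeeds and raises
    ProviderError if every provider fails.
    """
    errors = []
    for name, fn in providers:
        try:
            return name, fn(*args)
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))
```

Returning the provider name alongside the result lets the UI or logs show when the fallback was used, which helps when judging how often the primary service fails.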

Critical Technical Challenges

Real-time Audio Processing

Challenge: The ESP32-S3 has limited internal RAM (512 KB SRAM) for audio buffering

Solution: Implement circular buffer with 4KB chunks, stream directly to API without full file storage

Implementation: Use I2S DMA for hardware-level audio capture, FreeRTOS queue for buffer management
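The circular buffer of 4 KB chunks can be modelled on a host machine before it is ported to firmware. The Python sketch below is an illustration under assumptions: the class name ChunkRing, the eight-slot size, and the policy of overwriting the oldest chunk when the buffer is full are all hypothetical choices. On the device, the I2S DMA driver and a FreeRTOS queue would take the place of the write and read calls.

```python
CHUNK_BYTES = 4096  # 4 KB chunks, as proposed above


class ChunkRing:
    """Fixed-size ring of preallocated 4 KB slots; overwrites the oldest chunk when full."""

    def __init__(self, slots=8):  # 8 slots = 32 KB, an assumed budget
        self._slots = [bytearray(CHUNK_BYTES) for _ in range(slots)]
        self._head = 0      # index of the next slot to write
        self._tail = 0      # index of the next slot to read
        self._count = 0     # number of unread chunks
        self.dropped = 0    # chunks overwritten before they were read

    def write(self, chunk):
        if len(chunk) != CHUNK_BYTES:
            raise ValueError("chunk must be exactly CHUNK_BYTES long")
        if self._count == len(self._slots):
            # Buffer full: discard the oldest unread chunk to make room.
            self._tail = (self._tail + 1) % len(self._slots)
            self._count -= 1
            self.dropped += 1
        self._slots[self._head][:] = chunk  # copy into the preallocated slot
        self._head = (self._head + 1) % len(self._slots)
        self._count += 1

    def read(self):
        """Return the oldest unread chunk, or None if the buffer is empty."""
        if self._count == 0:
            return None
        data = bytes(self._slots[self._tail])
        self._tail = (self._tail + 1) % len(self._slots)
        self._count -= 1
        return data

    def __len__(self):
        return self._count
```

If audio were captured at 16 kHz, 16-bit mono (an assumption, since the sample rate is not fixed here), each chunk would hold 128 ms of audio and eight slots would hold about one second, while using 32 KB of RAM. The dropped counter gives a simple way to see whether the uploader is keeping up with capture.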

Network Reliability

Challenge: Wi-Fi dropouts during translation cause user frustration

Solution: Exponential backoff retry (1s, 2s, 4s), connection health monitoring, user feedback

Implementation: Background task monitors RSSI, preemptive reconnection, cached translation pairs
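The retry schedule of 1 s, 2 s, 4 s can be expressed as a small helper, sketched here in Python. The function names and the choice of ConnectionError and TimeoutError as retryable errors are assumptions. The sleep function is passed in as a parameter so the logic can be tested without real delays.

```python
import time


def backoff_delays(base_s=1.0, factor=2.0, attempts=3):
    """Delays for exponential backoff: 1 s, 2 s, 4 s with the default arguments."""
    return [base_s * factor ** i for i in range(attempts)]


def retry(operation, delays=None, sleep=time.sleep,
          retry_on=(ConnectionError, TimeoutError)):
    """Call operation(), waiting and retrying after each retryable failure.

    Makes one initial attempt plus one retry per delay. If the final attempt
    fails, its exception propagates to the caller.
    """
    if delays is None:
        delays = backoff_delays()
    for delay in delays:
        try:
            return operation()
        except retry_on:
            sleep(delay)
    return operation()  # final attempt after the last delay
```

With the default schedule the worst case is four attempts and 7 seconds of waiting, so the user feedback mentioned above matters: the display should show that a retry is in progress rather than appearing frozen.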

Power Management

Challenge: Constant Wi-Fi + display + audio processing drains battery quickly

Solution: Aggressive sleep states, display dimming, Wi-Fi power save mode

Implementation: Modem sleep between translations, 5-second display timeout, CPU frequency scaling from 240 MHz to 80 MHz

Audio Quality Control

Challenge: Background noise and varying input levels affect recognition accuracy

Solution: Software AGC (Automatic Gain Control), noise gate, pre-processing filters

Implementation: Digital filters in ESP32, volume normalisation, silence detection thresholds
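A noise gate and volume normalisation can be prototyped on recorded frames before committing to a firmware implementation. The Python sketch below is a simplified stand-in for AGC: it applies one gain per frame based on that frame's RMS level and does not smooth the gain between frames as a real AGC would. The threshold, target level, and gain limit are hypothetical values that would need tuning against real microphone recordings.

```python
import math


def rms(frame):
    """Root-mean-square level of a frame of signed 16-bit samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def process_frame(frame, gate_rms=200.0, target_rms=3000.0, max_gain=8.0):
    """Silence frames below the gate threshold, otherwise scale towards target_rms.

    All three thresholds are assumed values for illustration.
    """
    level = rms(frame)
    if level < gate_rms:
        return [0] * len(frame)  # noise gate: treat the frame as silence
    gain = min(target_rms / level, max_gain)  # cap gain so noise is not over-amplified
    # Scale each sample and clamp to the signed 16-bit range.
    return [max(-32768, min(32767, int(round(s * gain)))) for s in frame]
```

The same RMS measurement can double as the silence detector: a run of gated frames could be used to decide that the speaker has finished.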

Design Pitfalls to Avoid

Complex Menu Systems

Problem: Multi-level menus slow down urgent translation needs

Avoidance: Maximum 2-level menu depth, easy to navigate, clear visual hierarchy

Reference: Travis Touch Go suffers from deep menu navigation

Subscription Dependencies

Problem: Monthly fees create barriers to usage and ownership

Avoidance: Pay-per-use API model, user controls their own API keys, offline fallback

Reference: Pocketalk's $50/year subscription can discourage adoption

Fragile Construction

Problem: Travel devices need to survive drops and environmental stress

Avoidance: Reinforced corners, flexible materials, internal shock mounting for electronics

Reference: Many DIY projects fail due to poor mechanical design and circuitry

Unclear Status Indicators

Problem: Users lose confidence when they can't see device state

Avoidance: Always-visible status, progress bars, clear error messages

Reference: Wireless earbuds (e.g. Samsung Buds 4) provide minimal feedback, causing user confusion

Poor Audio Quality

Problem: Low-quality microphones cause speech recognition failures

Avoidance: Professional-grade I2S microphone, proper acoustic design, noise cancellation

Reference: Arduino projects often use poor analogue microphones

Implementation Strategy Based on Findings

Phase 1: Core Functionality

• Build minimal hardware (ESP32-S3 + INMP441 + MAX98357A)
• Implement basic audio capture and playback
• Test ElevenLabs and LLM services
• Validate translation pipeline end-to-end
• Measure latency and audio quality

Phase 2: User Interface

• Add 2.4" TFT display for visual feedback
• Implement language selection interface
• Add status indicators and progress bars
• Create error handling and retry logic
• Test usability with target users

Phase 3: Production Ready

• Design and 3D print professional enclosure
• Implement power management and battery charging
• Add fallback API services for reliability
• Optimise for the sub-3-second translation target
• Conduct durability and field testing

Success Criteria

Success for this project is not just creating a working device, but one that meets specific, measurable targets. This section defines the benchmarks for functionality, performance, usability, and staying within the project constraints. Meeting these criteria will show that the device is successful.

1. Functionality

Does the device perform its core tasks correctly?

Core Feature Checklist

✓ Audio capture starts/stops on command.
✓ Device connects to Wi-Fi.
✓ Audio sent to speech-to-text service.
✓ Text sent to translation service.
✓ Translated text sent to text-to-speech.
✓ Translated audio is played.
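The checklist above describes a linear pipeline: speech-to-text, then translation, then text-to-speech. The Python sketch below shows one way to wire the stages together and time each one, which also supports the latency measurements planned for Phase 1. The function name run_pipeline and the stage signatures are assumptions. The three stages are passed in as callables so they can be replaced with test doubles or with real API clients.

```python
import time


def run_pipeline(audio, src, dst, stt, translate, tts, clock=time.perf_counter):
    """Run audio through STT, translation and TTS, recording the time per stage."""
    timings = {}

    start = clock()
    transcript = stt(audio, src)
    timings["stt"] = clock() - start

    start = clock()
    translation = translate(transcript, src, dst)
    timings["translate"] = clock() - start

    start = clock()
    speech = tts(translation, dst)
    timings["tts"] = clock() - start

    # Total processing time across the three stages (excludes capture and playback).
    timings["total"] = sum(timings.values())
    return {
        "transcript": transcript,
        "translation": translation,
        "audio": speech,
        "timings_s": timings,
    }
```

Keeping the transcript and translation in the result means both can be shown on the display, and the per-stage timings show which service dominates the overall delay.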

Accuracy Targets

Translation: >90%

Speech Recognition: >95%
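One way to check the speech recognition target is to compare transcripts against known reference sentences and count word-level errors. The Python sketch below defines accuracy as one minus the word error rate, which is an assumption, since the targets above do not say how accuracy is measured. It applies to speech recognition only. Translation quality needs a different measure, such as human review.

```python
def word_errors(reference, hypothesis):
    """Return (edit distance in words, reference length) for one sentence pair."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def word_accuracy(pairs):
    """Accuracy as 1 - WER over a list of (reference, hypothesis) pairs."""
    errors = total = 0
    for reference, hypothesis in pairs:
        e, n = word_errors(reference, hypothesis)
        errors += e
        total += n
    return 1.0 - errors / total if total else 0.0
```

For example, recognising "where is the train station" as "where is a train" counts as two errors in five words (one substitution and one deletion). A test set would need enough sentences, speakers, and noise conditions for the resulting figure to mean something against the >95% target.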

2. Performance & Reliability

How well and how consistently does the device work?

Translation Speed: Target < 3.0 s

Wi-Fi Reconnect: < 15 s

Operational Stability: 1 hr minimum continuous use

3. Usability

Is the device practical and easy for someone to use?

Device Setup: < 60 s to success

Portability: ✓ Pocket friendly

Battery Life: 1+ hr active use

4. Adherence to Constraints

Did the project meet its initial goals and limits?

Final Budget: On Target ($100 maximum)

Timeline: On Target (project started)