Init commit

2025-10-23 21:54:13 -07:00 · 2025-10-23 21:54:13 -07:00 · 85691f13dc
commit 85691f13dc
36 changed files with 2655 additions and 0 deletions
--- a/dailyrutines.dubbed.es.aligned.mp4
+++ b/dailyrutines.dubbed.es.aligned.mp4
--- a/dailyrutines.dubbed.es.mixed.mp4
+++ b/dailyrutines.dubbed.es.mixed.mp4
--- a/dailyrutines.dubbed.es.mixed.subs.mp4
+++ b/dailyrutines.dubbed.es.mixed.subs.mp4
--- a/dailyrutines.dubbed.es.mp4
+++ b/dailyrutines.dubbed.es.mp4
--- a/dailyrutines.dubbed.es.subs.mp4
+++ b/dailyrutines.dubbed.es.subs.mp4
--- a/dailyrutines.dubbed.gemini.mp4
+++ b/dailyrutines.dubbed.gemini.mp4
--- a/dailyrutines.dubbed.mp4
+++ b/dailyrutines.dubbed.mp4
--- a/dailyrutines.mp4
+++ b/dailyrutines.mp4
--- a/dailyrutines.replaced_audio.mp4
+++ b/dailyrutines.replaced_audio.mp4
--- a/dailyrutines.replaced_audio.subs.mp4
+++ b/dailyrutines.replaced_audio.subs.mp4
--- a/prompt_init.md
+++ b/prompt_init.md
@ -0,0 +1,112 @@
+You can run OpenAI's **Whisper** model for audio-to-text transcription on a **CPU** using **PyTorch**, typically by either using the original `openai-whisper` library or the Hugging Face `transformers` implementation.
+
+### Using the `openai-whisper` library
+
+1.  **Installation:** Ensure you have Python, PyTorch (CPU version), and **FFmpeg** installed.
+
+    ```bash
+    # Install the Whisper package
+    pip install -U openai-whisper
+    # On Linux, install FFmpeg (example for Debian/Ubuntu)
+    sudo apt update && sudo apt install ffmpeg
+    ```
+
+2.  **Specify CPU in Python:** Pass `device='cpu'` to `whisper.load_model()`. Alternatively, load the model first and then move it to the CPU with `.to("cpu")`.
+
+    ```python
+    import whisper
+
+    # Load the model and specify 'cpu' as the device
+    model = whisper.load_model("base", device='cpu') 
+
+    # Or, if loading and then moving:
+    # model = whisper.load_model("base").to("cpu") 
+
+    # Transcribe the audio file
+    result = model.transcribe("path/to/your/audio.mp3", fp16=False) # FP16 is not supported on CPU; fp16=False selects FP32 up front
+
+    print(result["text"])
+    ```
+
+    *Note: Using a smaller model like `"tiny"` or `"base"` will be significantly faster on a CPU.*
+
+-----
+
+### Using the Hugging Face `transformers` library
+
+The Hugging Face `transformers` library also provides a way to run Whisper and often includes optimizations:
+
+1.  **Installation:** Install the necessary libraries, ensuring you have the CPU-only version of PyTorch if you don't have a GPU.
+
+    ```bash
+    pip install transformers datasets accelerate torch
+    ```
+
+2.  **Setup and Pipeline:** Use `AutoModelForSpeechSeq2Seq`, `AutoProcessor`, and `pipeline` from `transformers`, explicitly setting the device to `"cpu"`:
+
+    ```python
+    import torch
+    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
+
+    # Set device to CPU
+    device = "cpu"
+    torch_dtype = torch.float32 # float32 is the safe default on CPU
+
+    # Choose a model size
+    model_id = "openai/whisper-base" # Example model
+
+    # Load model and processor
+    model = AutoModelForSpeechSeq2Seq.from_pretrained(
+        model_id, 
+        torch_dtype=torch_dtype, 
+        low_cpu_mem_usage=True, 
+        use_safetensors=True
+    ).to(device)
+
+    processor = AutoProcessor.from_pretrained(model_id)
+
+    # Create the ASR pipeline
+    pipe = pipeline(
+        "automatic-speech-recognition",
+        model=model,
+        tokenizer=processor.tokenizer,
+        feature_extractor=processor.feature_extractor,
+        torch_dtype=torch_dtype,
+        device=device,
+    )
+
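+    # Transcribe. Whisper works on 30-second windows, so longer files need chunking;
+    # chunk_length_s=30 asks the pipeline to split the audio accordingly.
+    result = pipe("path/to/your/audio.mp3", chunk_length_s=30)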
+    print(result["text"])
+    ```
+
+-----
+
+### Optimization: `faster-whisper`
+
+For much better performance on a CPU (up to 4 times faster), consider using the **`faster-whisper`** library, which uses the CTranslate2 inference engine:
+
+1.  **Installation:**
+
+    ```bash
+    pip install faster-whisper
+    ```
+
+2.  **Usage:**
+
+    ```python
+    from faster_whisper import WhisperModel
+
+    model_size = "base" # Choose a model size
+
+    # Run on CPU with INT8 precision for speed
+    model = WhisperModel(model_size, device="cpu", compute_type="int8") 
+
+    segments, info = model.transcribe("path/to/your/audio.mp3", beam_size=5)
+
+    for segment in segments:
+        print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
+    ```
+
+The [Whisper: Install Guide](https://www.youtube.com/watch?v=XX-ET_-onYU) video walks through the initial installation steps for Whisper AI, which is a prerequisite for running it with PyTorch on any device.
--- a/whisper_project/README.md
+++ b/whisper_project/README.md
@ -0,0 +1,131 @@
+# Example project: transcriber and dubber (Whisper + Coqui TTS)
+
+This repo contains utilities for transcribing audio/video on CPU using
+different Whisper backends and for generating segment-by-segment dubbing with
+Coqui TTS. It is intended as a lightweight, reproducible base for CPU testing.
+
+Main contents
+- `transcribe.py`: main CLI covering transcription (openai-whisper /
+	transformers / faster-whisper), SRT generation (including fallback),
+	per-segment TTS synthesis, and the dubbing pipeline (replace / mix).
+- `process_video.py`: high-level pipeline (extract audio, transcribe,
+	optionally translate, generate SRT, and burn subtitles into the video).
+
+System requirements
+- `ffmpeg` (available on PATH)
+- For Coqui TTS, Miniforge/Conda with Python 3.11 on CPU is recommended.
+
+Quick install
+
+1) Lightweight environment (transcription only, with `faster-whisper` and minimal dependencies):
+
+```bash
+python -m venv .venv
+source .venv/bin/activate
+pip install -r requirements.txt
+```
+
+2) Environment for Coqui TTS (recommended if you plan to synthesize/dub):
+
+```bash
+# Install Miniforge/Miniconda if you don't have it, then:
+conda create -n tts_env python=3.11 -y
+conda activate tts_env
+# PyTorch CPU + TTS
+python -m pip install --index-url https://download.pytorch.org/whl/cpu torch torchvision torchaudio
+python -m pip install TTS faster-whisper
+```
+
+Note: in the example session, `tts_env` was created and `TTS` and `faster-whisper` were installed in it.
+
+Usage: practical examples
+
+1) Transcribe an audio file and generate an SRT (faster-whisper, `base` model):
+
+```bash
+# from the environment where faster-whisper is available
+python whisper_project/transcribe.py \
+	--file whisper_project/dailyrutines.audio.wav \
+	--backend faster-whisper --model base --srt
+```
+
+2) Generate a fallback SRT (no text) that splits the audio into uniform segments:
+
+```bash
+python whisper_project/transcribe.py -f path/to/audio.wav -b transformers --srt --srt-fallback
+```
+
+3) Per-segment transcription (extracts pieces and transcribes each one):
+
+```bash
+python whisper_project/transcribe.py -f path/to/audio.wav --segment-transcribe --srt --srt-segment-seconds 8
+```
+
+4) Per-segment dubbing (replace = replaces the original voice):
+
+```bash
+# using the tts_env environment where you installed Coqui TTS
+conda activate tts_env
+python whisper_project/transcribe.py \
+	--file whisper_project/dailyrutines.audio.wav \
+	--segment-transcribe --srt --srt-file whisper_project/dailyrutines.kokoro.dub.srt \
+	--srt-segment-seconds 6 \
+	--tts-model tts_models/en/ljspeech/tacotron2-DDC \
+	--tts-model-repo tts_models/en/ljspeech/tacotron2-DDC \
+	--dub --dub-mode replace --dub-out whisper_project/dailyrutines.kokoro.dub.wav
+```
+
+5) Per-segment dubbing (mix = mixes the TTS with the original):
+
+```bash
+python whisper_project/transcribe.py \
+	--file whisper_project/dailyrutines.audio.wav \
+	--segment-transcribe --dub --dub-mode mix --dub-mix-level 0.7 \
+	--tts-model tts_models/en/ljspeech/tacotron2-DDC --dub-out out_mix.wav
+```
+
+Remuxing the dubbed audio into the MP4 and burning subtitles
+
+1) Replace the audio track in the MP4 with the dubbed track (without re-encoding the video):
+
+```bash
+ffmpeg -y -i dailyrutines.mp4 -i whisper_project/dailyrutines.kokoro.dub.wav -c:v copy -map 0:v:0 -map 1:a:0 -shortest dailyrutines.kokoro.dub.mp4
+```
+
+2) Burn (hardcode) subtitles into the video (requires re-encoding the video):
+
+```bash
+ffmpeg -y -i dailyrutines.mp4 -vf "subtitles=whisper_project/dailyrutines.kokoro.dub.srt:force_style='FontName=Arial,FontSize=24'" -c:a copy dailyrutines.kokoro.subs.mp4
+```
+
+Notes on Hugging Face models and tokens
+- If the TTS or ASR model repo is private, you will need to
+	export `HUGGINGFACE_HUB_TOKEN` in the environment before running the script
+	so that `huggingface_hub.snapshot_download` can access it. Example:
+
+```bash
+export HUGGINGFACE_HUB_TOKEN="hf_xxx..."
+```
+
+Performance and recommendations
+- On CPU, use small models (`tiny`, `base`) for acceptable run times.
+- `faster-whisper` with `compute_type=int8` reduces memory use and speeds up CPU inference.
+- For production with GPU, install the appropriate PyTorch GPU wheels
+	and enable the GPU in the TTS and Whisper calls if your hardware allows it
+	(for example `gpu=True` for Coqui TTS and `device="cuda"` for the Whisper backends).
+
+Common problems and troubleshooting
+- "No module named 'faster_whisper'": install `faster-whisper` in your environment.
+- Coqui TTS takes a while to download models the first time (they weigh tens of MB or more).
+- If TTS synthesis is slow, check that `torch` is installed and that
+	`TTS` uses an optimized vocoder (HiFi-GAN is the default in the example
+	models).
+
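The two ffmpeg invocations in the README's remux section could also be assembled from Python as argument lists, which avoids shell quoting problems with the `subtitles` filter. This is only a sketch: `build_remux_cmd` and `build_burn_cmd` are hypothetical names that do not appear in the commit, the flags simply mirror the README commands, and nothing is executed.

```python
from typing import List


def build_remux_cmd(video: str, dubbed_audio: str, out_path: str) -> List[str]:
    """Replace the audio track while copying the video stream untouched."""
    return [
        "ffmpeg", "-y",
        "-i", video,
        "-i", dubbed_audio,
        "-c:v", "copy",      # no video re-encode
        "-map", "0:v:0",     # video from the first input
        "-map", "1:a:0",     # audio from the second input
        "-shortest",
        out_path,
    ]


def build_burn_cmd(video: str, srt_path: str, out_path: str,
                   font: str = "Arial", size: int = 24) -> List[str]:
    """Hardcode subtitles; the video is re-encoded, the audio is copied."""
    vf = f"subtitles={srt_path}:force_style='FontName={font},FontSize={size}'"
    return ["ffmpeg", "-y", "-i", video, "-vf", vf, "-c:a", "copy", out_path]


if __name__ == "__main__":
    # Print the command instead of running it.
    print(" ".join(build_remux_cmd("dailyrutines.mp4", "dub.wav", "out.mp4")))
```

Either list can be handed to `subprocess.run(cmd, check=True)` once the input files exist.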
--- a/whisper_project/pycache/dub_and_burn.cpython-313.pyc
+++ b/whisper_project/pycache/dub_and_burn.cpython-313.pyc
--- a/whisper_project/pycache/process_video.cpython-313.pyc
+++ b/whisper_project/pycache/process_video.cpython-313.pyc
--- a/whisper_project/pycache/transcribe.cpython-313.pyc
+++ b/whisper_project/pycache/transcribe.cpython-313.pyc
--- a/whisper_project/coqui_test.wav
+++ b/whisper_project/coqui_test.wav
--- a/whisper_project/dailyrutines.audio.srt
+++ b/whisper_project/dailyrutines.audio.srt
@ -0,0 +1,56 @@
+1
+00:00:00,000 --> 00:00:10,000
+
+2
+00:00:10,000 --> 00:00:20,000
+
+3
+00:00:20,000 --> 00:00:30,000
+
+4
+00:00:30,000 --> 00:00:40,000
+
+5
+00:00:40,000 --> 00:00:50,000
+
+6
+00:00:50,000 --> 00:01:00,000
+
+7
+00:01:00,000 --> 00:01:10,000
+
+8
+00:01:10,000 --> 00:01:20,000
+
+9
+00:01:20,000 --> 00:01:30,000
+
+10
+00:01:30,000 --> 00:01:40,000
+
+11
+00:01:40,000 --> 00:01:50,000
+
+12
+00:01:50,000 --> 00:02:00,000
+
+13
+00:02:00,000 --> 00:02:10,000
+
+14
+00:02:10,000 --> 00:02:20,000
+
+15
+00:02:20,000 --> 00:02:30,000
+
+16
+00:02:30,000 --> 00:02:40,000
+
+17
+00:02:40,000 --> 00:02:50,000
+
+18
+00:02:50,000 --> 00:03:00,000
+
+19
+00:03:00,000 --> 00:03:09,009
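The file above looks like the README's fallback SRT: empty cues in uniform 10-second windows, with a shorter final cue. A minimal sketch of how such a file could be produced follows. The function names are hypothetical and this is not the implementation in `transcribe.py`, which is not shown in this view. Times are kept as integer milliseconds to avoid floating-point rounding.

```python
from typing import List, Tuple


def format_timestamp(ms: int) -> str:
    """Render milliseconds as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def uniform_cues(total_ms: int, step_ms: int) -> List[Tuple[int, int]]:
    """Split [0, total_ms) into consecutive windows of at most step_ms."""
    if step_ms <= 0:
        raise ValueError("step_ms must be positive")
    return [(start, min(start + step_ms, total_ms))
            for start in range(0, total_ms, step_ms)]


def fallback_srt(total_ms: int, step_ms: int = 10_000) -> str:
    """Build an SRT document whose cues carry no text."""
    blocks = []
    for index, (start, end) in enumerate(uniform_cues(total_ms, step_ms), start=1):
        blocks.append(f"{index}\n{format_timestamp(start)} --> {format_timestamp(end)}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    # 189.009 s is the duration implied by the last cue of the file above.
    print(fallback_srt(189_009))
```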
--- a/whisper_project/dailyrutines.audio.wav
+++ b/whisper_project/dailyrutines.audio.wav
--- a/whisper_project/dailyrutines.kokoro.api.wav
+++ b/whisper_project/dailyrutines.kokoro.api.wav
--- a/whisper_project/dailyrutines.kokoro.dub.es.aligned.wav
+++ b/whisper_project/dailyrutines.kokoro.dub.es.aligned.wav
--- a/whisper_project/dailyrutines.kokoro.dub.es.srt
+++ b/whisper_project/dailyrutines.kokoro.dub.es.srt
@ -0,0 +1,72 @@
+1
+00:00:00,000 --> 00:00:06,960
+Rutinas diarias
+
+2
+00:00:06,960 --> 00:00:14,480
+Hola mamá, estoy disfrutando la vida en Nueva Zelanda.
+
+3
+00:00:14,480 --> 00:00:19,240
+El campo es tan hermoso.
+
+4
+00:00:19,240 --> 00:00:23,199
+Mi rutina es diferente ahora.
+
+5
+00:00:23,199 --> 00:00:29,960
+Me despierto a las 6 en punto cada mañana y salgo a correr.
+
+6
+00:00:29,960 --> 00:00:36,640
+A las 7 en punto desayuno.
+
+7
+00:00:36,640 --> 00:00:42,120
+El café en Nueva Zelanda es tan bueno.
+
+8
+00:00:42,120 --> 00:00:46,240
+A las 8 voy a trabajar.
+
+9
+00:00:46,240 --> 00:00:52,679
+Normalmente tomo el autobús, pero a veces camino.
+
+10
+00:00:52,679 --> 00:00:57,439
+Empiezo a trabajar a las 9.
+
+11
+00:00:57,439 --> 00:01:02,399
+Trabajo en mi oficina hasta la hora del almuerzo.
+
+12
+00:01:02,399 --> 00:01:08,920
+A las 12 almuerzo con mis colegas en el parque.
+
+13
+00:01:08,920 --> 00:01:15,239
+Es agradable disfrutar del aire fresco y charlar juntos.
+
+14
+00:01:15,239 --> 00:01:23,759
+A las 5 salgo del trabajo y voy al gimnasio.
+
+15
+00:01:23,760 --> 00:01:32,920
+Hago ejercicio hasta las seis y luego voy a casa.
+
+16
+00:01:32,920 --> 00:01:39,520
+A las 8 ceno, luego me relajo.
+
+17
+00:01:39,520 --> 00:01:44,800
+I normally go to bed at 11 o'clock.
+
+18
+00:01:44,799 --> 00:01:51,799
+Hasta pronto, Stephen.
+
--- a/whisper_project/dailyrutines.kokoro.dub.es.wav
+++ b/whisper_project/dailyrutines.kokoro.dub.es.wav
--- a/whisper_project/dailyrutines.kokoro.dub.srt
+++ b/whisper_project/dailyrutines.kokoro.dub.srt
@ -0,0 +1,71 @@
+1
+00:00:00,000 --> 00:00:06,960
+Dayly routines
+
+2
+00:00:06,960 --> 00:00:14,480
+Hi mom, I'm enjoying life in New Zealand.
+
+3
+00:00:14,480 --> 00:00:19,240
+The countryside is so beautiful.
+
+4
+00:00:19,240 --> 00:00:23,199
+My routine is different now.
+
+5
+00:00:23,199 --> 00:00:29,960
+I wake at 6 o'clock every morning and go for a run.
+
+6
+00:00:29,960 --> 00:00:36,640
+At 7 o'clock I have breakfast.
+
+7
+00:00:36,640 --> 00:00:42,120
+The coffee in New Zealand is so good.
+
+8
+00:00:42,120 --> 00:00:46,240
+At 8 o'clock I go to work.
+
+9
+00:00:46,240 --> 00:00:52,679
+I usually take the bus, but sometimes I walk.
+
+10
+00:00:52,679 --> 00:00:57,439
+I start work at 9 o'clock.
+
+11
+00:00:57,439 --> 00:01:02,399
+I work in my office until lunchtime.
+
+12
+00:01:02,399 --> 00:01:08,920
+At 12 o'clock I have lunch with my colleagues in the park.
+
+13
+00:01:08,920 --> 00:01:15,239
+It's nice to enjoy the fresh air and chat together.
+
+14
+00:01:15,239 --> 00:01:23,759
+At 5 o'clock I leave work and go to the gym.
+
+15
+00:01:23,760 --> 00:01:32,920
+I exercise until 6 o'clock and then go home.
+
+16
+00:01:32,920 --> 00:01:39,520
+At 8 o'clock I eat dinner, then relax.
+
+17
+00:01:39,520 --> 00:01:44,800
+I normally go to bed at 11 o'clock.
+
+18
+00:01:44,799 --> 00:01:51,799
+See you soon, Stephen.
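Each cue in these SRT files defines the time slot that a synthesized clip has to fill, and the dubbing script pads or trims every clip to that slot. The sketch below computes a cue's duration from its timing line and decides which adjustment is needed. `cue_duration` and `plan_fit` are hypothetical helpers. The 0.02 s tolerance is taken from `pad_or_trim`, and the special case for clips of unknown duration is left out.

```python
import re
from typing import Tuple

# Matches "HH:MM:SS,mmm --> HH:MM:SS,mmm" (milliseconds assumed to be 3 digits).
CUE_RE = re.compile(r"(\d+):(\d+):(\d+),(\d+)\s*-->\s*(\d+):(\d+):(\d+),(\d+)")


def cue_duration(line: str) -> float:
    """Return the duration in seconds of an SRT timing line."""
    m = CUE_RE.search(line)
    if m is None:
        raise ValueError(f"not an SRT timing line: {line!r}")
    h1, m1, s1, ms1, h2, m2, s2, ms2 = (int(g) for g in m.groups())
    start_ms = ((h1 * 60 + m1) * 60 + s1) * 1000 + ms1
    end_ms = ((h2 * 60 + m2) * 60 + s2) * 1000 + ms2
    return (end_ms - start_ms) / 1000.0


def plan_fit(clip_seconds: float, target_seconds: float,
             tolerance: float = 0.02) -> Tuple[str, float]:
    """Decide how a synthesized clip is fitted to its cue.

    Returns (action, seconds): ("copy", 0.0), ("trim", excess) or ("pad", missing).
    """
    delta = target_seconds - clip_seconds
    if abs(delta) < tolerance:
        return ("copy", 0.0)
    if delta < 0:
        return ("trim", -delta)
    return ("pad", delta)


if __name__ == "__main__":
    slot = cue_duration("00:00:06,960 --> 00:00:14,480")
    print(slot, plan_fit(5.0, slot))
```

Trimming cuts speech off abruptly when the translated sentence runs longer than the original, so a mild time-stretch may be preferable in that branch.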
--- a/whisper_project/dailyrutines.kokoro.dub.wav
+++ b/whisper_project/dailyrutines.kokoro.dub.wav
--- a/whisper_project/dub_and_burn.py
+++ b/whisper_project/dub_and_burn.py
@ -0,0 +1,484 @@
+#!/usr/bin/env python3
+"""
+dub_and_burn.py
+
+Automated flow:
+- Extract audio from the video
+- Transcribe and translate with Whisper (using the process_video helpers)
+- Synthesize each segment with Kokoro (/api/v1/audio/speech) using voice=em_alex
+- Fit each chunk to the segment duration (pad/trim)
+- Concatenate the chunks and replace the audio track in the video
+- Generate the translated SRT and burn it into the final video
+
+Requirements:
+- ffmpeg / ffprobe on PATH
+- The project's Python venv with requests and srt installed (the venv was already created)
+
+Example usage (pass the API key from the environment; never commit a real key):
+  python3 dub_and_burn.py --video input.mp4 --out out_dubbed.mp4 \
+    --kokoro-endpoint "https://kokoro.bfzqqk.easypanel.host/api/v1/audio/speech" \
+    --api-key "$KOKORO_API_KEY" \
+    --voice em_alex --model model_q8f16
+
+"""
+
+import argparse
+import json
+import os
+import shlex
+import shutil
+import subprocess
+import sys
+import tempfile
+from pathlib import Path
+from typing import List, Dict
+
+import requests
+import srt
+
+# Import translation/transcription helpers from process_video
+from whisper_project.process_video import (
+    extract_audio,
+    transcribe_and_translate_faster,
+    transcribe_and_translate_openai,
+    burn_subtitles,
+)
+
+# Use write_srt from transcribe module if available
+from whisper_project.transcribe import write_srt
+
+
+def ensure_ffmpeg():
+    if shutil.which("ffmpeg") is None or shutil.which("ffprobe") is None:
+        print("ffmpeg/ffprobe no encontrados en PATH. Instálalos.")
+        sys.exit(1)
+
+
+def get_duration(path: str) -> float:
+    cmd = [
+        "ffprobe",
+        "-v",
+        "error",
+        "-show_entries",
+        "format=duration",
+        "-of",
+        "default=noprint_wrappers=1:nokey=1",
+        path,
+    ]
+    p = subprocess.run(cmd, capture_output=True, text=True)
+    if p.returncode != 0:
+        return 0.0
+    try:
+        return float(p.stdout.strip())
+    except Exception:
+        return 0.0
+
+
+def pad_or_trim(in_path: str, out_path: str, target_duration: float, sr: int = 22050):
+    cur = get_duration(in_path)
+    if cur == 0.0:
+        # copy as-is
+        shutil.copy(in_path, out_path)
+        return True
+    if abs(cur - target_duration) < 0.02:
+        # nearly equal: keep as-is
+        shutil.copy(in_path, out_path)
+        return True
+    if cur > target_duration:
+        # trim
+        cmd = ["ffmpeg", "-y", "-i", in_path, "-t", f"{target_duration}", out_path]
+        subprocess.run(cmd, check=True)
+        return True
+    else:
+        # pad: create silence for the missing duration and concatenate
+        pad = target_duration - cur
+        with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as sil:
+            sil_path = sil.name
+        try:
+            cmd1 = [
+                "ffmpeg",
+                "-y",
+                "-f",
+                "lavfi",
+                "-i",
+                f"anullsrc=channel_layout=mono:sample_rate={sr}",
+                "-t",
+                f"{pad}",
+                "-c:a",
+                "pcm_s16le",
+                sil_path,
+            ]
+            subprocess.run(cmd1, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
+
+            # concat in_path + sil_path
+            with tempfile.NamedTemporaryFile(mode="w", delete=False, suffix=".txt") as listf:
+                listf.write(f"file '{os.path.abspath(in_path)}'\n")
+                listf.write(f"file '{os.path.abspath(sil_path)}'\n")
+                listname = listf.name
+            cmd2 = ["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", listname, "-c", "copy", out_path]
+            subprocess.run(cmd2, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
+        finally:
+            try:
+                os.remove(sil_path)
+            except Exception:
+                pass
+            try:
+                os.remove(listname)
+            except Exception:
+                pass
+        return True
+
+
+def synthesize_segment_kokoro(endpoint: str, api_key: str, model: str, voice: str, text: str) -> bytes:
+    headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json", "Accept": "*/*"}
+    payload = {"model": model, "voice": voice, "input": text, "response_format": "wav"}
+    r = requests.post(endpoint, json=payload, headers=headers, timeout=120)
+    r.raise_for_status()
+    # the response body is audio
+    ctype = r.headers.get("Content-Type", "")
+    if ctype.startswith("audio/"):
+        return r.content
+    # otherwise try a JSON body carrying base64 audio
+    try:
+        j = r.json()
+        for k in ("audio", "wav", "data", "base64"):
+            if k in j:
+                import base64
+
+                return base64.b64decode(j[k])
+    except Exception:
+        pass
+    # fallback
+    return r.content
+
+
+def translate_with_gemini(text: str, target_lang: str, api_key: str, model: str = "gemini-2.5-flash") -> str:
+    """Translate text into the target language using the Gemini HTTP API.
+
+    Notes:
+    - Keys that look like Google API keys (prefix 'AIza') use the Google
+      Generative Language REST API, with the key passed as a query parameter.
+    - Any other key is sent as a Bearer token to an OpenAI-like Responses
+      endpoint, configurable through the GEMINI_ENDPOINT env var.
+    - `model` defaults to 'gemini-2.5-flash'.
+    - On any failure the original text is returned unchanged.
+    """
+    try:
+        if api_key and api_key.startswith("AIza"):
+            # generateContent lives under v1beta and takes `contents` / `generationConfig`
+            gl_endpoint = (
+                f"https://generativelanguage.googleapis.com/v1beta/models/{model}:generateContent?key={api_key}"
+            )
+            body = {
+                "contents": [
+                    {"parts": [{"text": f"Traduce al {target_lang} el siguiente texto, devuelve solo el texto traducido:\n\n{text}"}]}
+                ],
+                "generationConfig": {"maxOutputTokens": 1024, "temperature": 0.0, "candidateCount": 1},
+            }
+            r = requests.post(gl_endpoint, json=body, timeout=20)
+            r.raise_for_status()
+            j = r.json()
+            # expected shape: candidates[0].content.parts[*].text
+            if isinstance(j, dict):
+                candidates = j.get("candidates")
+                if isinstance(candidates, list) and candidates and isinstance(candidates[0], dict):
+                    content = candidates[0].get("content")
+                    if isinstance(content, str):
+                        return content.strip()
+                    # content is normally {"parts": [...]}; also accept a bare list of blocks
+                    blocks = content.get("parts", []) if isinstance(content, dict) else content
+                    if isinstance(blocks, list):
+                        parts = [c["text"] for c in blocks if isinstance(c, dict) and isinstance(c.get("text"), str)]
+                        if parts:
+                            return "\n".join(parts).strip()
+                # fallback: look for common top-level fields
+                for key in ("output_text", "text", "response", "translated_text"):
+                    if key in j and isinstance(j[key], str):
+                        return j[key].strip()
+            return text
+
+        # Not a Google API key: try an OpenAI-like API (Responses)
+        gemini_endpoint = os.environ.get("GEMINI_ENDPOINT", "https://api.openai.com/v1/responses")
+        headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}
+        prompt = (
+            f"Traduce el siguiente texto al idioma {target_lang}. Mantén solo el texto traducido, sin añadidos:\n\n{text}"
+        )
+        payload = {"model": model, "input": prompt, "max_output_tokens": 1024}
+        r = requests.post(gemini_endpoint, json=payload, headers=headers, timeout=20)
+        r.raise_for_status()
+        j = r.json()
+        if isinstance(j, dict):
+            if "output" in j and isinstance(j["output"], list):
+                for item in j["output"]:
+                    if isinstance(item, dict) and "content" in item:
+                        cont = item["content"]
+                        if isinstance(cont, list):
+                            texts = [c.get("text") for c in cont if isinstance(c, dict) and c.get("text")]
+                            if texts:
+                                return "\n".join(texts).strip()
+                        elif isinstance(cont, str):
+                            return cont.strip()
+            for key in ("output_text", "text", "response", "translated_text"):
+                if key in j and isinstance(j[key], str):
+                    return j[key].strip()
+        if isinstance(j, list) and j:
+            if isinstance(j[0], str):
+                return j[0]
+        if isinstance(j, str):
+            return j
+    except Exception as e:
+        print(f"Warning: Gemini translation failed: {e}")
+
+    return text
+
+
+def concat_chunks(chunk_files: List[str], out_path: str):
+    with tempfile.NamedTemporaryFile(mode="w", delete=False, suffix=".txt") as listf:
+        for c in chunk_files:
+            listf.write(f"file '{os.path.abspath(c)}'\n")
+        listname = listf.name
+    cmd = ["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", listname, "-c", "copy", out_path]
+    subprocess.run(cmd, check=True)
+    try:
+        os.remove(listname)
+    except Exception:
+        pass
+
+
+def replace_audio_in_video(video_path: str, audio_path: str, out_video: str):
+    cmd = [
+        "ffmpeg",
+        "-y",
+        "-i",
+        video_path,
+        "-i",
+        audio_path,
+        "-map",
+        "0:v:0",
+        "-map",
+        "1:a:0",
+        "-c:v",
+        "copy",
+        "-c:a",
+        "aac",
+        "-b:a",
+        "192k",
+        "-shortest",
+        out_video,
+    ]
+    subprocess.run(cmd, check=True)
+
+
+def normalize_segments(segments) -> List[Dict]:
+    out = []
+    for s in segments:
+        if isinstance(s, dict):
+            start = s.get("start")
+            end = s.get("end")
+            text = s.get("text", "")
+        else:
+            # faster-whisper Segment object
+            start = getattr(s, "start", None)
+            end = getattr(s, "end", None)
+            text = getattr(s, "text", "")
+        if start is None or end is None:
+            continue
+        out.append({"start": float(start), "end": float(end), "text": str(text).strip()})
+    return out
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Doblar vídeo usando Kokoro y quemar SRT traducido")
+    parser.add_argument("--video", "-v", required=True)
+    parser.add_argument("--out", "-o", default=None, help="Vídeo de salida final (con audio reemplazado y SRT quemado)")
+    parser.add_argument("--temp-dub", default=None, help="Archivo de audio temporal generado (si quieres conservarlo)")
+    parser.add_argument("--kokoro-endpoint", required=True, help="URL al endpoint /api/v1/audio/speech")
+    parser.add_argument("--api-key", required=True, help="Token para Authorization: Bearer <token>")
+    parser.add_argument("--model", default="model", help="Modelo Kokoro a usar (usa 'model' fp32 326MB)")
+    parser.add_argument("--voice", default="em_alex", help="Voice id a usar (em_alex)")
+    parser.add_argument(
+        "--whisper-backend",
+        choices=["faster-whisper", "openai-whisper"],
+        default="faster-whisper",
+    )
+    parser.add_argument("--whisper-model", default="base")
+
+    # Gemini options
+    parser.add_argument(
+        "--use-gemini",
+        action="store_true",
+        help="Usar Gemini (HTTP) para traducir segmentos en lugar de Whisper translate",
+    )
+    parser.add_argument("--gemini-api-key", default=None, help="API key para Gemini (Bearer)")
+    parser.add_argument(
+        "--gemini-model",
+        default="gemini-2.5-flash",
+        help="Modelo Gemini a usar (por defecto: gemini-2.5-flash)",
+    )
+
+    args = parser.parse_args()
+
+    ensure_ffmpeg()
+
+    video = Path(args.video)
+    if not video.exists():
+        print("Vídeo no encontrado", file=sys.stderr)
+        sys.exit(2)
+
+    out_video = args.out if args.out else str(video.with_name(video.stem + "_dubbed.mp4"))
+    tmpdir = tempfile.mkdtemp(prefix="dub_and_burn_")
+
+    try:
+        audio_wav = os.path.join(tmpdir, "extracted_audio.wav")
+        print("Extrayendo audio...")
+        extract_audio(str(video), audio_wav)
+
+        print("Transcribiendo (y traduciendo si no se usa Gemini) ...")
+
+        # With Gemini we transcribe only, then translate each segment with Gemini
+        if args.use_gemini:
+            # allow passing the key through the GEMINI_API_KEY environment variable
+            if not args.gemini_api_key:
+                args.gemini_api_key = os.environ.get("GEMINI_API_KEY")
+            if not args.gemini_api_key:
+                print("--use-gemini requiere --gemini-api-key o la var de entorno GEMINI_API_KEY", file=sys.stderr)
+                sys.exit(4)
+            # transcribe without translating
+            from faster_whisper import WhisperModel
+
+            wm = WhisperModel(args.whisper_model, device="cpu", compute_type="int8")
+            segments, info = wm.transcribe(audio_wav, beam_size=5, task="transcribe")
+        else:
+            if args.whisper_backend == "faster-whisper":
+                segments = transcribe_and_translate_faster(audio_wav, args.whisper_model, "es")
+            else:
+                segments = transcribe_and_translate_openai(audio_wav, args.whisper_model, "es")
+
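+        # faster-whisper returns a lazy generator (always truthy), so
+        # materialize the segments first and only then check for emptiness
+        segs = normalize_segments(segments) if segments is not None else []
+        if not segs:
+            print("No se obtuvieron segmentos; abortando", file=sys.stderr)
+            sys.exit(3)
+
+        # when using Gemini, translate each segment now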
+        if args.use_gemini:
+            print(f"Traduciendo {len(segs)} segmentos con Gemini (model={args.gemini_model})...")
+            for s in segs:
+                try:
+                    src = s.get("text", "")
+                    if src:
+                        tgt = translate_with_gemini(src, "es", args.gemini_api_key, model=args.gemini_model)
+                        s["text"] = tgt
+                except Exception as e:
+                    print(f"Warning: Gemini fallo en segmento: {e}")
+
+        # write the translated SRT
+        srt_out = os.path.join(tmpdir, "translated.srt")
+        write_srt(segs, srt_out)
+        print(f"SRT traducido guardado en: {srt_out}")
+
+        # synthesize segment by segment
+        chunk_files = []
+        print(f"Sintetizando {len(segs)} segmentos con Kokoro (voice={args.voice})...")
+        for i, s in enumerate(segs, start=1):
+            text = s.get("text", "")
+            if not text:
+                # generate silence lasting the segment duration
+                target_dur = s["end"] - s["start"]
+                silent = os.path.join(tmpdir, f"chunk_{i:04d}.wav")
+                cmd = [
+                    "ffmpeg",
+                    "-y",
+                    "-f",
+                    "lavfi",
+                    "-i",
+                    "anullsrc=channel_layout=mono:sample_rate=22050",
+                    "-t",
+                    f"{target_dur}",
+                    "-c:a",
+                    "pcm_s16le",
+                    silent,
+                ]
+                subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
+                chunk_files.append(silent)
+                print(f" - Segmento {i}: silencio {target_dur}s")
+                continue
+
+            try:
+                raw = synthesize_segment_kokoro(args.kokoro_endpoint, args.api_key, args.model, args.voice, text)
+            except Exception as e:
+                print(f"Error sintetizando segmento {i}: {e}")
+                # fallback: generate silence
+                target_dur = s["end"] - s["start"]
+                silent = os.path.join(tmpdir, f"chunk_{i:04d}.wav")
+                cmd = [
+                    "ffmpeg",
+                    "-y",
+                    "-f",
+                    "lavfi",
+                    "-i",
+                    "anullsrc=channel_layout=mono:sample_rate=22050",
+                    "-t",
+                    f"{target_dur}",
+                    "-c:a",
+                    "pcm_s16le",
+                    silent,
+                ]
+                subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
+                chunk_files.append(silent)
+                continue
+
+            # save the raw bytes to a temp file
+            tmp_chunk = os.path.join(tmpdir, f"raw_chunk_{i:04d}.bin")
+            with open(tmp_chunk, "wb") as f:
+                f.write(raw)
+
+            # convert to a standard WAV (22050 Hz, mono)
+            tmp_wav = os.path.join(tmpdir, f"tmp_chunk_{i:04d}.wav")
+            cmdc = ["ffmpeg", "-y", "-i", tmp_chunk, "-ar", "22050", "-ac", "1", "-sample_fmt", "s16", tmp_wav]
+            subprocess.run(cmdc, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
+
+            # fit to the segment's duration
+            target_dur = s["end"] - s["start"]
+            final_chunk = os.path.join(tmpdir, f"chunk_{i:04d}.wav")
+            pad_or_trim(tmp_wav, final_chunk, target_dur, sr=22050)
+            chunk_files.append(final_chunk)
+            print(f" - Segmento {i}/{len(segs)} -> {os.path.basename(final_chunk)}")
+
+        # concatenate the chunks
+        dub_wav = args.temp_dub if args.temp_dub else os.path.join(tmpdir, "dub_final.wav")
+        print("Concatenando chunks...")
+        concat_chunks(chunk_files, dub_wav)
+        print(f"Archivo dub generado en: {dub_wav}")
+
+        # replace the audio track in the video
+        replaced = os.path.join(tmpdir, "video_replaced.mp4")
+        print("Reemplazando pista de audio en el vídeo...")
+        replace_audio_in_video(str(video), dub_wav, replaced)
+
+        # burn in the translated SRT
+        print("Quemando SRT traducido en el vídeo...")
+        burn_subtitles(replaced, srt_out, out_video)
+
+        print(f"Vídeo final generado: {out_video}")
+
+    finally:
+        try:
+            shutil.rmtree(tmpdir)
+        except Exception:
+            pass
+
+
+if __name__ == '__main__':
+    main()
--- a/whisper_project/dub_female_clone_es.wav
+++ b/whisper_project/dub_female_clone_es.wav
--- a/whisper_project/dub_male_clone_ptbr.wav
+++ b/whisper_project/dub_male_clone_ptbr.wav
--- a/whisper_project/dub_male_style.wav
+++ b/whisper_project/dub_male_style.wav
--- a/whisper_project/dub_male_style_out.wav
+++ b/whisper_project/dub_male_style_out.wav
--- a/whisper_project/process_video.py
+++ b/whisper_project/process_video.py
@ -0,0 +1,179 @@
+#!/usr/bin/env python3
+"""Video processing: extract audio, transcribe/translate, and
+burn in subtitles.
+
+Flow:
+- Extract audio with ffmpeg (16 kHz mono WAV)
+- Transcribe with faster-whisper or openai-whisper
+    (option task='translate')
+- Write an SRT and burn it into the video with ffmpeg
+
+Note: requires ffmpeg to be installed and, for the models,
+faster-whisper or openai-whisper.
+"""
+import argparse
+import subprocess
+import tempfile
+from pathlib import Path
+import sys
+
+from transcribe import write_srt
+
+
+def extract_audio(video_path: str, out_audio: str):
+    cmd = [
+        "ffmpeg",
+        "-y",
+        "-i",
+        video_path,
+        "-vn",
+        "-acodec",
+        "pcm_s16le",
+        "-ar",
+        "16000",
+        "-ac",
+        "1",
+        out_audio,
+    ]
+    subprocess.run(cmd, check=True)
+
+
+def burn_subtitles(video_path: str, srt_path: str, out_video: str):
+    # Use ffmpeg's subtitles filter (srt_path is passed unescaped)
+    cmd = [
+        "ffmpeg",
+        "-y",
+        "-i",
+        video_path,
+        "-vf",
+        f"subtitles={srt_path}",
+        "-c:a",
+        "copy",
+        out_video,
+    ]
+    subprocess.run(cmd, check=True)
+
+
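+def transcribe_and_translate_faster(audio_path: str, model: str, target: str):
+    from faster_whisper import WhisperModel
+
+    wm = WhisperModel(model, device="cpu", compute_type="int8")
+    # NOTE: Whisper's "translate" task always produces English text, and
+    # `language` names the language spoken in the audio, not the output.
+    segments, info = wm.transcribe(
+        audio_path, beam_size=5, task="translate", language=target
+    )
+    # transcribe() returns a lazy generator, which is always truthy:
+    # materialize it so the caller's emptiness check works
+    return list(segments)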
+
+
+def transcribe_and_translate_openai(audio_path: str, model: str, target: str):
+    import whisper
+
+    m = whisper.load_model(model, device="cpu")
+    result = m.transcribe(
+        audio_path, fp16=False, task="translate", language=target
+    )
+    return result.get("segments", None)
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description=(
+            "Extraer, transcribir/traducir y quemar subtítulos en vídeo"
+            " (offline)"
+        )
+    )
+    parser.add_argument(
+        "--video", "-v", required=True, help="Ruta del archivo de vídeo"
+    )
+    parser.add_argument(
+        "--backend",
+        "-b",
+        choices=["faster-whisper", "openai-whisper"],
+        default="faster-whisper",
+    )
+    parser.add_argument(
+        "--model",
+        "-m",
+        default="base",
+        help="Modelo de whisper a usar (tiny, base, etc.)",
+    )
+    parser.add_argument(
+        "--to", "-t", default="es", help="Idioma de destino para traducción"
+    )
+    parser.add_argument(
+        "--out",
+        "-o",
+        default=None,
+        help=(
+            "Ruta del vídeo de salida (si no se especifica,"
+            " se usa input_burned.mp4)"
+        ),
+    )
+    parser.add_argument(
+        "--srt",
+        default=None,
+        help=(
+            "Ruta SRT a escribir (si no se especifica,"
+            " se usa input.srt)"
+        ),
+    )
+
+    args = parser.parse_args()
+
+    video = Path(args.video)
+    if not video.exists():
+        print("Vídeo no encontrado", file=sys.stderr)
+        sys.exit(2)
+
+    out_video = (
+        args.out
+        if args.out
+        else str(video.with_name(video.stem + "_burned.mp4"))
+    )
+    srt_path = args.srt if args.srt else str(video.with_suffix('.srt'))
+
+    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
+        audio_path = tmp.name
+
+    try:
+        print("Extrayendo audio con ffmpeg...")
+        extract_audio(str(video), audio_path)
+
+        print(
+            f"Transcribiendo y traduciendo a '{args.to}'"
+            f" usando {args.backend}..."
+        )
+        if args.backend == "faster-whisper":
+            segments = transcribe_and_translate_faster(
+                audio_path, args.model, args.to
+            )
+        else:
+            segments = transcribe_and_translate_openai(
+                audio_path, args.model, args.to
+            )
+
+        if not segments:
+            print(
+                "No se obtuvieron segmentos de la transcripción",
+                file=sys.stderr,
+            )
+            sys.exit(3)
+
+        print(f"Escribiendo SRT en {srt_path}...")
+        write_srt(segments, srt_path)
+
+        print(
+            f"Quemando subtítulos en el vídeo -> {out_video}"
+            f" (esto puede tardar)..."
+        )
+        burn_subtitles(str(video), srt_path, out_video)
+
+        print("Proceso completado.")
+    finally:
+        try:
+            Path(audio_path).unlink()
+        except Exception:
+            pass
+
+
+if __name__ == "__main__":
+    main()
--- a/whisper_project/ref_female_es.wav
+++ b/whisper_project/ref_female_es.wav
--- a/whisper_project/requirements.txt
+++ b/whisper_project/requirements.txt
@ -0,0 +1,12 @@
+# Basic dependencies for running Whisper on CPU
+torch>=1.12.0
+ffmpeg-python
+numpy
+# Optional backends (comment/uncomment as needed)
+openai-whisper
+transformers
+faster-whisper
+# TTS (optional); note that the SRT scripts also import requests and srt
+TTS
+pyttsx3
+huggingface-hub
--- a/whisper_project/run_xtts_clone.py
+++ b/whisper_project/run_xtts_clone.py
@ -0,0 +1,17 @@
+import os, traceback
+from TTS.api import TTS
+
+out='whisper_project/dub_female_xtts_es.wav'
+speaker='whisper_project/ref_female_es.wav'
+text='Hola, esta es una prueba de clonación usando xtts_v2 en español latino.'
+model='tts_models/multilingual/multi-dataset/xtts_v2'
+
+try:
+    print('Cargando modelo:', model)
+    tts = TTS(model_name=model, progress_bar=True, gpu=False)
+    print('Llamando a tts_to_file con speaker_wav=', speaker)
+    tts.tts_to_file(text=text, file_path=out, speaker_wav=speaker, language='es')
+    print('Generado:', out, 'size=', os.path.getsize(out))
+except Exception:
+    print('Error durante la clonación:')
+    traceback.print_exc()
--- a/whisper_project/srt_to_kokoro.py
+++ b/whisper_project/srt_to_kokoro.py
@ -0,0 +1,492 @@
+#!/usr/bin/env python3
+"""
+srt_to_kokoro.py
+
+Read an .srt file and synthesize each subtitle using an OpenAPI-compatible API (e.g. Kokoro).
+- Tries to autodetect a synthesis endpoint from `--openapi` (JSON URL) by looking for paths that contain 'synth'|'tts'|'text' and accept POST.
+- Alternatively, uses `--endpoint` and a `--payload-template` with {text} as the placeholder.
+- Saves temporary chunks and concatenates them with ffmpeg into a single output WAV.
+
+Dependencies: requests, srt (pip install requests srt)
+Requires ffmpeg on PATH.
+
+Examples:
+  python srt_to_kokoro.py --srt subs.srt --openapi "https://kokoro.../openapi.json" --voice "alloy" --out out.wav --api-key "TOKEN"
+  python srt_to_kokoro.py --srt subs.srt --endpoint "https://kokoro.../v1/synthesize" --payload-template '{"text": "{text}", "voice": "alloy"}' --out out.wav
+
+"""
+
+import argparse
+import json
+import os
+import re
+import shutil
+import subprocess
+import sys
+import tempfile
+from typing import Optional
+
+try:
+    import requests
+except ImportError:
+    print("Este script requiere la librería 'requests'. Instálala con: pip install requests")
+    raise
+
+try:
+    import srt
+except ImportError:
+    print("Este script requiere la librería 'srt'. Instálala con: pip install srt")
+    raise
+
+
+def find_synthesis_endpoint(openapi_url: str) -> Optional[str]:
+    """Heuristic: fetch openapi.json and look for paths containing 'synth'|'tts'|'text' that support POST."""
+    try:
+        r = requests.get(openapi_url, timeout=20)
+        r.raise_for_status()
+        spec = r.json()
+    except Exception as e:
+        print(f"No pude leer openapi.json desde {openapi_url}: {e}")
+        return None
+
+    paths = spec.get("paths", {})
+    candidate = None
+    for path, methods in paths.items():
+        lname = path.lower()
+        if any(k in lname for k in ("synth", "tts", "text", "synthesize")):
+            for method, op in methods.items():
+                if method.lower() == "post":
+                    # candidate found
+                    candidate = path
+                    break
+        if candidate:
+            break
+
+    if not candidate:
+        # fallback: scan operationId or summary
+        for path, methods in paths.items():
+            for method, op in methods.items():
+                meta = json.dumps(op).lower()
+                if any(k in meta for k in ("synth", "tts", "text", "synthesize")) and method.lower() == "post":
+                    candidate = path
+                    break
+            if candidate:
+                break
+
+    if not candidate:
+        return None
+
+    # Construir base url desde openapi_url
+    from urllib.parse import urlparse, urljoin
+    p = urlparse(openapi_url)
+    base = f"{p.scheme}://{p.netloc}"
+    return urljoin(base, candidate)
+
+
+def parse_srt_file(path: str):
+    with open(path, "r", encoding="utf-8") as f:
+        raw = f.read()
+    subs = list(srt.parse(raw))
+    return subs
+
+
+def synth_chunk(endpoint: str, text: str, headers: dict, payload_template: Optional[str], timeout=60):
+    """Sends the request and returns audio bytes. Handles audio/* responses or JSON with a base64 field."""
+    # Build the payload
+    if payload_template:
+        # JSON-escape the text so quotes or newlines in a subtitle cannot break the template
+        body = payload_template.replace("{text}", json.dumps(text)[1:-1])
+        try:
+            json_body = json.loads(body)
+        except Exception:
+            # the template is not valid JSON: fall back to a minimal payload
+            json_body = {"text": text}
+    else:
+        json_body = {"text": text}
+
+    # Send the POST
+    r = requests.post(endpoint, json=json_body, headers=headers, timeout=timeout)
+    r.raise_for_status()
+
+    ctype = r.headers.get("Content-Type", "")
+    if ctype.startswith("audio/"):
+        return r.content
+    # If the response is JSON carrying base64
+    try:
+        j = r.json()
+        # look for fields named 'audio', 'wav', 'data' or 'base64'
+        for k in ("audio", "wav", "data", "base64"):
+            if k in j:
+                val = j[k]
+                # assume it is base64
+                import base64
+                try:
+                    return base64.b64decode(val)
+                except Exception:
+                    # maybe it is already bytes, hex, or something else
+                    pass
+    except Exception:
+        pass
+
+    # Fallback: return the raw bytes
+    return r.content
+
+
+def ensure_ffmpeg():
+    if shutil.which("ffmpeg") is None:
+        print("ffmpeg no está disponible en PATH. Instálalo para poder concatenar/convertir audios.")
+        sys.exit(1)
+
+
+def convert_and_save(raw_bytes: bytes, target_path: str):
+    """Saves bytes to a temporary file and converts them to 22050 Hz mono 16-bit PCM WAV using ffmpeg."""
+    with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
+        tmp.write(raw_bytes)
+        tmp.flush()
+        tmp_path = tmp.name
+
+    # Convert with ffmpeg to 22050 Hz mono 16-bit WAV
+    cmd = [
+        "ffmpeg", "-y", "-i", tmp_path,
+        "-ar", "22050", "-ac", "1", "-sample_fmt", "s16", target_path
+    ]
+    try:
+        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
+    except subprocess.CalledProcessError as e:
+        print(f"ffmpeg falló al convertir chunk: {e}")
+        # as a fallback, write the raw bytes unchanged
+        with open(target_path, "wb") as out:
+            out.write(raw_bytes)
+    finally:
+        try:
+            os.remove(tmp_path)
+        except Exception:
+            pass
+
+
+def create_silence(duration: float, out_path: str, sr: int = 22050):
+    """Create a silent wav of given duration (seconds) at sr and save to out_path."""
+    # use ffmpeg anullsrc
+    cmd = [
+        "ffmpeg",
+        "-y",
+        "-f",
+        "lavfi",
+        "-i",
+        f"anullsrc=channel_layout=mono:sample_rate={sr}",
+        "-t",
+        f"{duration}",
+        "-c:a",
+        "pcm_s16le",
+        out_path,
+    ]
+    try:
+        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
+    except subprocess.CalledProcessError:
+        # fallback: write tiny silence by creating zero bytes
+        try:
+            with open(out_path, "wb") as fh:
+                fh.write(b"\x00" * 1024)
+        except Exception:
+            pass
+
+
+def pad_or_trim_wav(in_path: str, out_path: str, target_duration: float, sr: int = 22050):
+    """Pad with silence or trim input wav to match target_duration (seconds)."""
+    # get duration
+    try:
+        p = subprocess.run([
+            "ffprobe",
+            "-v",
+            "error",
+            "-show_entries",
+            "format=duration",
+            "-of",
+            "default=noprint_wrappers=1:nokey=1",
+            in_path,
+        ], capture_output=True, text=True)
+        cur = float(p.stdout.strip())
+    except Exception:
+        cur = 0.0
+
+    if cur == 0.0:
+        shutil.copy(in_path, out_path)
+        return
+
+    if abs(cur - target_duration) < 0.02:
+        shutil.copy(in_path, out_path)
+        return
+
+    if cur > target_duration:
+        cmd = ["ffmpeg", "-y", "-i", in_path, "-t", f"{target_duration}", out_path]
+        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
+        return
+
+    # pad: create silence of missing duration and concat
+    pad = target_duration - cur
+    with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as sil:
+        sil_path = sil.name
+    try:
+        create_silence(pad, sil_path, sr=sr)
+        # concat in_path + sil_path
+        with tempfile.NamedTemporaryFile(mode="w", delete=False, suffix=".txt") as listf:
+            listf.write(f"file '{os.path.abspath(in_path)}'\n")
+            listf.write(f"file '{os.path.abspath(sil_path)}'\n")
+            listname = listf.name
+        cmd2 = ["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", listname, "-c", "copy", out_path]
+        subprocess.run(cmd2, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
+    finally:
+        try:
+            os.remove(sil_path)
+        except Exception:
+            pass
+        try:
+            os.remove(listname)
+        except Exception:
+            pass
+
+
+def concat_chunks(chunks: list, out_path: str):
+    # Build the file list for ffmpeg's concat demuxer
+    ensure_ffmpeg()
+    with tempfile.NamedTemporaryFile(mode="w", delete=False, suffix=".txt") as listf:
+        for c in chunks:
+            listf.write(f"file '{os.path.abspath(c)}'\n")
+        listname = listf.name
+
+    cmd = ["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", listname, "-c", "copy", out_path]
+    try:
+        subprocess.run(cmd, check=True)
+    except subprocess.CalledProcessError:
+        # fallback: concatenate again, re-encoding to uniform PCM instead of stream copy
+        tmp_concat = out_path + ".tmp.wav"
+        cmd2 = ["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", listname, "-c:a", "pcm_s16le", "-ar", "22050", "-ac", "1", tmp_concat]
+        subprocess.run(cmd2, check=True)
+        shutil.move(tmp_concat, out_path)
+    finally:
+        try:
+            os.remove(listname)
+        except Exception:
+            pass
+
+
+def main():
+    p = argparse.ArgumentParser()
+    p.add_argument("--srt", required=True, help="Ruta al archivo .srt traducido")
+    p.add_argument("--openapi", required=False, help="URL al openapi.json de Kokoro (intenta autodetectar endpoint)")
+    p.add_argument("--endpoint", required=False, help="URL directa del endpoint de síntesis (usa esto si autodetección falla)")
+    p.add_argument(
+        "--payload-template",
+        required=False,
+        help='Plantilla JSON para el payload con {text} como placeholder, ejemplo: "{\"text\": \"{text}\", \"voice\": \"alloy\"}"',
+    )
+    p.add_argument("--api-key", required=False, help="Valor para autorización (se envía como header Authorization: Bearer <key>)")
+    p.add_argument("--voice", required=False, help="Nombre de voz si aplica (se añade al payload si se usa template)")
+    p.add_argument("--out", required=True, help="Ruta de salida WAV final")
+    p.add_argument(
+        "--video",
+        required=False,
+        help="Ruta al vídeo original (necesario si quieres mezclar el audio con la pista original).",
+    )
+    p.add_argument(
+        "--mix-with-original",
+        action="store_true",
+        help="Mezclar el WAV generado con la pista de audio original del vídeo (usa --video).",
+    )
+    p.add_argument(
+        "--mix-background-volume",
+        type=float,
+        default=0.2,
+        help="Volumen de la pista original al mezclar (0.0-1.0), por defecto 0.2",
+    )
+    p.add_argument(
+        "--replace-original",
+        action="store_true",
+        help="Reemplazar la pista de audio del vídeo original por el WAV generado (usa --video).",
+    )
+    p.add_argument(
+        "--align",
+        action="store_true",
+        help="Generar silencios para alinear segmentos con los timestamps del SRT (inserta gaps entre segmentos).",
+    )
+    p.add_argument(
+        "--keep-chunks",
+        action="store_true",
+        help="Conservar los WAV de cada segmento en el directorio temporal (útil para debugging).",
+    )
+    args = p.parse_args()
+
+    # --voice was parsed but never used: fold it into the payload template
+    if args.voice:
+        try:
+            tpl = json.loads(args.payload_template) if args.payload_template else {"text": "{text}"}
+            if isinstance(tpl, dict):
+                tpl.setdefault("voice", args.voice)
+                args.payload_template = json.dumps(tpl)
+        except ValueError:
+            pass  # the template is not valid JSON on its own; leave it untouched
+
+    headers = {"Accept": "*/*"}
+    if args.api_key:
+        headers["Authorization"] = f"Bearer {args.api_key}"
+
+    endpoint = args.endpoint
+    if not endpoint and args.openapi:
+        print("Intentando detectar endpoint desde openapi.json...")
+        endpoint = find_synthesis_endpoint(args.openapi)
+        if endpoint:
+            print(f"Usando endpoint detectado: {endpoint}")
+        else:
+            print("No se detectó endpoint automáticamente. Pasa --endpoint o --payload-template.")
+            sys.exit(1)
+
+    if not endpoint:
+        print("Debes proporcionar --endpoint o --openapi para que el script funcione.")
+        sys.exit(1)
+
+    subs = parse_srt_file(args.srt)
+    tmpdir = tempfile.mkdtemp(prefix="srt_kokoro_")
+    chunk_files = []
+
+    print(f"Sintetizando {len(subs)} segmentos...")
+    prev_end = 0.0
+    for i, sub in enumerate(subs, start=1):
+        text = re.sub(r"\s+", " ", sub.content.strip())
+        if not text:
+            # leave prev_end untouched so that --align fills this span with silence
+            continue
+
+        start_sec = sub.start.total_seconds()
+        end_sec = sub.end.total_seconds()
+        duration = end_sec - start_sec
+
+        # if align requested, insert silence for gap between previous end and current start
+        if args.align:
+            gap = start_sec - prev_end
+            if gap > 0.01:
+                sil_target = os.path.join(tmpdir, f"sil_{i:04d}.wav")
+                create_silence(gap, sil_target)
+                chunk_files.append(sil_target)
+
+        try:
+            raw = synth_chunk(endpoint, text, headers, args.payload_template)
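+        except Exception as e:
+            print(f"Error al sintetizar segmento {i}: {e}")
+            # nothing was appended for this segment: keep prev_end at its start
+            # so that --align covers the missing span with silence
+            prev_end = max(prev_end, start_sec)
+            continue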
+
+        target = os.path.join(tmpdir, f"chunk_{i:04d}.wav")
+        convert_and_save(raw, target)
+
+        # If align: pad or trim to subtitle duration, otherwise keep raw chunk
+        if args.align:
+            aligned = os.path.join(tmpdir, f"chunk_{i:04d}.aligned.wav")
+            pad_or_trim_wav(target, aligned, duration)
+            # replace target with aligned file in list
+            chunk_files.append(aligned)
+            # remove original raw chunk unless keep-chunks
+            if not args.keep_chunks:
+                try:
+                    os.remove(target)
+                except Exception:
+                    pass
+        else:
+            chunk_files.append(target)
+
+        prev_end = end_sec
+        print(f" - Segmento {i}/{len(subs)} -> {os.path.basename(chunk_files[-1])}")
+
+    if not chunk_files:
+        print("No se generaron fragmentos de audio. Abortando.")
+        shutil.rmtree(tmpdir, ignore_errors=True)
+        sys.exit(1)
+
+    print("Concatenando fragmentos...")
+    concat_chunks(chunk_files, args.out)
+    print(f"Archivo final generado en: {args.out}")
+
+    # If the user asked to mix with the video's original audio track
+    if args.mix_with_original:
+        if not args.video:
+            print("--mix-with-original requiere que pases --video con la ruta del vídeo original.")
+        else:
+            # extract the original video's audio to a temporary WAV (mono, 22050 Hz)
+            orig_tmp = os.path.join(tempfile.gettempdir(), f"orig_audio_{os.getpid()}.wav")
+            mixed_tmp = os.path.join(tempfile.gettempdir(), f"mixed_audio_{os.getpid()}.wav")
+            try:
+                cmd_ext = [
+                    "ffmpeg",
+                    "-y",
+                    "-i",
+                    args.video,
+                    "-vn",
+                    "-ar",
+                    "22050",
+                    "-ac",
+                    "1",
+                    "-sample_fmt",
+                    "s16",
+                    orig_tmp,
+                ]
+                subprocess.run(cmd_ext, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
+
+                # Mix: new audio (args.out) in the foreground, original at reduced volume
+                vol = float(args.mix_background_volume)
+                # build the filter once: scale the original track, then mix both inputs
+                filter_complex = (
+                    f"[0:a]volume=1[a1];[1:a]volume={vol}[a0];"
+                    "[a1][a0]amix=inputs=2:duration=first:dropout_transition=0[mix]"
+                )
+                # use ffmpeg to mix and write mixed_tmp
+                cmd_mix = [
+                    "ffmpeg",
+                    "-y",
+                    "-i",
+                    args.out,
+                    "-i",
+                    orig_tmp,
+                    "-filter_complex",
+                    filter_complex,
+                    "-map",
+                    "[mix]",
+                    "-c:a",
+                    "pcm_s16le",
+                    mixed_tmp,
+                ]
+                subprocess.run(cmd_mix, check=True)
+
+                # replace args.out with mixed_tmp
+                shutil.move(mixed_tmp, args.out)
+                print(f"Archivo mezclado generado en: {args.out}")
+            except subprocess.CalledProcessError as e:
+                print(f"Error al mezclar audio con la pista original: {e}")
+            finally:
+                try:
+                    if os.path.exists(orig_tmp):
+                        os.remove(orig_tmp)
+                except Exception:
+                    pass
+
+    # If replacing the original audio track in the video was requested
+    if args.replace_original:
+        if not args.video:
+            print("--replace-original requiere que pases --video con la ruta del vídeo original.")
+        else:
+            out_video = os.path.splitext(args.video)[0] + ".replaced_audio.mp4"
+            try:
+                cmd_rep = [
+                    "ffmpeg",
+                    "-y",
+                    "-i",
+                    args.video,
+                    "-i",
+                    args.out,
+                    "-map",
+                    "0:v:0",
+                    "-map",
+                    "1:a:0",
+                    "-c:v",
+                    "copy",
+                    "-c:a",
+                    "aac",
+                    "-b:a",
+                    "192k",
+                    out_video,
+                ]
+                subprocess.run(cmd_rep, check=True)
+                print(f"Vídeo con audio reemplazado generado: {out_video}")
+            except subprocess.CalledProcessError as e:
+                print(f"Error al reemplazar audio en el vídeo: {e}")
+
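+    # cleanup (keep the per-segment WAVs when --keep-chunks is set)
+    if args.keep_chunks:
+        print(f"Chunks conservados en: {tmpdir}")
+    else:
+        shutil.rmtree(tmpdir, ignore_errors=True)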
+
+
+if __name__ == '__main__':
+    main()
--- a/whisper_project/transcribe.py
+++ b/whisper_project/transcribe.py
--- a/whisper_project/translate_srt_with_gemini.py
+++ b/whisper_project/translate_srt_with_gemini.py
@ -0,0 +1,139 @@
+#!/usr/bin/env python3
+"""translate_srt_with_gemini.py
+Reads an .srt, translates each text block with Gemini (Google Generative API) and
+writes a new .srt that keeps the indices and timestamps.
+
+Usage:
+  export GEMINI_API_KEY="..."
+  .venv/bin/python whisper_project/translate_srt_with_gemini.py \
+    --in whisper_project/dailyrutines.kokoro.dub.srt \
+    --out whisper_project/dailyrutines.kokoro.dub.es.srt \
+    --model gemini-2.5-flash
+
+If you do not pass --gemini-api-key, the GEMINI_API_KEY environment variable is used.
+"""
+import argparse
+import json
+import os
+import time
+from typing import List
+
+import requests
+import srt
+# Try to use the official library if it is installed (better compatibility)
+try:
+    import google.generativeai as genai  # type: ignore
+except Exception:
+    genai = None
+
+
+def translate_text_google_gl(text: str, api_key: str, model: str = "gemini-2.5-flash") -> str:
+    """Call Google's Generative Language API (generateContent).
+    Returns the translated text (or the original text on failure).
+    """
+    if not api_key:
+        raise ValueError("gemini api key required")
+    # If the official library is available, use it (it handles the endpoints internally)
+    if genai is not None:
+        try:
+            genai.configure(api_key=api_key)
+            model_obj = genai.GenerativeModel(model)
+            # the library accepts a plain prompt or a list; explicitly ask for the translated text only
+            prompt = f"Traduce al español el siguiente texto y devuelve solo el texto traducido:\n\n{text}"
+            resp = model_obj.generate_content(prompt, generation_config={"max_output_tokens": 1024, "temperature": 0.0})
+            # resp.text is available on the response wrapper
+            if hasattr(resp, "text") and resp.text:
+                return resp.text.strip()
+            # fallback: revisar candidates
+            if hasattr(resp, "candidates") and resp.candidates:
+                c = resp.candidates[0]
+                if hasattr(c, "content") and hasattr(c.content, "parts"):
+                    parts = [p.text for p in c.content.parts if getattr(p, "text", None)]
+                    if parts:
+                        return "\n".join(parts).strip()
+        except Exception as e:
+            print(f"Warning: genai library translate failed: {e}")
+
+    # HTTP fallback: call the REST generateContent endpoint directly (tries v1, then v1beta).
+    for prefix in ("v1", "v1beta"):
+        endpoint = (
+            f"https://generativelanguage.googleapis.com/{prefix}/models/{model}:generateContent?key={api_key}"
+        )
+        # generateContent expects `contents` with `parts`; tuning options go in `generationConfig`
+        body = {
+            "contents": [{"parts": [{"text": f"Traduce al español el siguiente texto y devuelve solo el texto traducido:\n\n{text}"}]}],
+            "generationConfig": {"maxOutputTokens": 1024, "temperature": 0.0, "candidateCount": 1},
+        }
+        try:
+            r = requests.post(endpoint, json=body, timeout=30)
+            r.raise_for_status()
+            j = r.json()
+            # look for candidates
+            if isinstance(j, dict) and "candidates" in j and isinstance(j["candidates"], list) and j["candidates"]:
+                first = j["candidates"][0]
+                if isinstance(first, dict):
+                    content = first.get("content")
+                    if isinstance(content, str):
+                        return content.strip()
+                    if isinstance(first.get("output"), str):
+                        return first["output"].strip()
+                    # generateContent returns content as {"parts": [{"text": ...}]}
+                    if isinstance(content, dict):
+                        content = content.get("parts", [])
+                    if isinstance(content, list):
+                        parts = [c["text"] for c in content if isinstance(c, dict) and isinstance(c.get("text"), str)]
+                        if parts:
+                            return "\n".join(parts).strip()
+            for key in ("output_text", "text", "response", "translated_text"):
+                if key in j and isinstance(j[key], str):
+                    return j[key].strip()
+        except Exception as e:
+            print(f"Warning: GL translate failed ({prefix}): {e}")
+
+    return text
+
+
+def translate_srt_file(in_path: str, out_path: str, api_key: str, model: str):
+    with open(in_path, "r", encoding="utf-8") as fh:
+        subs = list(srt.parse(fh.read()))
+
+    for i, sub in enumerate(subs, start=1):
+        text = sub.content.strip()
+        if not text:
+            continue
+        # call the API
+        try:
+            translated = translate_text_google_gl(text, api_key, model=model)
+        except Exception as e:
+            print(f"Warning: translate failed for index {sub.index}: {e}")
+            translated = text
+        # assign the translation
+        sub.content = translated
+        # short delay to avoid hitting the API too quickly
+        time.sleep(0.15)
+        print(f"Translated {i}/{len(subs)}")
+
+    out_s = srt.compose(subs)
+    with open(out_path, "w", encoding="utf-8") as fh:
+        fh.write(out_s)
+    print(f"Wrote translated SRT to: {out_path}")
+
+
+def main():
+    p = argparse.ArgumentParser()
+    p.add_argument("--in", dest="in_srt", required=True)
+    p.add_argument("--out", dest="out_srt", required=True)
+    p.add_argument("--gemini-api-key", default=None)
+    p.add_argument("--model", default="gemini-2.5-flash")
+    args = p.parse_args()
+
+    key = args.gemini_api_key or os.environ.get("GEMINI_API_KEY")
+    if not key:
+        print("Provide --gemini-api-key or set GEMINI_API_KEY env var", flush=True)
+        raise SystemExit(2)
+
+    translate_srt_file(args.in_srt, args.out_srt, key, args.model)
+
+
+if __name__ == '__main__':
+    main()