Create AI Videos with
Happy Horse
Just provide text or images, and Happy Horse will instantly produce cinema-grade videos with intelligent soundtracks.
Video Generator
Turn your prompts into cinematic videos – try it now.
Powered by Happy Horse 1.0 · Elo 1333
Video Preview
Enter a prompt and click Generate
About HappyHorse
Developed and released in early 2026 by the Happy Horse team, Happy Horse 1.0 is built around a 40-layer self-attention Transformer architecture. The release is fully open source with commercial-use rights and includes the base model, the distilled model, a super-resolution module, and inference code – designed to be self-hosted, fine-tuned, and deployed on your own infrastructure.
Unified Self-Attention
Unlike DiT-based models that use cross-attention for text conditioning, HappyHorse puts text, image, video, and audio into the same token sequence. Attention handles all modality fusion natively.
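As a toy illustration of single-stream fusion (hypothetical shapes, a single attention head, NumPy instead of the real stack – none of this is the released code): every modality is embedded into the same hidden size and concatenated into one sequence, so plain self-attention mixes modalities with no separate cross-attention path.

```python
import numpy as np

d_model = 64  # hypothetical hidden size for illustration
rng = np.random.default_rng(0)

# Each modality is first embedded into the same d_model space.
text_tokens  = rng.standard_normal((12, d_model))   # 12 text tokens
image_tokens = rng.standard_normal((256, d_model))  # 16x16 image patches
video_tokens = rng.standard_normal((512, d_model))  # spatiotemporal patches
audio_tokens = rng.standard_normal((100, d_model))  # audio frames

# Single stream: one joint sequence instead of cross-attention between streams.
seq = np.concatenate([text_tokens, image_tokens, video_tokens, audio_tokens])

def self_attention(x):
    """Plain single-head self-attention over the joint sequence."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

out = self_attention(seq)
print(out.shape)  # (880, 64): every token can attend to every modality
```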
Sandwich Layer Layout
First 4 and last 4 layers handle modality-specific embedding and decoding; the middle 32 layers share parameters across all modalities for maximum parameter efficiency.
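The parameter savings can be sketched with back-of-the-envelope arithmetic. The per-layer parameter count and the assumption that each outer layer is duplicated once per modality are ours for illustration, not figures from the release:

```python
P = 300e6        # hypothetical parameters per Transformer layer
MODALITIES = 4   # text, image, video, audio
OUTER = 4 + 4    # modality-specific embedding + decoding layers
SHARED = 32      # middle layers shared across all modalities

# Sandwich: only the 8 outer layers are duplicated per modality.
sandwich = OUTER * MODALITIES * P + SHARED * P

# Naive alternative: a fully modality-specific 40-layer stack per modality.
fully_specific = (OUTER + SHARED) * MODALITIES * P

print(sandwich / fully_specific)  # 0.4 -- 60% fewer parameters
```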
Per-Head Sigmoid Gating
Learned scalar gates on each attention head selectively dampen destructive gradients, solving the notorious instability of joint multimodal training.
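A minimal sketch of the idea (shapes and initialization are hypothetical): each head's output is scaled by a learned scalar squashed through a sigmoid, so a head that destabilizes joint training can be smoothly attenuated rather than switched off discontinuously.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_heads, seq_len, d_head = 8, 10, 16
rng = np.random.default_rng(0)

# One learned scalar gate per attention head; zero logits give a gate of 0.5.
gate_logits = np.zeros(n_heads)  # trained by backprop in practice

head_outputs = rng.standard_normal((n_heads, seq_len, d_head))

# Sigmoid keeps each gate in (0, 1): a head whose gradients hurt the other
# modalities can be damped toward 0 without a hard on/off cutoff.
gates = sigmoid(gate_logits)                 # shape (n_heads,)
gated = gates[:, None, None] * head_outputs  # broadcast over seq and channels
```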
Timestep-Free Denoising
No explicit timestep embeddings – the model infers denoising state directly from the noise level of input latents, enabling aggressive 8-step DMD-2 distillation.
Core Capabilities of HappyHorse
A unified multimodal architecture purpose-built for joint video and audio generation.
Unified Transformer
40-layer self-attention network with 4 modality-specific layers on each end and 32 shared layers – single-stream processing with per-head gating for stable training.
Joint Video + Audio
Generates synchronized dialogue, ambient sound, and Foley alongside video frames – no post-production dubbing required.
8-Step DMD-2 Distillation
Reduces denoising to just 8 steps without classifier-free guidance, accelerated further by the in-house MagiCompiler runtime.
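The shape of an 8-step distilled sampler can be sketched as below. The noise schedule and the toy network are stand-ins (the real model and the MagiCompiler runtime are not public APIs we can show); the key point is a single forward pass per step, since distillation removes the second, guidance pass that classifier-free guidance would require.

```python
import numpy as np

STEPS = 8
sigmas = np.linspace(1.0, 0.0, STEPS + 1)  # hypothetical noise schedule

def toy_model(x_t):
    """Stand-in for the distilled student: predicts the clean latent."""
    return 0.5 * x_t

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4))  # start from pure-noise latents

for i in range(STEPS):
    x0_pred = toy_model(x)  # single forward pass: no second CFG pass
    # Re-noise the predicted clean latent to the next (lower) noise level.
    x = x0_pred + sigmas[i + 1] * (x - x0_pred)

print(x.shape)  # (4, 4) denoised latent after only 8 steps
```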
Multilingual Lip-Sync
Native support for English, Mandarin, Cantonese, Japanese, Korean, German, and French with industry-leading low Word Error Rate.
1080p Output
5–8 second clips at 1080p in standard aspect ratios (16:9, 9:16) – suitable for social, advertising, and cinematic use cases.
Open & Self-Hostable
Base model, distilled model, super-resolution module, and inference code released openly with commercial-use permission.
Artificial Analysis Video Arena
Ranked by blind user voting on the Artificial Analysis Video Arena – the standard public benchmark using Elo ratings computed from head-to-head comparisons where users don't know which model made which output. No self-reported benchmarks.
Text-to-Video Leaderboard
Top of the Artificial Analysis T2V leaderboard (no audio), early April 2026.
| Rank | Model | Elo | Released |
|---|---|---|---|
| 🏆 #1 | HappyHorse-1.0 | 1333 | Apr 2026 |
| #2 | Seedance 2.0 720p | 1273 | Mar 2026 |
| #3 | SkyReels V4 | 1245 | Mar 2026 |
| #4 | Kling 3.0 1080p Pro | 1241 | Feb 2026 |
| #5 | PixVerse V6 | 1240 | Mar 2026 |
Image-to-Video Leaderboard
Top of the Artificial Analysis I2V leaderboard (no audio), early April 2026.
| Rank | Model | Elo | Released |
|---|---|---|---|
| 🏆 #1 | HappyHorse-1.0 | 1392 | Apr 2026 |
| #2 | Seedance 2.0 | 1355 | Mar 2026 |
| #3 | PixVerse V6 | 1338 | Mar 2026 |
| #4 | Grok Imagine Video | 1333 | Mar 2026 |
| #5 | Kling 3.0 Omni | 1297 | Feb 2026 |
Data sourced from Artificial Analysis Video Arena. Elo ratings are recomputed continuously as new votes come in. Snapshot as of early April 2026.
Try HappyHorse 1.0
Happy Horse 1.0 runs on high-performance GPUs such as NVIDIA H100 or A100 (≥48GB VRAM recommended). FP8 quantization and the 8-step distilled checkpoint reduce memory footprint for single-GPU deployment.
View live rankings on Artificial Analysis Video Arena

```python
from happyhorse import HappyHorseModel

model = HappyHorseModel.from_pretrained(
    "happy-horse/happyhorse-1.0"
)

video, audio = model.generate(
    prompt="an elder on a mountain peak overlooking the valley",
    duration_seconds=5,
    fps=24,
    language="en",
)

video.save("output.mp4")
audio.save("output.wav")
```

Built by Researchers, Trusted by Builders
HappyHorse is published and maintained by the HappyHorse research team, with a transparent technical report covering architecture, training methodology, distillation, benchmark protocols, and known limitations.
Expertise
Authored by practitioners working on multimodal Transformers, diffusion distillation, and large-scale video pretraining.
Transparency
Open weights, open inference code, and published benchmark methodology – verifiable by independent researchers.
Responsibility
We support content provenance, watermarking, and downstream moderation. Users are expected to comply with applicable AI regulations.
Frequently Asked Questions
Answers to common questions about Happy Horse 1.0.
Happy Horse 1.0 is a 15B-parameter open-source AI video generation model that jointly produces video and synchronized audio from text or image prompts. It first appeared as a "mystery model" on the Artificial Analysis Video Arena, where it reached #1 globally in both Text-to-Video (Elo 1333) and Image-to-Video (Elo 1392) under blind user voting.
The team behind HappyHorse-1.0 has not been publicly identified. Artificial Analysis describes the submission as "pseudonymous." Community speculation points to an Asia-based origin, but no organization has officially claimed credit as of April 2026.
Artificial Analysis runs a Video Arena where users submit a prompt, the system generates outputs from two models, and users pick the one they prefer – without knowing which model made which. Votes feed into an Elo rating system (the same math as chess rankings). This means rankings are based entirely on aggregate human preference under blind conditions, not self-reported benchmarks.
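To make the rating math concrete (a K-factor of 32 is a common convention, not Artificial Analysis's published value): under standard Elo, the 60-point T2V gap in the tables above implies the leader is expected to win roughly 58–59% of blind head-to-head votes.

```python
K = 32  # common Elo K-factor; the Arena's actual value is not stated here

def expected_score(r_a, r_b):
    """Probability that the r_a-rated model wins a head-to-head vote."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a, r_b, a_won):
    """Standard Elo update applied after each blind vote."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + K * (s_a - e_a), r_b + K * ((1.0 - s_a) - (1.0 - e_a))

# The T2V table's 60-point gap (1333 vs 1273):
e = expected_score(1333, 1273)
print(round(e, 3))  # expected win rate of the higher-rated model, ~0.585

new_a, new_b = update(1333, 1273, a_won=True)
```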
As of early April 2026: Text-to-Video (no audio) #1 at Elo 1333, Image-to-Video (no audio) #1 at Elo 1392, Text-to-Video (with audio) #2 at Elo 1205, Image-to-Video (with audio) #2 at Elo 1161. The previous #1 in T2V was Dreamina Seedance 2.0 at Elo 1273 – a 60-point gap.
Six natively supported languages: Chinese, English, Japanese, Korean, German, and French. A secondary marketing page lists Cantonese as a seventh language. The official count should be confirmed at release.
A unified self-attention Transformer with 40 layers in a sandwich layout โ the first and last 4 layers use modality-specific projections, while the middle 32 layers share parameters across all modalities. Text, image, video, and audio tokens are jointly denoised within one token sequence. No cross-attention. DMD-2 distillation reduces sampling to 8 steps without classifier-free guidance.
An NVIDIA H100 or A100 GPU with at least 48GB VRAM is recommended. Reported inference speeds are roughly 2 seconds for a 5-second clip at 256p, and roughly 38 seconds for 1080p on an H100. These are self-reported figures and have not been independently verified.
Unconfirmed. Community speculation exists because anonymous pre-launch drops are common in the AI ecosystem – the Pony Alpha / GLM-5 precedent being the most recent example. However, no direct evidence connects HappyHorse to Alibaba's WAN model family, and the architecture description doesn't obviously match publicly known WAN architecture.