Create AI Videos with Happy Horse

Just provide text or images, and Happy Horse will instantly produce cinema-grade videos with intelligent soundtracks.

#1 Video Arena T2V · 1333 Elo Rating · 15B Parameters · 7 Lip-Sync Languages

Video Generator

Turn your prompts into cinematic videos. Try it now.

Powered by Happy Horse 1.0 · Elo 1333

About HappyHorse

Released in early 2026, Happy Horse 1.0 is built around a 40-layer self-attention Transformer architecture. It is fully open source with commercial-use rights; the release includes the base model, the distilled model, a super-resolution module, and inference code, all designed to be self-hosted, fine-tuned, and deployed on your own infrastructure.

Unified Self-Attention

Unlike DiT-based models that use cross-attention for text conditioning, HappyHorse puts text, image, video, and audio into the same token sequence. Attention handles all modality fusion natively.
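
A minimal sketch of what single-stream fusion means in practice (the dimensions and pre-embedded tokens below are hypothetical stand-ins; HappyHorse's actual tokenizers are not public): embed each modality, concatenate everything into one token sequence, and let ordinary self-attention do the mixing.

import torch
import torch.nn as nn

D = 1024  # shared model width (hypothetical; real dims are not public)

# Pre-embedded tokens for each modality (stand-ins for real tokenizer outputs).
text_tokens  = torch.randn(1, 77,   D)
image_tokens = torch.randn(1, 256,  D)
video_tokens = torch.randn(1, 2048, D)
audio_tokens = torch.randn(1, 512,  D)

# Single-stream fusion: one concatenated sequence, no cross-attention.
seq = torch.cat([text_tokens, image_tokens, video_tokens, audio_tokens], dim=1)

attn = nn.MultiheadAttention(embed_dim=D, num_heads=16, batch_first=True)
fused, _ = attn(seq, seq, seq)  # every token can attend to every other token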

Sandwich Layer Layout

First 4 and last 4 layers handle modality-specific embedding and decoding; the middle 32 layers share parameters across all modalities for maximum parameter efficiency.
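
One plausible reading of that layout, sketched in PyTorch (the layer class, widths, and per-modality organization here are assumptions, not the report's exact design):

import torch.nn as nn

D, HEADS = 1024, 16
MODALITIES = ("text", "image", "video", "audio")

def layer():
    return nn.TransformerEncoderLayer(d_model=D, nhead=HEADS, batch_first=True)

# Sandwich layout: 4 + 32 + 4 = 40 layers total.
encoders = nn.ModuleDict({m: nn.Sequential(*(layer() for _ in range(4))) for m in MODALITIES})
trunk    = nn.Sequential(*(layer() for _ in range(32)))  # one set of weights serves all modalities
decoders = nn.ModuleDict({m: nn.Sequential(*(layer() for _ in range(4))) for m in MODALITIES})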

Per-Head Sigmoid Gating

Learned scalar gates on each attention head selectively dampen destructive gradients, solving the notorious instability of joint multimodal training.
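
A sketch of the gating idea (the placement is an assumption: here the gates scale each head's output before the output projection, which may differ from HappyHorse's implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedSelfAttention(nn.Module):
    """Self-attention with one learned sigmoid gate per head (illustrative sketch)."""

    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.gate_logits = nn.Parameter(torch.zeros(heads))  # one learned scalar per head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, s, self.heads, self.head_dim).transpose(1, 2) for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)  # (b, heads, s, head_dim)
        # sigmoid(0) = 0.5 at init; a gate driven toward 0 mutes a head whose
        # gradients would otherwise destabilize joint multimodal training.
        out = out * torch.sigmoid(self.gate_logits).view(1, -1, 1, 1)
        return self.proj(out.transpose(1, 2).reshape(b, s, d))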

Timestep-Free Denoising

No explicit timestep embeddings: the model infers the denoising state directly from the noise level of the input latents, enabling aggressive 8-step DMD-2 distillation.
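
The practical consequence is a sampler whose model call takes no timestep argument. As a rough illustration (the schedule and update rule below are invented for clarity, not taken from the report):

import torch

@torch.no_grad()
def sample(model, shape, steps=8):
    """Hypothetical 8-step sampler: no timestep is ever passed to the model."""
    x = torch.randn(shape)                        # start from pure noise
    sigmas = torch.linspace(1.0, 0.0, steps + 1)  # assumed noise schedule
    for i in range(steps):
        x0 = model(x)            # model infers its denoising state from x itself
        x = x0 + sigmas[i + 1] * torch.randn_like(x)  # re-noise to the next level
    return x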

Core Capabilities of HappyHorse

A unified multimodal architecture purpose-built for joint video and audio generation.

Unified Transformer

40-layer self-attention network with 4 modality-specific layers on each end and 32 shared layers: single-stream processing with per-head gating for stable training.

Joint Video + Audio

Generates synchronized dialogue, ambient sound, and Foley alongside video frames; no post-production dubbing required.

8-Step DMD-2 Distillation

Reduces denoising to just 8 steps without classifier-free guidance, accelerated further by the in-house MagiCompiler runtime.

Multilingual Lip-Sync

Native support for English, Mandarin, Cantonese, Japanese, Korean, German, and French with industry-leading low Word Error Rate.

1080p Output

5–8 second clips at 1080p in standard aspect ratios (16:9, 9:16), suitable for social, advertising, and cinematic use cases.

Open & Self-Hostable

Base model, distilled model, super-resolution module, and inference code released openly with commercial-use permission.

Artificial Analysis Video Arena

Ranked by blind user voting on the Artificial Analysis Video Arena, the standard public benchmark: Elo ratings are computed from head-to-head comparisons in which users don't know which model made which output. No self-reported benchmarks.

#1 · Elo 1333 · Text-to-Video (no audio)
#1 · Elo 1392 · Image-to-Video (no audio)
#2 · Elo 1205 · Text-to-Video (with audio)
#2 · Elo 1161 · Image-to-Video (with audio)

+60 Elo vs Seedance 2.0 (T2V)
+37 Elo vs Seedance 2.0 (I2V)
80.0% win rate vs OVI 1.1

Text-to-Video Leaderboard

Top of the Artificial Analysis T2V leaderboard (no audio), early April 2026.

Rank | Model | Elo | Released
🏆 #1 | HappyHorse-1.0 | 1333 | Apr 2026
#2 | Seedance 2.0 720p | 1273 | Mar 2026
#3 | SkyReels V4 | 1245 | Mar 2026
#4 | Kling 3.0 1080p Pro | 1241 | Feb 2026
#5 | PixVerse V6 | 1240 | Mar 2026

Image-to-Video Leaderboard

Top of the Artificial Analysis I2V leaderboard (no audio), early April 2026.

Rank | Model | Elo | Released
🏆 #1 | HappyHorse-1.0 | 1392 | Apr 2026
#2 | Seedance 2.0 | 1355 | Mar 2026
#3 | PixVerse V6 | 1338 | Mar 2026
#4 | Grok Imagine Video | 1333 | Mar 2026
#5 | Kling 3.0 Omni | 1297 | Feb 2026

Data sourced from Artificial Analysis Video Arena. Elo ratings are recomputed continuously as new votes come in. Snapshot as of early April 2026.

Try HappyHorse 1.0

Happy Horse 1.0 runs on high-performance GPUs such as NVIDIA H100 or A100 (≥48GB VRAM recommended). FP8 quantization and the 8-step distilled checkpoint reduce the memory footprint for single-GPU deployment.

View live rankings on Artificial Analysis Video Arena

from happyhorse import HappyHorseModel

# Load the released checkpoint.
model = HappyHorseModel.from_pretrained(
    "happy-horse/happyhorse-1.0"
)

# Generate video and synchronized audio from a text prompt.
video, audio = model.generate(
    prompt="an elder on a mountain peak overlooking the valley",
    duration_seconds=5,
    fps=24,
    language="en",  # one of the supported lip-sync languages
)

video.save("output.mp4")
audio.save("output.wav")
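
The loading options for single-GPU deployment are not documented here, so the flags below are purely illustrative of what an FP8, distilled-checkpoint load could look like; check the released inference code for the real names.

# Hypothetical options; argument names and checkpoint IDs are assumptions.
model = HappyHorseModel.from_pretrained(
    "happy-horse/happyhorse-1.0-distilled",  # assumed name of the 8-step checkpoint
    dtype="fp8",                             # assumed FP8 quantization switch
    device="cuda",
)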

Built by Researchers, Trusted by Builders

HappyHorse is published and maintained by the HappyHorse research team, with a transparent technical report covering architecture, training methodology, distillation, benchmark protocols, and known limitations.

Expertise

Authored by practitioners working on multimodal Transformers, diffusion distillation, and large-scale video pretraining.

Transparency

Open weights, open inference code, and published benchmark methodology, verifiable by independent researchers.

Responsibility

We support content provenance, watermarking, and downstream moderation. Users are expected to comply with applicable AI regulations.

Frequently Asked Questions

Answers to common questions about Happy Horse 1.0.

What is Happy Horse 1.0?

Happy Horse 1.0 is a 15B-parameter open-source AI video generation model that jointly produces video and synchronized audio from text or image prompts. It first appeared as a "mystery model" on the Artificial Analysis Video Arena, where it reached #1 globally in both Text-to-Video (Elo 1333) and Image-to-Video (Elo 1392) under blind user voting.

Who built HappyHorse-1.0?

The team behind HappyHorse-1.0 has not been publicly identified. Artificial Analysis describes the submission as "pseudonymous." Community speculation points to an Asia-based origin, but no organization has officially claimed credit as of April 2026.

How does the Video Arena ranking work?

Artificial Analysis runs a Video Arena where users submit a prompt, the system generates outputs from two models, and users pick the one they prefer without knowing which model made which. Votes feed into an Elo rating system (the same math as chess rankings), so rankings are based entirely on aggregate human preference under blind conditions, not self-reported benchmarks.
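
To make the rating math concrete, here is the standard Elo update that description implies (the arena's actual K-factor is not public; K = 32 is illustrative):

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """One blind head-to-head vote, standard chess Elo formula."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    delta = k * ((1.0 if a_won else 0.0) - expected_a)
    return r_a + delta, r_b - delta

# A 1333-rated model beating a 1273-rated one gains about 13 points:
print(elo_update(1333, 1273, a_won=True))  # -> (~1346.3, ~1259.7)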

Where does HappyHorse-1.0 rank?

As of early April 2026: Text-to-Video (no audio) #1 at Elo 1333, Image-to-Video (no audio) #1 at Elo 1392, Text-to-Video (with audio) #2 at Elo 1205, Image-to-Video (with audio) #2 at Elo 1161. The previous #1 in T2V was Dreamina Seedance 2.0 at Elo 1273, a 60-point gap.

Which languages does lip-sync support?

Six natively supported languages: Chinese, English, Japanese, Korean, German, and French. A secondary marketing page lists Cantonese as a seventh language. The official count should be confirmed at release.

What is the architecture?

A unified self-attention Transformer with 40 layers in a sandwich layout: the first and last 4 layers use modality-specific projections, while the middle 32 layers share parameters across all modalities. Text, image, video, and audio tokens are jointly denoised within one token sequence. No cross-attention. DMD-2 distillation reduces sampling to 8 steps without classifier-free guidance.

What hardware does it need?

An NVIDIA H100 or A100 GPU with at least 48GB VRAM is recommended. Reported inference speeds are roughly 2 seconds for a 5-second clip at 256p, and roughly 38 seconds for 1080p on an H100. These are self-reported figures and have not been independently verified.

Is HappyHorse related to Alibaba's WAN models?

Unconfirmed. Community speculation exists because anonymous pre-launch drops are common in the AI ecosystem; the Pony Alpha / GLM-5 precedent is the most recent example. However, no direct evidence connects HappyHorse to Alibaba's WAN model family, and the architecture description doesn't obviously match the publicly known WAN architecture.