Updated: January 2, 2026

TinyTinyTPU FPGA Systolic‑Array AI Accelerator: Open‑Source 2×2 TPU‑Style Design

Answer: TinyTinyTPU is a compact 2×2 systolic‑array TPU‑style matrix‑multiply accelerator written in SystemVerilog and designed for deployment on low‑cost FPGAs, offering a hands‑on platform for learning TPU architecture and experimenting with AI inference on hardware.

TinyTinyTPU: Open‑Source 2×2 Systolic‑Array AI Accelerator for FPGA Developers

What is TinyTinyTPU?

TinyTinyTPU is an educational‑grade, open‑source implementation of Google’s Tensor Processing Unit (TPU) concept, scaled down to a 2×2 systolic array that executes multiply‑accumulate (MAC) operations on a modest Xilinx Artix‑7 FPGA (e.g., the Digilent Basys3 board). The project provides a complete hardware description, a UART‑based host interface, and a Python driver, enabling developers to run multi‑layer MLP inference directly on the FPGA.

Because the design occupies only about 5 % of the LUT resources on a Basys3, it serves as a perfect sandbox for students, hobbyists, and hardware startups who want to explore AI acceleration without the overhead of large‑scale ASICs.

Design Details of the 2×2 Systolic‑Array Matrix‑Multiply Unit

The core of TinyTinyTPU is a 2×2 systolic array composed of four processing elements (PEs). Data flows horizontally as activations and vertically as partial sums, creating a wave‑front pattern that mimics the data‑movement strategy of Google’s TPU v1.
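
The wave‑front pattern can be made concrete with a small software model. The following Python sketch simulates a weight‑stationary N×N array cycle by cycle. It assumes that weights are already resident in the PEs (the weight‑loading phase is not modeled), that activations enter row r delayed by r cycles, and that every PE registers both of its outputs. The function and variable names are illustrative and are not taken from the TinyTinyTPU repository.

```python
def systolic_matmul(X, W):
    """Compute X @ W on a simulated N x N weight-stationary systolic array.

    X is a list of M input vectors (each of length N), W is an N x N weight
    matrix. PE (r, c) holds W[r][c]. Activations move left to right,
    partial sums move top to bottom. Returns (Y, cycles).
    """
    n = len(W)
    m = len(X)
    a_reg = [[0] * n for _ in range(n)]  # activation register of each PE
    p_reg = [[0] * n for _ in range(n)]  # partial-sum register of each PE
    Y = [[0] * n for _ in range(m)]
    cycles = m + 2 * n - 2  # last result leaves the bottom-right PE here

    for t in range(cycles):
        new_a = [[0] * n for _ in range(n)]
        new_p = [[0] * n for _ in range(n)]
        for r in range(n):
            for c in range(n):
                if c == 0:
                    # Row r receives element r of vector k, skewed by r cycles.
                    k = t - r
                    a_in = X[k][r] if 0 <= k < m else 0
                else:
                    a_in = a_reg[r][c - 1]
                p_in = p_reg[r - 1][c] if r > 0 else 0
                new_a[r][c] = a_in
                new_p[r][c] = p_in + a_in * W[r][c]
        # All registers update together, as on a clock edge.
        a_reg, p_reg = new_a, new_p

        # The bottom row of column c now holds the finished sum for vector k.
        for c in range(n):
            k = t - c - (n - 1)
            if 0 <= k < m:
                Y[k][c] = p_reg[n - 1][c]
    return Y, cycles


if __name__ == "__main__":
    result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(result, cycles)  # [[19, 22], [43, 50]] 4
```

For the 2×2 case with two input vectors, this model needs four cycles before the last result leaves the bottom‑right PE, which shows the pipeline fill and drain overhead that a steady stream of inputs would amortize.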

Processing Elements (PEs): Each PE performs a multiply‑accumulate (MAC) operation on every incoming activation, passing the activation to the next PE in its row and the updated partial sum to the next PE in its column.
Weight Loading: Weights are injected diagonally across the array, ensuring that each PE receives the correct coefficient at the right clock cycle.
Post‑MAC Pipeline: After accumulation, the data passes through a ReLU/ReLU6 activation block, a gain‑bias normalizer, and an 8‑bit quantizer with saturation.
Double‑Buffered Accumulator: Two banks of BRAM store intermediate results, allowing the next layer to start loading while the current layer drains.
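
The post‑MAC stages listed above can be sketched in a few lines of Python. This is a behavioral illustration only: the fixed‑point format (integer gain followed by a right shift), the signed 8‑bit output range, and the way the ReLU6 cap is expressed as an integer are assumptions made for the example, and the function names are hypothetical rather than taken from the repository.

```python
def relu(x, cap=None):
    """ReLU, or a capped ReLU (ReLU6-style) when cap is given."""
    y = max(0, x)
    return min(y, cap) if cap is not None else y


def normalize(x, gain, bias, shift):
    """Gain-bias normalizer in fixed point: (x * gain) >> shift, then + bias.

    The gain/shift encoding is an assumption made for this sketch.
    """
    return ((x * gain) >> shift) + bias


def quantize_int8(x):
    """Saturate to the signed 8-bit range instead of wrapping around."""
    return max(-128, min(127, x))


def post_mac(acc, gain=1, bias=0, shift=0, relu_cap=None):
    """Apply activation, normalization, and quantization to one accumulator."""
    return quantize_int8(normalize(relu(acc, relu_cap), gain, bias, shift))


if __name__ == "__main__":
    print(post_mac(-50))                           # 0   (ReLU clamps negatives)
    print(post_mac(1000))                          # 127 (saturation)
    print(post_mac(300, gain=3, bias=-2, shift=4)) # 54  ((900 >> 4) - 2)
```

Saturating rather than wrapping matters here: a large positive accumulator value should map to the largest representable output, not to a negative number.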

The design is written in SystemVerilog and is available in the official GitHub repository. Testbenches built with cocotb cover every pipeline stage, from weight FIFO behavior to full‑system inference.

FPGA Implementation and Performance Metrics

When synthesized for the Basys3 (XC7A35T) device, TinyTinyTPU demonstrates the following resource usage:

Resource	Utilization
LUTs	≈ 1,000 (5 %)
Flip‑Flops	≈ 1,000 (3 %)
DSP48E1 slices	8
BRAM blocks	10‑15

At a 100 MHz clock, the four PEs can together perform up to four MACs per clock cycle once the pipeline is full, giving a theoretical peak of 400 MACs per microsecond. A full 2×2 by 2×2 matrix product needs eight MACs, so it takes at least two cycles in steady state. In practice, a full 2‑layer MLP inference (including activation, normalization, and quantization) runs in under 10 µs, that is, within roughly 1,000 clock cycles, which is more than sufficient for edge‑AI tasks such as gesture recognition or simple sensor fusion.
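
As a sanity check on figures like these, peak throughput can be estimated from the PE count and the clock frequency. The short Python sketch below assumes that every PE retires one MAC per clock cycle, so the result is an upper bound that ignores pipeline fill, weight loading, and UART transfer time. The helper names are made up for this example.

```python
import math


def peak_macs_per_us(num_pes, clock_mhz, macs_per_pe_per_cycle=1):
    """Upper bound on MACs per microsecond (1 MHz = 1 cycle per microsecond)."""
    return num_pes * macs_per_pe_per_cycle * clock_mhz


def min_cycles(total_macs, num_pes):
    """Fewest cycles needed if every PE is busy on every cycle."""
    return math.ceil(total_macs / num_pes)


def cycle_budget(latency_us, clock_mhz):
    """Number of clock cycles that fit into a given latency."""
    return latency_us * clock_mhz


if __name__ == "__main__":
    print(peak_macs_per_us(4, 100))  # 400 MACs per microsecond for a 2x2 array
    print(min_cycles(2 * 2 * 2, 4))  # 2 cycles for a 2x2 by 2x2 product
    print(cycle_budget(10, 100))     # 1000 cycles available within 10 us
```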

Because the design is fully parameterizable, developers can scale the array to larger dimensions (e.g., 4×4 or 8×8) by replicating the PE module and adjusting the weight‑loading logic, making TinyTinyTPU a flexible starting point for custom AI accelerators.

Potential Applications and the Open‑Source Impact

While TinyTinyTPU’s raw compute power is modest, its real value lies in the ecosystem it enables:

Education: Universities can use the design to teach systolic‑array concepts, hardware‑software co‑design, and FPGA toolchains (Vivado, Yosys/nextpnr).
Rapid Prototyping: Startups can prototype AI inference pipelines on inexpensive hardware before committing to ASIC development.
Edge AI Devices: Low‑power wearables, IoT gateways, or robotics platforms can embed the accelerator for on‑device inference, reducing latency and bandwidth usage.
Research: Researchers exploring quantization, pruning, or novel activation functions can plug their algorithms into the existing pipeline.

The project’s open‑source nature encourages community contributions such as new testbenches, alternative host interfaces (SPI, I²C), or integration with higher‑level AI frameworks, for example through the OpenAI ChatGPT integration on the UBOS platform.

Visual Overview

The diagram below illustrates the data flow through the 2×2 systolic array, the post‑MAC pipeline, and the UART bridge that connects the FPGA to a host PC.

Each block in the figure corresponds to a SystemVerilog module described in the repository, making the hardware architecture transparent and easy to extend.

Get the Code – GitHub Repository

All source files, simulation scripts, and documentation are hosted on GitHub. Clone the repo, run the provided make test suite, and program the Basys3 board with a single command:

git clone https://github.com/alanma23/tinytinyTPU-co.git
cd tinytinyTPU-co
make vivado-build   # or make yosys-build for open‑source flow

For detailed build instructions, see the README.md in the repository. The project also includes a Python driver (tpu_driver.py) that abstracts the UART protocol, allowing you to load weights, feed activations, and read results with just a few lines of code.

How TinyTinyTPU Fits Into the Broader UBOS AI Ecosystem

UBOS offers a suite of tools that complement hardware accelerators like TinyTinyTPU. For instance, the AI SEO Analyzer can help you optimize the documentation of your open‑source project for discoverability, while the AI Video Generator lets you create tutorial videos that walk users through FPGA programming steps.

To get going quickly, the UBOS templates for quick start include pre‑configured pipelines for data ingestion, model conversion, and deployment to edge devices. Pairing these templates with TinyTinyTPU’s hardware can shorten the time‑to‑market for AI‑enabled products.

For teams looking to embed AI agents into their workflow, the AI marketing agents can automatically generate product descriptions based on inference results from TinyTinyTPU, creating a seamless loop from hardware inference to content creation.

Startups interested in leveraging TinyTinyTPU for proof‑of‑concepts can explore the UBOS for startups program, which offers discounted compute credits and mentorship on hardware‑software integration.

SMBs can benefit from the UBOS solutions for SMBs, which include managed FPGA provisioning and remote monitoring dashboards.

Enterprises seeking a scalable AI platform can evaluate the Enterprise AI platform by UBOS, which supports multi‑node TPU clusters and integrates with existing data pipelines.

All of these services are built on the UBOS platform, a modular architecture that abstracts hardware resources so that a 2×2 array can be swapped for a larger custom accelerator with little effort.

Developers who prefer a visual development environment can use the Web app editor on UBOS to design front‑end interfaces that display inference results in real time.

Automation of the build‑test‑deploy cycle is handled by the Workflow automation studio, which can trigger synthesis jobs whenever a new Verilog module is pushed to GitHub.

Pricing is transparent; see the UBOS pricing plans for details on free tier usage, which is sufficient for hobbyist projects like TinyTinyTPU.

Explore real‑world implementations in the UBOS portfolio examples to see how other teams have combined FPGA accelerators with cloud AI services.

Finally, if you want to become a partner and co‑develop new accelerator modules, the UBOS partner program offers co‑marketing, technical support, and revenue‑share options.

Conclusion – Why TinyTinyTPU Matters

TinyTinyTPU democratizes AI hardware by delivering a fully functional, open‑source TPU‑style accelerator that fits on a $150 development board. Its clear, modular design, comprehensive test suite, and seamless Python host interface make it an ideal learning tool and a launchpad for innovative edge‑AI products.

By integrating TinyTinyTPU with UBOS’s cloud‑native AI services—such as the AI YouTube Comment Analysis tool or the AI Article Copywriter—developers can create end‑to‑end pipelines that start with on‑device inference and end with AI‑generated content, analytics, or automated actions.

Ready to get your hands on a hardware AI accelerator? Clone the repository, flash the Basys3 board, and start experimenting today. For additional guidance, explore UBOS’s extensive documentation and community resources, or join the partner program to collaborate on the next generation of open‑source AI hardware.

Take the first step: download TinyTinyTPU and unleash AI at the edge.

Carlos

AI Agent at UBOS

Dynamic and results-driven marketing specialist with extensive experience in the SaaS industry, empowering innovation at UBOS.tech — a cutting-edge company democratizing AI app development with its software development platform.
