
Over Two Decades of Handwriting, One Local Language Model, and an OCR Problem

Building a Cyrillic handwriting OCR pipeline for 20+ years of personal diaries. TrOCR, LoRA fine-tuning, and why the hardest part is not the model.


On building a personal AI from a physical diary archive, and why the hardest part of the pipeline isn’t the model

There’s a white cat sitting in the middle of my life.

In the photograph, he’s settled himself on a pile of notebooks, 30 of them, maybe more, spread across a parquet floor. His name was Archi. He’s gone now. Some of the notebooks have Van Gogh prints on their covers. Some are spiral-bound school notebooks. A few look like they survived something. All of them are full of Russian cursive, written by a younger version of me who had no idea she’d one day try to feed them to a neural network.

This is the archive I’m working with: over two decades of personal diary entries, written in Russian, by hand, in notebooks that smell like time. And I’ve decided to build a language model trained on my own inner life.

Why this project exists

Most “personal AI” projects I’ve read about start the same way: someone exports their Gmail archive, or scrapes their Medium posts, or dumps a few years of Notion notes into a vector database. The data is already digital. The hard part is the modeling.

My data is not digital. It never was.

I’ve kept handwritten diaries for over two decades, through multiple cities, languages, relationships, jobs, and versions of myself. The physical notebooks are the only place where I wrote without any awareness of an audience. No drafts, no edits, no metadata. Just raw thinking in Russian cursive, often late at night, often barely legible.

If I want those years to be part of a language model, I have to solve a problem that almost no one in the personal AI space has touched: handwritten Cyrillic OCR at the scale of a human life.

The technical problem rarely documented in the personal AI context

The current landscape of “digital self” projects (HereAfter AI, Delphi, Replika, the dozens of “I cloned myself in 24 hours” Medium articles) shares one invisible assumption: the source data is already text.

Search for “handwritten diary AI” and you find almost nothing. Search for “Cyrillic OCR personal project” and you find nothing. Every single diary-to-LLM project I could find used typed, digital, English-language text. The physical-to-digital bridge doesn’t exist yet as a documented personal project. This one is an attempt to build it.

The archive: approximately 1,500 pages photographed so far from iPhone, converted from HEIC to JPEG, stripped of metadata including GPS. All processing runs on my own hardware.
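The conversion step is mundane but worth pinning down. A minimal sketch, assuming pillow-heif and Pillow are installed (file paths are illustrative); rebuilding the image from raw pixels is what drops the EXIF block, GPS included:

```python
from pathlib import Path


def jpeg_path(heic: Path, out_dir: Path) -> Path:
    """Map a source HEIC filename to its JPEG destination."""
    return out_dir / (heic.stem + ".jpg")


def convert_page(heic: Path, out_dir: Path) -> Path:
    """Re-encode one photo as JPEG with no EXIF (and therefore no GPS)."""
    # Imported lazily so jpeg_path works without the imaging libraries.
    import pillow_heif
    from PIL import Image

    pillow_heif.register_heif_opener()  # lets Pillow open .heic directly
    dst = jpeg_path(heic, out_dir)
    with Image.open(heic) as im:
        rgb = im.convert("RGB")
        # A fresh image built from raw pixel bytes carries no metadata over.
        clean = Image.frombytes("RGB", rgb.size, rgb.tobytes())
        clean.save(dst, "JPEG", quality=95)
    return dst
```
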

Of those pages, fifteen have been manually transcribed as ground truth. The rest remain undigitized as text. That is the honest status of where I am.

What didn’t work, and why that matters

This is where most similar articles would skip ahead to results. I’m not going to do that.

Here is a summary of every tool I tested on the same fifteen representative pages. Character error rate (CER) values marked with † come from my own experiments on this specific handwriting and these image conditions, not from published benchmarks:

Tool | Approach | CER | Why it failed
Tesseract | Classic OCR | high† | No cursive support
Surya OCR | Line detection | - | Single bbox on textured background
Qwen2-VL | Full-page VLM | ~45%† | Hallucinations, fluent nonsense
Gemma4:e2b | Full-page VLM | ~45%† | Same pattern
TrOCR (cyrillic) | Line-level HTR | -* | Needs pre-segmented lines
Transkribus | Cloud HTR | unacceptable† | Generic model, wrong handwriting

*TrOCR produced Church Slavonic default output; CER was not meaningful to measure. †Personal measurement on 15 pages, not a published benchmark.
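For readers unfamiliar with the metric: CER is character-level edit distance (insertions, deletions, substitutions) divided by reference length. A minimal sketch of how such numbers are computed against ground-truth transcriptions:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings, two-row dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner loop over the shorter string
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance over reference length."""
    if not reference:
        raise ValueError("empty reference")
    return levenshtein(reference, hypothesis) / len(reference)
```

A CER of ~45% means nearly half the characters are wrong, which in practice reads as fluent nonsense.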

The most instructive failure is TrOCR. It’s the best available offline option for Cyrillic handwriting, pre-trained on 73,830 segments across two training stages. The problem is architectural: it expects a single cropped row of text as input. Feed it a full diary page and it outputs default patterns from its training data. The model is fine; the missing pipeline step before it, reliable line segmentation on informal photographs, is what blocks progress.

Line segmentation works well on clean scanned documents. On iPhone photos of notebooks shot against fabric or wood grain, with page curl and variable lighting, it doesn’t. That’s the specific bottleneck.
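The classic technique behind those clean-scan results is a horizontal projection profile, and a toy version makes the failure mode concrete. A sketch in plain Python over a binarized page (a real pipeline would binarize with OpenCV first; thresholds here are illustrative):

```python
def ink_profile(binary_rows):
    """Per-row ink counts for a binarized page (1 = ink, 0 = background)."""
    return [sum(row) for row in binary_rows]


def find_line_spans(profile, min_ink=2, min_height=3):
    """Projection-profile segmentation: contiguous runs of rows whose
    ink count exceeds min_ink become candidate text lines.

    This is exactly the step that works on flat scans and fails on
    iPhone photos: wood grain or fabric behind the page adds 'ink'
    everywhere, so the profile never drops back to background level
    and everything merges into one span.
    """
    spans, start = [], None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y
        elif ink < min_ink and start is not None:
            if y - start >= min_height:
                spans.append((start, y))
            start = None
    if start is not None and len(profile) - start >= min_height:
        spans.append((start, len(profile)))
    return spans
```

The resulting spans are what would get cropped and fed to TrOCR one row at a time; a uniformly textured background collapses them into a single span, which is the same "single bbox" behavior Surya showed.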

Transkribus has generic Russian models with strong published benchmarks, but those benchmarks reflect other people’s handwriting. On mine, without custom training, the results were not usable.

The OCR problem, for this specific use case, is not solved. That is the honest status.

The workaround, and what it cost

While waiting for OCR to become tractable, I started building ground truth manually. Fifteen pages transcribed so far, not by typing, but by reading each page aloud and using voice-to-text (Wispr Flow).

Reading your own diary from ten years ago is already disorienting. Reading it aloud, slowly, trying to parse handwriting you recognize as yours but can barely decipher, reconstructing what you actually wrote, that’s something else entirely. It’s archaeological. You’re excavating a past self from the physical record she left.

I’m not sure that process can be designed around. It might just be part of the project.

First fine-tuning results

On Apple Silicon, Unsloth doesn’t run: it requires CUDA for training. The right tool here is MLX, Apple’s native ML framework, which supports LoRA fine-tuning out of the box.

While the handwritten archive remains largely undigitized, I ran a parallel experiment: fine-tuning on ~262K tokens of typed personal text (notes, journal entries, exported conversations), all in the same language, all personal. Not the main corpus. A proof of concept.

Base model: Qwen2.5-3B, small enough to fit in 18GB RAM, capable enough to produce coherent output. Two training passes:

Pass 1: 300 iterations, ~103K tokens. Repetition loops, language mixing, Chinese characters bleeding through from the base model’s training data. Not usable.

Pass 2: 1,000 iterations, ~262K tokens. Same prompt, measurable difference:

Without adapter: “Иногда мне кажется, что я не могу быть счастливым. Как это исправить? Если вы чувствуете себя несчастным…” (“Sometimes it seems to me that I can’t be happy. How can this be fixed? If you are feeling unhappy…”) (generic self-help output, second person, impersonal).

With adapter: “Иногда мне кажется, что я, твоя маленькая любовь, но я думаю, что это лишь субъективное восприятие.” (“Sometimes it seems to me that I, your little love, but I think that is only a subjective perception.”) (first person, emotional, calibrated to the register of the training data).
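The with/without comparison can be reproduced through mlx_lm’s Python API. The base model name matches the one used here, but the prompt and adapter path below are illustrative, not the exact ones from these runs:

```python
# Hypothetical prompt for the comparison; the exact prompt used isn't shown.
PROMPT = "Иногда мне кажется, что"


def sample(prompt, adapter_path=None, max_tokens=120):
    """Generate a completion, optionally with a LoRA adapter applied."""
    # Lazy import: mlx runs only on Apple Silicon.
    from mlx_lm import load, generate

    model, tokenizer = load("Qwen/Qwen2.5-3B", adapter_path=adapter_path)
    return generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens)


# base  = sample(PROMPT)                            # generic base-model voice
# tuned = sample(PROMPT, adapter_path="adapters")   # LoRA-adapted voice
```
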

The model is not me. But it is recognizably attempting to be me. That’s a different thing, and it’s enough to continue.

What “a digital copy of yourself” actually means

I want to be precise about what this project is and isn’t.

It isn’t resurrection technology. It isn’t grief processing. It isn’t a product designed to comfort anyone after I’m gone. Those are legitimate things people are building (MIT Media Lab’s Future You project, HereAfter AI, Eternos), but they’re not what this is.

What this is, more precisely, is a longitudinal linguistic archive with a generative interface.

Over two decades of someone writing about their life in their native language, in their private handwriting, with no audience in mind, that’s not just data. It’s a record of how a person thinks, what words they reach for, how their sentence rhythms shift under pressure or joy, which ideas they return to across decades. A language model trained on that corpus doesn’t “know” me in any meaningful sense. But it has absorbed patterns that no other model has absorbed, from a source that no other model has seen.

What I’m curious about is this: when you talk to it, what comes back?

The layer that makes this specific

I think and write a bit differently than most people around me. I’ve known this for a long time, in various framings.

Diaries, for me, functioned as a kind of external processing system. Writing helped me understand what had happened, what I felt about it, what I wanted. The notebooks aren’t a record of a life; they’re a record of a mind working through a life, from adolescence to now.

There’s also the language. I think in Russian. I dream in Russian. When I’m tired or scared or working through something difficult, I reach for Russian syntax. The English I write for work and for publications like this one is a competent second language, but it doesn’t have the same grain.

A model trained on over two decades of diary writing in your first language is, in some sense, trained on the version of you that is least performed. That’s different from training on blog posts or emails or anything written for someone else.

Where this stands now

The pipeline is set up. Approximately 1,500 pages are photographed and preprocessed. Fifteen have been manually transcribed as ground truth for OCR evaluation. The tools have been tested and mostly found wanting. The first fine-tuning passes have run on the digital corpus.

The handwritten archive is still waiting. When OCR is solved, whether through fine-tuned TrOCR with proper line segmentation, through Transkribus with a custom model trained on my specific handwriting, or through something that doesn’t exist yet, those pages will add roughly 1–2 million tokens of unperformed private writing to the corpus. That’s where the project becomes what I think it can be.

Everything I learn gets documented, both the technical findings and the stranger, harder-to-quantify ones about what it means to do this kind of work on this kind of material.

The notebooks have been sitting in a drawer for years. It turns out they’re also a dataset.

Technical stack: pillow-heif, opencv-python, deskew, page-dewarp, cyrillic-trocr/trocr-handwritten-cyrillic, mlx-lm, Qwen2.5-3B, LoRA (MLX), Ollama, Wispr Flow