- Updated: March 11, 2026
According to Me: Long-Term Personalized Referential Memory QA
Direct Answer
The paper introduces ATM‑Bench, a comprehensive benchmark for long‑term, personalized referential memory question answering, and a novel architecture called Schema‑Guided Memory (SGM) that enables AI assistants to store, retrieve, and reason over multimodal user memories across months or years. This matters because it tackles the missing piece in today’s assistants: the ability to remember *you* over the long haul and to use that memory reliably in downstream tasks.
Background: Why This Problem Is Hard
Personalized AI assistants are expected to act like extensions of a user’s mind—recalling past conversations, images, documents, and even preferences that span weeks, months, or years. In practice, three intertwined challenges prevent current systems from delivering that experience:
- Scale of long‑term memory. Traditional transformer‑based models excel at short context windows (e.g., 4K tokens) but quickly forget information beyond that horizon.
- Referential ambiguity. Users often refer to past events with pronouns, nicknames, or visual cues (“the photo from our trip to Kyoto”). Disambiguating such references requires a structured representation that links language to other modalities.
- Evaluation gap. Existing QA benchmarks focus on static knowledge bases or short dialogues, offering no systematic way to measure how well a system retains and reasons over personal, multimodal histories.
Attempts to patch these gaps—such as external vector stores, retrieval‑augmented generation, or simple key‑value caches—suffer from brittle retrieval, lack of schema enforcement, and no unified metric for "personalized referential memory." Consequently, developers cannot reliably gauge progress or compare approaches, and product teams cannot ship assistants that truly remember users.
What the Researchers Propose
The authors address both the benchmark deficit and the architectural shortfall with a two‑pronged contribution:
- ATM‑Bench (According to Me Benchmark). A curated suite of 12,000+ question‑answer pairs that span text, images, audio, and video. Each item is anchored to a user‑specific memory schema (e.g., Travel‑Log, Health‑Record, Home‑Inventory) and requires the model to resolve referential cues (“the last receipt I showed you”). The benchmark also provides a set of “memory‑update” actions that simulate real‑world interactions (adding a new photo, editing a note).
- Schema‑Guided Memory (SGM). An end‑to‑end neural framework that couples a Schema Engine with a Memory Store and a Reasoning Decoder. The Schema Engine enforces a typed graph structure (entities, attributes, relations) that mirrors how users organize personal data. The Memory Store persists embeddings of multimodal content keyed by schema nodes, while the Reasoning Decoder attends over both the current query and the schema‑constrained memory graph to generate answers.
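To make the benchmark's anchoring concrete, a single ATM‑Bench item might look like the following hypothetical record. The field names here are illustrative assumptions, not the paper's actual format:

```python
# Hypothetical sketch of one ATM-Bench item: a question anchored to a
# user-specific schema, with simulated memory-update actions that
# precede it. All field names are illustrative.
item = {
    "schema": "Travel-Log",                    # the personal schema this item targets
    "modalities": ["text", "image"],
    "memory_updates": [                        # simulated interactions before the question
        {"action": "add_photo", "id": "Photo_20250412"},
    ],
    "question": "Which city is in the photo from our trip last spring?",
    "referent": "Photo_20250412",              # the memory entry the reference resolves to
    "answer": "Kyoto",
}
```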
Key roles in SGM:
- Schema Builder. Dynamically constructs or updates the user's personal schema based on incoming data (e.g., adding a new `Trip` node with `Location=Kyoto`). It ensures consistency and provides a "semantic index" for retrieval.
- Memory Encoder. Converts raw modalities (text, image, audio) into dense vectors while preserving schema tags, enabling cross‑modal similarity search.
- Referential Resolver. Maps ambiguous user references (“the one from last week”) to concrete schema nodes using a combination of temporal heuristics and learned attention.
- Answer Generator. Conditions on the resolved nodes and the original query, producing natural language answers that can cite specific memory entries (e.g., "Your receipt from March 3 is stored as `Receipt_20240303.pdf`").
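The typed graph that the Schema Builder maintains can be sketched as a small data structure. This is a minimal illustration under assumed names (`SchemaNode`, `SchemaGraph` are hypothetical), not the paper's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class SchemaNode:
    """One typed node in the user's personal memory graph (illustrative)."""
    node_id: str
    node_type: str                                  # e.g. "Trip", "Appliance", "Receipt"
    attributes: dict = field(default_factory=dict)
    embedding: list = field(default_factory=list)   # dense vector from the Memory Encoder

class SchemaGraph:
    """Typed node store supporting schema-constrained lookup."""
    def __init__(self):
        self.nodes = {}

    def add_node(self, node: SchemaNode):
        self.nodes[node.node_id] = node

    def find(self, node_type: str, **attrs):
        # Filter by type first, then by exact attribute values: this is the
        # "semantic index" role described above, in miniature.
        return [n for n in self.nodes.values()
                if n.node_type == node_type
                and all(n.attributes.get(k) == v for k, v in attrs.items())]

graph = SchemaGraph()
graph.add_node(SchemaNode("Trip_001", "Trip", {"Location": "Kyoto"}))
matches = graph.find("Trip", Location="Kyoto")
```

Typing the nodes is what lets later stages ask "which `Trip` matches?" rather than searching every memory indiscriminately.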
How It Works in Practice
The SGM workflow can be visualized as a loop of four phases: Ingestion → Indexing → Query → Reasoning. Below is a step‑by‑step description of a typical user interaction:
- Ingestion. The user uploads a photo of a new kitchen appliance. The system extracts visual features, runs OCR on any label, and invokes the Schema Builder to create an `Appliance` node under the `Home‑Inventory` schema, attaching attributes like `type=blender` and `purchase_date=2024‑02‑15`.
- Indexing. The Memory Encoder stores the photo's embedding alongside the schema tags in a persistent vector store. Simultaneously, a lightweight metadata index (timestamp, modality) is updated for fast temporal look‑ups.
- Query. Later, the user asks, "Do I still have the blender I bought last year?" The Referential Resolver parses "the blender I bought last year," matches it to the `Appliance` node with the appropriate `purchase_date`, and retrieves the associated embedding.
- Reasoning. The Answer Generator attends to the retrieved node, the original query, and any relevant context (e.g., recent warranty updates) to produce a concise answer: "Yes, the blender you bought on Feb 15, 2024 is still in your inventory (ID `Appliance_20240215`)."
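The ingestion and resolution steps above can be sketched in a few lines. This is a toy version under assumed interfaces (`ingest`, `resolve` are hypothetical names, and the resolver uses only a temporal heuristic, not learned attention):

```python
import datetime

def ingest(schema, store, item):
    """Schema Builder + Memory Encoder in miniature: create a typed node
    and persist its embedding keyed by the node ID."""
    node_id = f"{item['type'].capitalize()}_{item['date'].strftime('%Y%m%d')}"
    schema[node_id] = {"type": item["type"], "date": item["date"]}
    store[node_id] = item["embedding"]
    return node_id

def resolve(schema, node_type, since):
    """Referential Resolver, temporal heuristic only: return the newest
    node of the requested type no older than `since`."""
    hits = [(nid, n["date"]) for nid, n in schema.items()
            if n["type"] == node_type and n["date"] >= since]
    return max(hits, key=lambda h: h[1])[0] if hits else None

schema, store = {}, {}
ingest(schema, store, {"type": "blender",
                       "date": datetime.date(2024, 2, 15),
                       "embedding": [0.1, 0.2]})
hit = resolve(schema, "blender", since=datetime.date(2023, 1, 1))
```

In the full system the resolver would combine this temporal filtering with learned attention over the schema graph; the heuristic alone already narrows "last year" to a handful of candidates.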
What sets SGM apart from prior retrieval‑augmented models is the schema constraint that guides every step. Instead of a flat similarity search, the system navigates a typed graph, dramatically reducing false positives (e.g., confusing a “blender” with a “smoothie recipe”) and enabling natural referential reasoning (“the one from last week”). Moreover, the architecture is modality‑agnostic: the same schema nodes can point to text notes, audio recordings, or video clips, allowing truly multimodal personal assistants.
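The difference between flat similarity search and schema-constrained search can be shown with a toy example. The vectors and helper names below are invented for illustration; the point is that type filtering happens before nearest-neighbour ranking:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def flat_search(memory, query_vec):
    # Baseline RAG-style retrieval: nearest neighbour over ALL entries.
    return max(memory, key=lambda m: cosine(m["vec"], query_vec))

def schema_search(memory, query_vec, node_type):
    # SGM-style retrieval: restrict candidates to schema-aligned nodes first.
    candidates = [m for m in memory if m["type"] == node_type]
    return max(candidates, key=lambda m: cosine(m["vec"], query_vec))

memory = [
    {"id": "Appliance_blender", "type": "Appliance", "vec": [0.9, 0.1]},
    {"id": "Recipe_smoothie",   "type": "Recipe",    "vec": [1.0, 0.0]},
]
q = [1.0, 0.05]  # a query embedding slightly closer to the recipe
flat_hit = flat_search(memory, q)        # picks the smoothie recipe
schema_hit = schema_search(memory, q, "Appliance")  # picks the blender
```

Flat search returns the recipe because it happens to be marginally closer in embedding space; the type constraint rules it out before similarity is ever computed, which is exactly the false-positive mode the blender-vs-smoothie example describes.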
Evaluation & Results
To validate ATM‑Bench and SGM, the authors conducted three families of experiments:
- Memory Retention Test. Models were asked to answer questions after varying “forget intervals” (1 day, 30 days, 180 days). SGM maintained >85% accuracy even after six months, whereas retrieval‑augmented baselines dropped below 60%.
- Referential Disambiguation Test. Queries containing pronouns, temporal cues, or visual descriptors were evaluated. SGM correctly resolved 92% of ambiguous references, a 27‑point gain over a vanilla RAG system.
- Multimodal Fusion Test. Questions that required combining text and image evidence (e.g., “What color was the dress I wore to the conference in Berlin?”) showed SGM achieving 88% exact‑match accuracy, compared to 71% for the best multimodal baseline.
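For readers reproducing numbers like these, the exact-match metric used in the fusion test is typically computed as follows. This is a standard normalization-then-compare sketch, not the paper's exact scoring script:

```python
def exact_match(pred: str, gold: str) -> bool:
    """Case- and whitespace-insensitive exact match (common QA convention)."""
    norm = lambda s: " ".join(s.lower().split())
    return norm(pred) == norm(gold)

def em_accuracy(preds, golds):
    """Fraction of predictions that exactly match their gold answers."""
    return sum(exact_match(p, g) for p, g in zip(preds, golds)) / len(golds)
```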
Beyond raw numbers, the experiments demonstrated two critical insights:
- Schema enforcement improves robustness. By constraining retrieval to schema‑aligned nodes, SGM avoided “hallucinations” that plagued other models, especially in long‑term settings.
- End‑to‑end training yields emergent reasoning. When the entire pipeline (Schema Builder → Memory Encoder → Answer Generator) was fine‑tuned on ATM‑Bench, the model learned to infer implicit relations (e.g., “the receipt for the flight I booked last month”) without explicit supervision.
Why This Matters for AI Systems and Agents
For practitioners building next‑generation assistants, the implications are immediate:
- Long‑term user engagement. Retaining accurate personal memories enables assistants to provide proactive, context‑aware suggestions—think “You haven’t replaced the air filter in your HVAC system since last year; would you like a reminder?”
- Reduced reliance on external databases. By embedding memory directly within the model’s architecture, developers can lower latency and avoid costly third‑party storage contracts.
- Improved safety and compliance. A schema‑driven memory can enforce data‑governance policies (e.g., GDPR “right to be forgotten”) by locating and deleting specific nodes without affecting unrelated data.
- Modular integration. The SGM components map cleanly onto existing AI infrastructure: the Schema Builder can run as a microservice, the Memory Store aligns with off‑the‑shelf vector databases, and the Reasoning Decoder plugs into any LLM orchestration layer.
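The compliance point above is worth making concrete: because memories are addressable schema nodes, a "right to be forgotten" request reduces to deleting one node, its embedding, and any edges that touch it. A minimal sketch, with an assumed dict-based graph layout:

```python
def forget(graph, vectors, node_id):
    """Delete one schema node, its stored embedding, and every edge that
    references it, leaving all unrelated memories intact."""
    graph["nodes"].pop(node_id, None)
    vectors.pop(node_id, None)
    graph["edges"] = [e for e in graph["edges"]
                      if node_id not in (e[0], e[2])]  # e = (src, relation, dst)

graph = {
    "nodes": {"Receipt_20240303": {"type": "Receipt"},
              "Trip_001": {"type": "Trip"}},
    "edges": [("Trip_001", "has_receipt", "Receipt_20240303")],
}
vectors = {"Receipt_20240303": [0.3, 0.7], "Trip_001": [0.5, 0.5]}
forget(graph, vectors, "Receipt_20240303")
```

Contrast this with a flat vector store, where honoring the same request means finding every embedding derived from the user's receipt, with no index that says which ones those are.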
In short, SGM offers a blueprint for building assistants that truly “know you” over months and years, moving the industry beyond the “stateless chatbot” paradigm.
What Comes Next
While the results are promising, the authors acknowledge several limitations that open fertile research avenues:
- Scalability of schema evolution. As user data grows, maintaining a coherent schema without manual curation becomes challenging. Future work could explore automated schema induction using meta‑learning.
- Privacy‑preserving memory. Storing personal embeddings raises security concerns. Techniques like homomorphic encryption or federated schema updates could reconcile personalization with privacy.
- Cross‑user knowledge sharing. Certain domains (e.g., health) benefit from aggregated insights while preserving individuality. Extending SGM to support controlled, anonymized knowledge graphs is an open problem.
- Real‑time adaptation. Current experiments batch updates; integrating continuous, streaming memory updates will be essential for truly interactive assistants.
Potential applications span from enterprise knowledge workers—who need a "personal archive" of meetings and documents—to consumer devices that act as lifelong companions. Companies interested in prototyping such capabilities can start by integrating an agent platform with a schema‑aware vector store, leveraging the open‑source reference implementation released alongside the paper.
References
Mei, J., Chen, J., Yang, G., Hou, X., Li, M., & Byrne, B. (2026). According to Me: Long‑Term Personalized Referential Memory QA. arXiv preprint arXiv:2603.01990.