Tearing down the Rewind app

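Rewind is an AI app that has been getting a lot of attention lately. It runs locally on the Mac and records the text and audio that pass across your screen, so you can find them again later by keyword search, as long as the text once appeared in some app, whether on a web page or in a chat. The demo shows it in action.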

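It neatly solves the "I know I saw this somewhere, but I can't remember where" problem. That sounds like magic, and this article reverse-engineers the technology behind it. The core workflow: every 2 seconds it captures the frontmost window, extracts the text with OCR, stores it in SQLite, and then merges the screenshots into an H.264 video file to save space. The main steps and tools are: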

Use accessibility APIs to identify the frontmost window.
Store the timestamps to a SQLite database in the user’s Library folder.
Use ScreenCaptureKit to hide disallowed windows, including private browser windows and a user-defined exclusion list.
OCR the screenshot on-device using Apple’s Vision framework, the same pipeline that powers Live Text.
Compress the screenshot sequence to an H.264 video with FFmpeg.
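
To make the storage and compression steps above concrete, here is a minimal Python sketch. It is not Rewind's actual code: the `frames` table, its columns, and the function names are hypothetical, and the OCR text is assumed to come from an earlier step. The FFmpeg arguments are one plausible way to turn one-frame-every-2-seconds screenshots into an H.264 file, and they assume an FFmpeg build that includes `libx264`.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open a database with a hypothetical table linking OCR text to capture time."""
    db = sqlite3.connect(path)
    db.execute(
        """
        CREATE TABLE IF NOT EXISTS frames (
            id          INTEGER PRIMARY KEY,
            captured_at REAL NOT NULL,  -- Unix timestamp of the screenshot
            app         TEXT NOT NULL,  -- name of the frontmost app
            ocr_text    TEXT NOT NULL   -- text recognized in the screenshot
        )
        """
    )
    return db


def record_frame(db, captured_at, app, ocr_text):
    """Store the OCR result for one screenshot."""
    db.execute(
        "INSERT INTO frames (captured_at, app, ocr_text) VALUES (?, ?, ?)",
        (captured_at, app, ocr_text),
    )
    db.commit()


def search(db, keyword):
    """Return (captured_at, app) for every frame whose text contains keyword."""
    # Escape LIKE wildcards so the keyword is matched literally.
    escaped = (
        keyword.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    )
    rows = db.execute(
        "SELECT captured_at, app FROM frames "
        "WHERE ocr_text LIKE ? ESCAPE '\\' ORDER BY captured_at",
        ("%" + escaped + "%",),
    )
    return rows.fetchall()


def ffmpeg_command(frame_pattern, output_path, seconds_per_frame=2):
    """Build an FFmpeg argument list that encodes numbered images as H.264."""
    return [
        "ffmpeg",
        "-framerate", f"1/{seconds_per_frame}",  # one input frame per N seconds
        "-i", frame_pattern,                     # e.g. frame_%05d.png
        "-c:v", "libx264",                       # H.264 encoder
        "-pix_fmt", "yuv420p",                   # widely playable pixel format
        output_path,
    ]
```

A search hit gives back a timestamp, which is what lets a tool like this jump to the matching moment in the compressed video. The sketch uses a plain `LIKE` query for portability; a full-text index would be a natural substitute on larger histories.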

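It also uses OpenAI's Whisper to transcribe speech to text. None of these tools is especially "advanced" on its own, but coming up with the idea, and then combining them sensibly into an excellent product, is the harder part. That may be the quality we most need in the AI era: more inventive, and better at synthesis.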