Description
MediaPipe Solution (you are using)
MediaPipe LLM Inference API
Programming language
TBD
Are you willing to contribute it
Yes
Describe the feature and the current behaviour/state
At the moment, for Gen AI use cases in the browser (e.g. Gemma 2B with the MediaPipe LLM Inference API), there's no way for a developer to know ahead of time whether the model can actually run on the device in a reasonable time. This is an issue because:
- For Gen AI, the model download is very large (almost 1.3GB for Gemma 2B, many times the recommended web app size).
- Running an inference on a low-spec device, or on a device with too many operations already running, may be very slow, or even crash the device (on mobile).
This leads to a subpar UX: a user may wait for a large model download only to find that it can't run inferences in a reasonable time on their device, or that it even crashes their device.
What if we ran a mini-benchmark ahead of the model download? This is beaufortfrancois@'s idea, which he suggested for Transformers.js: huggingface/transformers.js#545 (comment).
This would involve running the model code with zeroed-out weights.
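Very roughly, the benchmark would only need to time a handful of decode steps on the zero-weight graph and report a tokens-per-second figure. Here's a minimal TypeScript sketch of that measurement, assuming the runtime can expose a single decode step as a callback (the names below are placeholders, not existing MediaPipe APIs):

```ts
// Sketch only: `runOneDecodeStep` stands in for a hypothetical hook into
// the zero-weight model graph; it is not part of the current LLM Inference API.
async function measureDecodeTokensPerSecond(
    runOneDecodeStep: () => Promise<void>,
    steps = 16): Promise<number> {
  // Warm-up step so kernel/shader compilation isn't counted.
  await runOneDecodeStep();
  const start = performance.now();
  for (let i = 0; i < steps; i++) {
    await runOneDecodeStep();
  }
  const elapsedMs = performance.now() - start;
  return steps / (elapsedMs / 1000);
}
```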
Will this change the current API? How?
Yes, as we'd want to expose to developers the output of the mini-benchmark. This output may be abstracted behind a few dev-friendly performance buckets, e.g. high, medium, low. Developers could overlay their own logic based on that output.
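For illustration only, the developer-facing surface could look something like the sketch below. The benchmark-related names (runMiniBenchmark, MiniBenchmarkResult, modelArchitecture) don't exist in @mediapipe/tasks-genai today; they're assumptions about what such an API might expose, layered on the existing FilesetResolver/LlmInference entry points.

```ts
import {FilesetResolver, LlmInference} from '@mediapipe/tasks-genai';

// Hypothetical types and method, shown only to illustrate the proposal.
type PerformanceBucket = 'high' | 'medium' | 'low';

interface MiniBenchmarkResult {
  bucket: PerformanceBucket;
  tokensPerSecond: number;  // measured with zeroed-out weights
}

async function shouldOfferOnDeviceGemma(): Promise<boolean> {
  const genai = await FilesetResolver.forGenAiTasks(
      'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm');

  // Hypothetical call: runs the model graph with zeroed-out weights,
  // so no 1.3GB download happens before we know the device can cope.
  const result: MiniBenchmarkResult = await LlmInference.runMiniBenchmark(
      genai, {modelArchitecture: 'gemma-2b'});

  // Developers overlay their own logic on top of the bucket.
  return result.bucket !== 'low';
}
```

An app could run something like this before even showing the download prompt, and fall back to a server-side model when the bucket comes back as low.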
Who will benefit with this feature?
All developers for on-device/in-browser use cases
Please specify the use cases for this feature
All on-device/in-browser use cases
Any Other info
No response