Text-to-dense-representation techniques have evolved from character bi-grams to advanced subword vectorizers in order to combat out-of-vocabulary (OOV) challenges such as adversarial attacks and typos. Common strategies include subword-level tokenization and decomposing unknown words into character n-grams so that neural networks can still be trained on them effectively.
Researchers at Google recently developed and unveiled a resilient and efficient text vectorizer dubbed “RETVec,” which will help defend Gmail users against malicious emails and spam.
RETVec
RETVec is an efficient, multilingual, next-generation text vectorizer with built-in adversarial resilience. It is robust to character-level manipulations such as:
Insertion
Deletion
Typos
Homoglyphs
LEET substitution
The RETVec character encoder consists of two layers (a rough sketch of the idea follows the list):
Integerizer layer
Binarizer layer
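To make those two layers concrete, here is a minimal, purely illustrative sketch in plain TensorFlow: the integerizer maps each character of a word to its Unicode code point, and the binarizer expands that integer into a fixed-width binary float vector. The function names, the 16-character word length, and the 24-bit width are assumptions chosen for illustration, not the actual RETVec implementation.

import tensorflow as tf

# Illustrative sketch only -- not the actual RETVec code.
def integerize(texts: tf.Tensor, max_chars: int = 16) -> tf.Tensor:
    """'Integerizer': map each character to its Unicode code point."""
    codepoints = tf.strings.unicode_decode(texts, "UTF-8")  # ragged int32
    return codepoints.to_tensor(default_value=0, shape=[None, max_chars])

def binarize(codepoints: tf.Tensor, bits: int = 24) -> tf.Tensor:
    """'Binarizer': expand each code point into a binary float vector."""
    shifts = tf.range(bits, dtype=codepoints.dtype)
    shifted = tf.bitwise.right_shift(codepoints[..., tf.newaxis], shifts)
    return tf.cast(tf.bitwise.bitwise_and(shifted, 1), tf.float32)

words = tf.constant(["spam", "sp@m"])       # LEET-style substitution
print(binarize(integerize(words)).shape)    # (2, 16, 24)

Because the encoding is computed directly from code points, a single swapped character (as in the "sp@m" example) changes only a few bits while the rest of the word's encoding stays identical, which is the intuition behind the robustness claims above.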
RETVec architecture (Source: arXiv)
RETVec uses a unique character encoder that handles UTF-8 efficiently, supporting 100+ languages without lookup tables or a fixed vocabulary. And because it is implemented as a layer, it fits seamlessly into any TensorFlow model without extra pre-processing.
The RETVec binarizer on its own already produces a useful word representation, but it is not competitive with state-of-the-art vectorizers. The researchers therefore augment it with a small embedding model, which boosts accuracy and lets RETVec outperform comparable approaches.
TensorFlow models can employ RETVec for string vectorization with just one line of code, and raw strings are handled by the built-in pre-processing.
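As a rough sketch of that one-line usage, here is what a Keras spam classifier over raw strings might look like. The import path and the RETVecTokenizer layer name are taken from the open-source retvec package, but treat them as assumptions and consult the project's README for the exact API.

import tensorflow as tf
from retvec.tf import RETVecTokenizer  # assumed import path from the open-source package

# Minimal sketch: a text classifier that consumes raw strings directly.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,), dtype=tf.string),
    RETVecTokenizer(sequence_length=128),            # the one-line vectorization step
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # e.g. spam vs. not spam
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

Because the tokenizer is just another layer, the saved model carries its own pre-processing and can be fed raw strings at inference time.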
Moreover, this system also works well for on-device mobile and web use cases, since it supports the following (a conversion sketch follows the list):
TFJS
TF Lite
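For the TF Lite path, a standard conversion of a Keras model that embeds the RETVec layer might look like the sketch below. This is an assumption-laden illustration using generic TensorFlow Lite options (the model variable is the classifier sketched above), not an official recipe; string-handling ops may require the Select TF ops fallback shown here.

import tensorflow as tf

# Sketch: convert the Keras model built above to TensorFlow Lite.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,  # native TFLite kernels
    tf.lite.OpsSet.SELECT_TF_OPS,    # fall back to TF ops (e.g. for string handling)
]
tflite_model = converter.convert()

with open("retvec_classifier.tflite", "wb") as f:
    f.write(tflite_model)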
Researchers tested RETVec against adversarial content using one of Google's spam filters. Replacing SentencePiece with RETVec improved spam detection by 38% at a 0.80% false-positive rate and reduced latency by 30%.
This suggests RETVec is competitive for real-world tasks, boosting confidence in its effectiveness.
A key open question is how to optimize RETVec for better multilingual capability, robustness, and smaller model size in large language models (LLMs). For smaller LLMs, where the vocabulary layer can account for over 20% of the parameters, RETVec eliminates that layer entirely.
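A back-of-the-envelope calculation shows why the vocabulary layer looms so large in small models. The vocabulary size, embedding width, and parameter count below are assumed, illustrative figures, not numbers reported by the researchers.

# Illustrative arithmetic only; all figures are assumptions.
vocab_size = 32_000         # typical subword vocabulary
embedding_dim = 1_024       # hidden width of a small LLM
total_params = 125_000_000  # a small ~125M-parameter model

vocab_params = vocab_size * embedding_dim   # 32,768,000 embedding-table parameters
print(f"vocabulary layer share: {vocab_params / total_params:.0%}")  # ~26%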
Yet using RETVec in generative models poses challenges: its 256-float embedding does not map directly onto a softmax output, so a new training method compatible with text generation is needed. Experiments with character-by-character decoding and a VQ-VAE model yielded inconclusive results. Future work will address these limitations and explore using RETVec as a word embedding to replace GloVe and word2vec, as well as training text-similarity models with its character encoder.
Installation
To install the latest TensorFlow version of RETVec, you can use pip:
pip install retvec
RETVec has been tested on TensorFlow 2.6+ and Python 3.8+.