Notes on Image Generation
Abstract
Current bottlenecks lie along AI/ML pipelines, which can be complex, extensive, and costly. As an illustration, a general sequence of elementary processes (a pipeline) of an AI/ML model looks like this:

Diagram: Data Collection → Preprocessing → Feature Extraction → Model Training → Evaluation → Deployment

In the case of image generation, an easily relatable use case of AI/ML, the trend seems to be toward optimizing the various stages of this process. A typical image generation pipeline looks like this:

Diagram: Input Data → Noise Addition → Generator Network → Discriminator Network → Output Image
Image generation is a good example of cross-dependency, given its multidisciplinary nature. Bottlenecks and pain points are popular expressions for a general idea: problems. And what can you do with problems? Fix them. Solving one issue rather than solving the problem seems to be the main pitfall for companies developing AI and tech products: resolving small issues keeps teams busy and yields immediate rewards, but the architectural problem may never be addressed, even though addressing it could resolve n more issues at once.
While architectural problems are of interest, they deserve an entire article (or several) to be addressed in more detail. For now, we may focus on one tangible question that most readers, across a range of technical and general audiences, can follow and leverage from the information shared here:
What efforts are in place to address existing issues?
0.0.0.1 Error Correction, Hashing & Maps
If we perceive the flow of information and processes through the pipeline as a signal, like a phone call, it is easy to intuit how it is naturally prone to errors, loss, miscalculations, and interference along the way. Most of us know the Telephone game, or the Whispers game, in which "messages are whispered from person to person and then the original and final messages are compared." Now imagine that everyone involved is dedicated to passing the message on accurately, but since they speak different languages, this will still influence how the message is transmitted.
Error correction codes have been around for decades, are in fact very important in actual phone calls, and are widely used in software development as well.
The concepts of error correction codes, Hamming codes, and hashing in AI/ML share their origins with information theory. Although they are broad and historically rich research topics, and in spite of their complexity, it is conceptually simple to gain some initial intuition about them:
Imagine error correction codes as a protective shield or a self-healing fabric wrapped around data (or around the message in the Telephone game analogy above). Just as a self-healing material can repair small tears or damage, error correction codes allow data to recover from minor corruption during transmission, or even help the decoder fill in the gaps if data is missing. Hamming codes, in this analogy, are like a specialized patch on this fabric, designed to detect and mend specific types of tears efficiently.
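To make the "specialized patch" concrete, here is a minimal sketch of a Hamming(7,4) code in plain Python (an illustration from first principles, not code from any of the cited works): four data bits are wrapped in three parity bits, and a single flipped bit can be located and mended from the parity checks.

```python
# Minimal Hamming(7,4) sketch: 4 data bits -> 7-bit codeword.
# A single flipped bit can be located via the parity checks and repaired.

def hamming74_encode(d):
    """d is a list of 4 data bits [d1, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4                      # parity over positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4                      # parity over positions 3, 6, 7
    p3 = d2 ^ d3 ^ d4                      # parity over positions 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]    # codeword positions 1..7

def hamming74_decode(c):
    """c is a 7-bit codeword; returns the 4 data bits, fixing one flipped bit."""
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]         # check covering positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]         # check covering positions 2, 3, 6, 7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]         # check covering positions 4, 5, 6, 7
    error_pos = s1 * 1 + s2 * 2 + s3 * 4   # 0 means no single-bit error
    if error_pos:
        c = c.copy()
        c[error_pos - 1] ^= 1              # mend the "tear"
    return [c[2], c[4], c[5], c[6]]

codeword = hamming74_encode([1, 0, 1, 1])
corrupted = codeword.copy()
corrupted[4] ^= 1                          # one bit flipped in transit
assert hamming74_decode(corrupted) == [1, 0, 1, 1]
```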
This allows us to consider how error correction techniques play a significant role in speeding up input/output operations and feature retrieval, and in optimizing memory usage.
Hashing, on the other hand, can be viewed as a form of data compression or approximation. It is akin to creating a thumbnail of an image: how much information actually needs to be passed around? A smaller representation that captures the essence of the original is sometimes enough. A smiley face emoji does not need to be yellow if the intent is only to answer "is it smiling?": yes or no.
In traditional computing, this thumbnail aims for uniqueness, like a fingerprint. But in AI/ML applications, it’s more like an impressionist painting, capturing the semantic essence rather than exact details.
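That contrast can be sketched in a few lines (the vectors, the 64 hyperplanes, and the perturbation below are illustrative assumptions, not data from this article): a cryptographic hash changes completely when the input changes slightly, while a random-hyperplane code in the SimHash style changes only a little.

```python
import hashlib
import numpy as np

rng = np.random.default_rng(0)

# Two "images" as feature vectors; the second is a slightly perturbed copy.
a = rng.normal(size=256)
b = a + 0.05 * rng.normal(size=256)

# Fingerprint-style hashing: any tiny change flips the digest entirely.
print(hashlib.sha256(a.tobytes()).hexdigest()[:16])
print(hashlib.sha256(b.tobytes()).hexdigest()[:16])

# Impressionist-style hashing: signs of random projections (SimHash-like).
planes = rng.normal(size=(64, 256))        # 64 random hyperplanes
code_a = (planes @ a > 0).astype(int)      # 64-bit binary code for a
code_b = (planes @ b > 0).astype(int)      # 64-bit binary code for b

# Similar inputs fall on mostly the same side of each hyperplane,
# so the Hamming distance between their codes stays small.
print("Hamming distance:", int(np.sum(code_a != code_b)))
```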
The evolution of these concepts in AI/ML, particularly in cross-modal hashing and joint Hamming spaces, is like creating a universal language or a common currency for different types of information. It is as if we were translating diverse data types (text, images, audio) into a shared alphabet in which similarities can be easily compared regardless of their original form.
All these techniques, at their core, are about efficient representation and manipulation of information. They share a common ancestry in information theory and the fundamental challenge of encoding, transmitting, and decoding data accurately and efficiently. Whether it’s protecting against errors, creating compact representations, or enabling cross-modal comparisons, these methods are different expressions of the same underlying principle: finding optimal ways to handle and process information in a world of imperfect channels, biases, and diverse data types.
In this shared-universe analogy, for specialized and customized models, the range of data is somehow related. Visual culture has eras, styles, and so on. Thus, if a model is specialized in generating Renaissance paintings, the Hamming distance between "A woman riding a horse on a prairie" and "A woman sitting on a porch chair" is small, and their corresponding binary codes should be close too (Chen et al. 2024b). Joint Hamming spaces therefore allow both text and images to be correlated, and the strength of their relationship to be updated as the model learns.
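A minimal sketch of what comparison in a joint Hamming space looks like, assuming image and text embeddings already exist and using, purely for illustration, one shared random projection in place of the learned hash functions of the cited methods:

```python
import numpy as np

rng = np.random.default_rng(1)
CODE_BITS = 32

# Stand-ins for learned embeddings of a caption and two images. In the cited
# methods these come from trained encoders; here image_1 is deliberately
# constructed to lie close to the caption, while image_2 is unrelated.
text_vec = rng.normal(size=128)
image_1 = text_vec + 0.1 * rng.normal(size=128)
image_2 = rng.normal(size=128)

# One projection shared by both modalities plays the role of the learned
# hash functions that map everything into a single Hamming space.
projection = rng.normal(size=(CODE_BITS, 128))

def to_code(v):
    """Binarize a vector into a CODE_BITS-long code in the joint space."""
    return (projection @ v > 0).astype(np.uint8)

def hamming(a, b):
    return int(np.sum(a != b))

text_code = to_code(text_vec)
print("caption vs image_1:", hamming(text_code, to_code(image_1)))  # small
print("caption vs image_2:", hamming(text_code, to_code(image_2)))  # larger
```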
Highlighting the most recent and most used model-editing methods:
- TIME: The TIME (Text-to-Image Model Editing) method can efficiently correct biases in image generator models by editing only about 1.95% of the model’s parameters (Mokady et al., 2023).
- ReFACT: ReFACT achieves even more precise results by tweaking just 0.25% of parameters while maintaining image quality (Ruiz et al., 2023); the parameter-editing idea behind both methods is sketched after this list.
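Neither paper's code is reproduced here, but the shared idea of editing only a small slice of a model's parameters can be sketched as follows; the toy model, its layer names, and the choice of which projection to unfreeze are assumptions made for illustration, not the actual TIME or ReFACT procedure.

```python
import torch.nn as nn

# Toy stand-in for a text-to-image model: most parameters sit in large blocks,
# while a small cross-attention-style projection exists alongside them.
# All names here are hypothetical.
model = nn.ModuleDict({
    "unet_blocks": nn.Sequential(*[nn.Linear(512, 512) for _ in range(20)]),
    "text_encoder": nn.Sequential(*[nn.Linear(512, 512) for _ in range(10)]),
    "cross_attn_kv": nn.Linear(512, 64),   # the small slice allowed to change
})

# Freeze everything, then unfreeze only the chosen slice before editing.
for p in model.parameters():
    p.requires_grad = False
for p in model["cross_attn_kv"].parameters():
    p.requires_grad = True

total = sum(p.numel() for p in model.parameters())
editable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"editable fraction: {100 * editable / total:.2f}%")  # well under 1% here
```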
0.1 Cutting-edge cross-modal hashing
- Visual-Textual Prompt Hashing (VTPH): Proposed by Chen et al. (2024), VTPH integrates both visual and textual prompt learning into cross-modal hashing, addressing limitations such as context loss and information redundancy in existing methods. This facilitates semantic coherence between diverse modalities and improves cross-modal retrieval performance.
- Multi-Grained Similarity Preserving and Updating (MGSPU): Proposed by Wu et al. (2024), this unsupervised cross-modal hashing approach combines multi-grained similarity information from local and global views to improve the accuracy of similarity measurements and preserve similarity consistency across modalities.
- Multi-Dimensional Feature Fusion Hashing (MDFFH): Chen et al. (2024a) proposed MDFFH, which constructs multi-dimensional fusion modules in the image and text networks to learn multi-dimensional semantic features of the data, integrating a Vision Transformer with convolutional neural networks to fuse local and global information.
- Semantic Embedding-based Online Cross-modal Hashing (SEOCH): Proposed by Liu et al. (2024), SEOCH addresses the challenges of online learning for streaming data in cross-modal hashing by mapping semantic labels to a latent semantic space and employing a discrete optimization strategy for online hashing.
- Hamming Code-Based Hashing: This approach (Hinton 2023) leverages the error-correction efficiency and simplicity of Hamming codes to enhance data integrity in hash-based systems.
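To make the retrieval use case concrete: once any of the methods above has produced binary codes in a shared Hamming space, a lookup reduces to element-wise comparison and counting, which is what makes Hamming-space retrieval fast. A short sketch with made-up codes (not outputs of the cited models):

```python
import numpy as np

rng = np.random.default_rng(2)

# Pretend these 64-bit codes came from one of the hashing methods above:
# a database of image codes and a single text-query code (made-up values).
image_codes = rng.integers(0, 2, size=(10_000, 64), dtype=np.uint8)
query_code = rng.integers(0, 2, size=64, dtype=np.uint8)

# Hamming distance to every stored image is an element-wise mismatch count,
# which is why retrieval in a joint Hamming space scales well.
distances = np.count_nonzero(image_codes != query_code, axis=1)
top5 = np.argsort(distances)[:5]
print("closest images:", top5, "at distances:", distances[top5])
```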