Detects tables in an image and returns the precise coordinate points of the detected table, while accurately extracting and redrawing the entire table to preserve its structure. The model identifies the coordinates of each individual cell and performs Optical Character Recognition (OCR) on each cell separately to capture the data effectively.
Tools/Technologies:
Computer vision models- Yolo, Microsoft Tatr, Tablenet (based on OpenCV). OCR models- Pytesseract, Paddle-OCR, Easy-OCR, DocTR
Technical Details:
•Gradio UI for table detection.
•Compares YOLO, Microsoft TATR, and Tablenet for detection.
•YOLO detects tables in images, returning coordinates and confidence scores for each detected table. TATR analyzes table structures and outputs the coordinates and confidence scores for table components.
•Tablenet utilizes OpenCV independently for image preprocessing, enhancing table detection accuracy by adjusting image properties.
•Returns cropped detected tables with confidence scores.
•Table images are extracted into CSV format.
•Cells in the table image are cropped using coordinates from the Tablenet method.
•OCR is performed on each cell using Pytesseract, Paddle-OCR, Easy-OCR, and DocTR, returning the best OCR output based on the confidence score.
•With the cell coordinates and their OCR outputs, a Pandas DataFrame is created and converted into CSV.
Potential Use Cases:
•Document digitization for automated data entry.
•Automated report generation from tables in research papers.
•Content extraction from financial statements for analysis.
•Academic research data extraction from publications.
S.No
Use Case
Description
Tools/Technologies
Potential Use Cases
Technical Details
Demo Links
01
Table Detection and Extraction
Detects tables in an image and returns the precise coordinate points of the detected table, while accurately extracting and redrawing the entire table to preserve its structure. The model identifies the coordinates of each individual cell and performs Optical Character Recognition (OCR) on each cell separately to capture the data effectively.
Computer vision models- Yolo, Microsoft Tatr, Tablenet (based on OpenCV). OCR models- Pytesseract, Paddle-OCR, Easy-OCR, DocTR
•Document digitization for automated data entry.
•Automated report generation from tables in research papers.
•Content extraction from financial statements for analysis.
•Academic research data extraction from publications.
•Gradio UI for table detection.
•Compares YOLO, Microsoft TATR, and Tablenet for detection.
•YOLO detects tables in images, returning coordinates and confidence scores for each detected table. TATR analyzes table structures and outputs the coordinates and confidence scores for table components.
•Tablenet utilizes OpenCV independently for image preprocessing, enhancing table detection accuracy by adjusting image properties.
•Returns cropped detected tables with confidence scores.
•Table images are extracted into CSV format.
•Cells in the table image are cropped using coordinates from the Tablenet method.
•OCR is performed on each cell using Pytesseract, Paddle-OCR, Easy-OCR, and DocTR, returning the best OCR output based on the confidence score.
•With the cell coordinates and their OCR outputs, a Pandas DataFrame is created and converted into CSV.
02
Health Care Doc Chat
Processes a PDF document as input and uses Retrieval-Augmented Generation (RAG) to answer queries related to the content of the uploaded PDF. It converts the PDF into embedding chunks using an embedding model. When a query is made, the model retrieves the relevant chunks from the embedded data and generates an accurate answer based on the retrieved information. This solution is tailored for healthcare-related PDFs, such as medical reports, clinical guidelines, or patient records.
LlamaIndex, Embedding models, LLMs, RAG
•Clinical Decision Support Systems that provide healthcare professionals with quick answers from medical reports and guidelines.
•Patient Record Management for efficient retrieval of relevant information during consultations or treatment planning.
•Medical Research tools enabling researchers to query large datasets of clinical guidelines and reports for insights.
•Healthcare Compliance Monitoring to ensure that clinical practices align with the latest guidelines by retrieving relevant information from PDFs.
•ChatGPT-4 pdf chat clone on our local server.
•Upload any PDF and ask queries on that PDF
•used LangChain for performing RAG (Retrieval-Augmented
•Generation)
•Used Embedding models for generating embeddings.
•Used chroma for storing embeddings , also stored embedding on
•local machine for avoiding dependency of vector store.
•Streamlit for building UI.
03
Payslip Doc Chat
Processes a PDF payslip as input and utilizes Retrieval-Augmented Generation (RAG) to answer queries related to the content of the uploaded payslip. It converts the PDF into embedding chunks using an embedding model. When a query is made, the model retrieves relevant chunks from the embedded data and generates accurate answers based on the retrieved information. This solution is specifically tailored for payslip-related PDFs, enabling users to gain insights into their earnings, deductions, and other relevant details.
LlamaIndex, Embedding models, LLMs, RAG
•Personal Finance Management tools that allow users to analyze their payslips for budgeting and financial planning.
•Payroll Inquiry Systems enabling employees to ask specific questions about their earnings and deductions directly from their payslips.
•Tax Preparation Applications that help users understand tax deductions and withholdings detailed in their payslips.
•HR Support Services for answering common payroll-related questions based on employees' uploaded payslips.
•ChatGPT-4 pdf chat clone on our local server.
•Upload any PDF and ask queries on that PDF
•used LangChain for performing RAG (Retrieval-Augmented
•Generation)
•Used Embedding models for generating embeddings.
•Used chroma for storing embeddings , also stored embedding on
•local machine for avoiding dependency of vector store.
•Streamlit for building UI.
04
Invoice Doc Chat
It processes PDF invoices to efficiently respond to user queries about their content. By converting invoices into embedding chunks using an embedding model, it leverages Retrieval-Augmented Generation (RAG) to extract and provide precise answers. When a query is submitted, the model retrieves relevant information from the embedded chunks, allowing users to gain insights into billing details, payment statuses, itemized charges, and other crucial information found within the invoice. This tailored solution streamlines invoice management and enhances financial data accessibility.
LlamaIndex, Embedding models, LLMs, RAG
•Expense Management Tools that help users track and understand itemized charges and payment statuses from their invoices.
•Financial Auditing Applications where auditors can query invoices for specific details during financial reviews.
•Budget Tracking Systems that analyze invoices to provide insights into spending patterns and cost management.
•Subscription Management Services allowing users to query invoices related to recurring payments and changes in billing.
•ChatGPT-4 pdf chat clone on our local server.
•Upload any PDF and ask queries on that PDF
•used LangChain for performing RAG (Retrieval-Augmented
•Generation)
•Used Embedding models for generating embeddings.
•Used chroma for storing embeddings , also stored embedding on
•local machine for avoiding dependency of vector store.
•Streamlit for building UI.
05
Virtual Outfit Try-On with Pose-Aware Fitting and Realistic Visualization
This diffusion model takes an image of a person and an outfit image to visualize how the person would look wearing that outfit. It accurately detects and analyzes the pose of the person, ensuring a realistic representation. The model then seamlessly fits the outfit to the individual, adjusting for body proportions and pose dynamics. This technology offers a novel way to experience fashion, allowing users to see themselves in various outfits without trying them on physically.
Diffusers, OpenPose models, human parsing model, ViT-large, OOT model
•Virtual Fitting Rooms for online retailers, allowing customers to visualize outfits on themselves before purchasing.
•Fashion Design Tools enabling designers to showcase their collections on diverse body types and poses.
•Event Preparation Platforms for individuals to visualize outfits for specific occasions like weddings, parties, or interviews.
•Clothing Rental Services that provide virtual try-ons to enhance user confidence in renting apparel.
•Virtual Try on of clothes ,Try on any cloth by uploading image of person and image.
•Used Openpose model for dectecting person's poses.
•Used parsing model for parsing human separately from entire image.
•Used google/ViT-large for image captioning.
•Used OOT model for attaching person image and cloth image.
•Have separate tabs for upper clothing , lower clothing, full-dress clothing and upper-lower combined .
06
QR Extraction & Scanner
Finds QR and Bar codes in images and extracts them, then scans to give their URLs. Includes tabs for QR, Bar code, and OCR for extracting card details.
•Mobile Payment Solutions allowing users to scan QR codes for quick and secure transactions at retail locations.
•Event Ticketing applications that enable users to scan QR codes on tickets for entry verification at events.
•Business Card Management tools that extract contact details from scanned cards and add them to digital address books.
•Marketing Campaigns utilizing QR codes to direct users to promotional content or special offers quickly.
•Gradio Tabs UI for QR,Bar code scanner and card ocr.
•Used OpenCV for QR detection and decoding.
•Used Pyzbar for bar code reading.
•Used paddle-ocr for Card ocr,it returns ocr line by line separately.
07
SQL Coder: Text To SQL
Converts text queries to SQL code, enabling users to interact with databases using plain language instead of writing SQL code.
VLLM, paulml/sqlcoder-7b-2-awq (Text to SQL model)
•Business Intelligence Tools that allow non-technical users to query databases using natural language for generating reports and insights.
•Data Analysis Platforms enabling analysts to retrieve and manipulate data without needing extensive SQL knowledge.
•Educational Tools for teaching database management and SQL concepts by allowing students to practice querying databases in plain language.
•Chatbot Integrations where users can ask questions about data, and the bot converts those queries into SQL for backend processing.
•Text to SQL code LLM demo UI.
•Converts natural language descriptions into SQL queries.
•Generates SQL queries for data retrieval and reporting tasks.
•Simplifies complex query creation for data analysis purposes.
•Used VLLM for faster inference.
•Used paulml/sqlcoder-7b-2-awq model
08
Google-calendar agent
An AI agent chat interface that creates, deletes and lists events within a particular time limit. It checks if a person is free or busy, and lists available free schedules on Google Calendar using API calls.
Function calling, LLMs
•Personal Assistant Applications that help users manage their schedules by creating, deleting, and listing events based on their availability.
•Team Collaboration Tools allowing team members to quickly find common free times for meetings, enhancing productivity and communication.
•Event Planning Platforms where users can easily check participants' availability and schedule events without lengthy back-and-forth communications.
•Online Booking Systems enabling businesses to manage appointments efficiently, ensuring optimal use of resources and time slots.
•Developed an AI agent chat interface that interacts with Google Calendar through API calls.
•Enabled functionalities such as creating events, listing events for specific time periods, checking availability, and finding free schedules.
•Leveraged function calling with LLMs to process user requests and manage calendar data efficiently.
•Designed the agent to interactively request missing information from users, ensuring accurate and complete event management.
•Implemented function calling to dynamically adapt to varying user input, enhancing the flexibility and responsiveness of the agent.
•Integrated error-handling mechanisms within function calls to manage API limitations and ensure seamless user experience.
09
Voice to Voice
This is an interactive voice-to-voice system that allows users to engage in natural conversations. The system captures spoken input through microphone, transcribes it into text using a speech-to-text model (OpenAI's Whisper). This text is then processed by an LLM(llama-3.1-70b-versatile) to generate a contextually appropriate response. Finally, the response is converted back into speech using a text-to-speech model(facebook/mms-tts-eng), enabling verbal communication with the system.
Speech Recognition: openai/whisper-large-v2 Language Model (LLM) Response Generation: llama-3.1-70b-versatile (by Groq) Text-to-Speech (TTS): facebook/mms-tts-eng User Interface (UI): Framework: Gradio
•Smart Home Interfaces enabling users to control devices and manage their home environment through voice commands and responses.
•Interactive Gaming Experiences where players can engage with characters and make choices through natural voice dialogue, enhancing immersion.
•Travel Assistants that provide real-time information and recommendations during trips, allowing travelers to ask for help and receive verbal guidance.
•Research and Data Collection tools that enable users to engage in voice-based surveys or interviews, capturing responses in a natural conversational format.
•Voice to voice interaction with llm demo.
•Speech Recognition: Uses OpenAI's whisper-large-v2 for accurate transcription.
•LLM Response Generation: Employs Groq's llama-3.1-70b-versatile for contextual responses.
•Text-to-Speech (TTS): Utilizes Facebook's mms-tts-eng for natural speech output.
•UI: Built with Gradio for an interactive user experience.
•Integration: Combines these components for seamless voice interaction.
10
HR Bot
This is an interactive voice call conversational AI system designed to confirm basic candidate details and conduct pre-interview calls. The system utilizes Twilio to initiate calls and manages the conversation using the Groq 'llama-3.1-70b-versatile' model. Twilio converts user speech to text, which is then sent to the LLM to generate a response. The response is relayed back to Twilio, which plays it during the call, facilitating a natural conversation. The LLM is specifically prompted to emulate an HR representative and ask relevant questions.
Twilio, Groq, Langchain, Flask
•Automated Candidate Screening that streamlines the hiring process by conducting pre-interview calls to verify candidate information and assess basic qualifications.
•Recruitment Process Optimization where HR departments can reduce time spent on initial candidate outreach by automating the confirmation of details.
•Interview Scheduling Assistants that facilitate the coordination of interviews between candidates and HR representatives by handling preliminary questions.
•Job Fair Virtual Assistants that engage with candidates during online job fairs, answering questions and collecting their information efficiently.
•Call, Speech Transcription and Text to Speech : Twilio
•Language Model (LLM): Response Generation: llama-3.1-70b-versatile (by Groq)
•Langchain: For prompt template and LLM Memory
•User Interface (UI): Flask
11
Realtime Speech Recognition
Real-time speech recognition converts spoken language into text instantaneously, enabling fast, accurate voice-to-text applications.
Whisper, librosa, faster-whisper
•Live transcription services, voice assistants, real-time captioning for meetings, voice commands for apps, hands-free communication tools.
•Real-time models use advanced neural networks, trained on large speech datasets. They capture audio in short intervals, process it with ASR models, and output text in real time with minimal latency.
12
Llama-Index App Is All You Need
This is an easiest way to use Agentic RAG in any enterprise.. Flexible Dashboard to choose desired LLM from Respective providers such as openai, Groq and provide custom System Prompt with Websearch, Code and Image Generation.
•Enterprises can use this platform to automate customer interactions by selecting an appropriate LLM for handling common customer queries
•Companies can use the RAG platform to gather market insights by querying the web for the latest data on competitors, industry trends, and customer feedback.
•Marketing teams can generate creative content, from blog posts to advertisements.
•Development teams can leverage code generation to write, optimize, and debug code.
•It can use Knowledge Base
•Web search - duckduckgo and Wikipedia,
•Code Generation - E2B
•Image Generation - Stability AI
13
Video QA
Accepts a video as input, converts the video into audio, then transcribes the audio into text. It can answer questions related to the content mentioned in the video using Retrieval-Augmented Generation (RAG) for more detailed and accurate responses.
LLM on VLLM, Whisper, VAD (Voice Activation Detection), LlamaIndex for RAG, Streamlit
•Video Content Analysis for educational purposes, allowing students to ask questions about lecture videos.
•Customer Support using video tutorials for product demonstrations and enabling users to ask questions on specific parts.
•Research Data Collection where recorded interviews are transcribed and analyzed for thematic insights.
•Webinars where attendees can interactively ask questions about the video content during or after the presentation.
•Video-chat on our local server.
•upload any video and ask queries on that video's content.
•Video to audio ,Transcription with speaker number using WhisperX.
•Used llama-index for RAG(Retrievel Augmented Generation) on transcription .
•Streamlit for UI
14
Advanced Vision-Language Model for Image Querying
A vision-language model that accepts an image as input and provides detailed answers to queries about the image. It supports multiple output formats, including JSON and markdown, and offers thorough image descriptions. It leverages the current best model for image-to-text use cases, ensuring accuracy and versatility in interpretation.
CogVLM (vision language model), Streamlit
•Educational Tools for students to query images from textbooks or lecture slides for better understanding.
•Social Media Management enabling users to analyze and generate captions for images before posting.
•E-commerce Platforms where customers can upload product images and ask for information or recommendations.
•Medical Imaging allowing healthcare professionals to upload images and ask specific questions about diagnoses.
•Chainlit UI for image + text to text model Cogvlm.
•Cogvlm,a vision language model.
•upload any image and ask queries on the image.
•OCR ,Image captioning ,Visual Question Answering ,Image-Driven Text Generation are major usecase.
15
Newspaper Image Classification
A model trained to identify advertisements in Arabic newspapers using a supervised fine-tuning approach. The model is based on ViT-base (Vision Transformer) and utilizes a classification method to accurately separate target advertisements from other content in the newspaper images.
Image classification, Google ViT-base, SFT (Supervised Fine Tuning)
•Ad Revenue Optimization for Arabic newspapers by identifying and classifying ads to enhance placement strategies.
•Content Analysis Tools for publishers to assess the volume and effectiveness of advertisements in printed materials.
•Market Insights platforms that analyze trends in advertising across different regions in Arabic-speaking countries.
•Automated Clipping Services for ad agencies to extract and archive advertisements from newspaper images.
•Trained using Google's ViT-base (Vision Transformer): Utilized the ViT-base model, a state-of-the-art transformer architecture for image classification tasks.
•Finds and Classifies Target Ads: Accurately identifies and differentiates target ads from a diverse set of other images.
•Supervised Fine-Tuning Method: Applied a supervised fine-tuning approach to optimize the model's performance on a specific dataset.
•Custom Dataset Preparation: The model was trained on a carefully curated dataset, ensuring a balanced representation of both target ads and non-ad images.
•High Accuracy and Robustness: Achieved high accuracy in classification tasks, with robustness against variations in ad formats and image quality.
16
TTS: Text To Speech
This Gradio application features a user-friendly tabbed interface for exploring four of the best open-source text-to-speech (TTS) models. Users can select from a variety of models, each showcasing unique voice qualities, languages, and capabilities. The interface allows users to input text, select their desired TTS model, and listen to the generated speech output in real time. This application aims to provide a seamless and interactive experience for those looking to experiment with different TTS technologies for various applications.
Parlor-TTS, XTTS_v2, Suno-Bark, Suno-Bark (with sample voice as input)
•Language Learning Tools that allow users to hear pronunciations and intonations in different languages using various TTS models.
•Content Creation Platforms where writers can listen to their text read aloud, helping them refine tone and clarity.
•Audiobook Production tools that enable authors to generate voice recordings of their texts with multiple voice options.
•Voiceover Services for video creators and marketers to generate quick voiceovers without needing professional recording equipment.
•Gradio UI for text to speech models.
•Parlor-TTS :lightweight text-to-speech (TTS) model that can generate high-quality, natural sounding speech in the style of a given speaker (gender, pitch, speaking style, etc)
•Xtts_v2 : Text to speech model, it can also clone the voice and supports 17 language.
•Suno-bark : transformer-based text-to-audio model. It can generate highly realistic, multilingual speech as well as other audio - including music, background noise and simple sound effects.
•Suno-bark with voice input.
17
Accelerating LLM Inference with Multi-Token Prediction
A Facebook model demo implementing the paper 'Better and Faster LLM via Multi-Token Prediction.' This model enables faster inference through self-speculative decoding.
•Real-Time Chatbots that provide instantaneous responses in customer service applications, enhancing user experience.
•Social Media Management Tools that assist users in crafting posts or replies with fast, context-aware suggestions.
•Virtual Assistants that can interpret user commands and respond rapidly, facilitating smoother user interactions.
•Education Platforms that offer instant feedback on student inputs, enhancing engagement in learning environments.
•Gradio UI for meta's multi-token prediction model.
•Implementation of research paper ""Better & Faster Large Language Models via Multi-token Prediction"".
•Predicts multiple tokens at the same time.
•3x faster than regular next-token prediction models.
•Uses torch run for inference.
18
Minference: Optimized LLM Inference for Token-Rich Inputs
Speeds up the inference of LLMs when processing inputs with more tokens. Supports specific models: LLaMA-3-1M, GLM4-1M, Yi-200K, Phi-3-128K, and Qwen2-128K.
Minference, LLMs
•High-Volume Data Processing for analytics platforms that need to quickly analyze large datasets with extensive textual inputs.
•Conversational AI applications that handle lengthy user queries, providing timely and relevant responses in real time.
•Content Moderation Systems that can rapidly evaluate large volumes of user-generated content for compliance and safety.
•Document Summarization Tools that efficiently condense long reports, articles, or legal documents into concise summaries.
•Gradio chat interface UI for Minference(Million token inference).
•Speeds up the inference of llms when processing the inputs with more tokens.
•Supports LLaMA-3, GLM4, Yi, Phi-3, and Qwen2 models.
•Implementation of the paper "MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention" by Microsoft.
19
STT: Speech To Text
Demo with different tasks using speech to text, including audio file to transcription, microphone to transcription, live stream transcription, YouTube link to transcription, translation, and transcription with time-stamping.
Faster Whisper
•Transcription Services that convert audio files from meetings or interviews into written text for documentation.
•Real-Time Captioning for virtual events and webinars, providing immediate text representation of the spoken content.
•YouTube Video Enhancements allowing creators to automatically generate transcripts from their videos, making content more accessible.
•Meeting Summarization Tools that capture discussions in real time, enabling participants to focus on conversation rather than note-taking.
•Gradio UI Tabs for Speech to text tasks.
•Audiofile to transcription : upload audiofile it will generate the transcription
•Microphone to transcription : Turn on mic to speak then it will generate transcription.
•Live stream transcription : transcriptions generated lively when started to speaking.
•Youtube link transcription : paste the youtube video link for transcription.
•Translation : speak in any language , translated English transcription will be generated.
•Time-stamping : transcription of audio with start and end time of each word is mentioned.
20
An API Endpoint for Efficient Text-to-Code Generation
Creates a VLLM server for the Codestral model, establishing an endpoint for using the model similar to OpenAI API calls. Codestral model is best for coding-related tasks, particularly text to code.
VLLM, Codestral
•Automated Code Generation Tools that allow developers to generate boilerplate code or entire functions by describing functionality in plain language.
•Debugging Assistance tools that help developers identify and fix code issues by translating user descriptions of problems into suggested fixes.
•Low-Code/No-Code Platforms that empower non-technical users to build applications by generating code through natural language inputs.
•Code Review Platforms where users can ask questions about code performance or functionality, receiving generated code improvements in response.
•Vllm server for coding model , codestral
•Used vllm for creating openai-like api server.
•Vllm provides an HTTP server that implements OpenAI’s Completions and Chat API.
•Used codestral ,the current best coding model for coding related queries.
•Codestral ,helps developers write and interact with code through a shared instruction and completion API endpoint.
21
Speaker Identification and Named Transcription for Multi-Speaker Audio
Utilizes WhisperX diarization to identify the number of speakers in an audio recording and capture their dialogues. This system allows for the naming of speakers and generates a transcription that includes speaker names along with their respective dialogues.
WhisperX, Diarization, Streamlit
•Meeting Transcription Services that automatically identify and label speakers in recorded business meetings, enhancing documentation and follow-up processes.
•Podcast Production Tools that provide detailed transcriptions of discussions, attributing dialogue to individual hosts or guests for improved listener engagement.
•Interview Analysis Applications that transcribe interviews while distinguishing between the interviewer and interviewee, aiding in content review and analysis.
•Focus Group Research where discussions are recorded and transcribed with speaker identification for better analysis of participant contributions.
•Used whisper_large-v3 for audio transcription.
•Used whisperX for diarization,finds number of speakers and start,end of their words.
•Used logic for tagging speakers to their sentences.
•Used Streamlit for UI.
22
Agentic-RAG
Employs LlamaIndex's built-in agentic RAG (Retrieval-Augmented Generation) methods, including L1—a query engine mechanism for selecting tools, L2—directly passing tools to the LLM, L3—a ReAct agent (Reason and Action agent), and L4—support for processing multiple PDFs on ReAct agent.
LlamaIndex, Agents, tool calling, RAG
•Document Analysis Platforms that allow users to query and retrieve information from multiple PDFs using the agentic RAG methods for efficient data extraction.
•Legal Research Tools enabling lawyers to search through extensive case files and legal documents, leveraging RAG to provide accurate, context-aware responses.
•Healthcare Documentation Systems that help medical professionals quickly find relevant information in patient records and clinical guidelines across multiple PDF formats.
•Content Management Systems allowing users to interact with a library of PDFs, using RAG methods to find specific information and generate summaries.
•Explored types of agentic RAG methods in llama-index.
•L1-method : Uses Query engine technique for assisting llm for tool calling.
•L2-method : Directly pass tools to llm.
•L3-method : ReAct agent (Reason and action agent) for RAG.
•L4-method : ReAct agent for multiple pdfs.
23
Graph-Based Interface for Designing and Executing Diffusion Pipelines
Allows designing and executing advanced diffusion pipelines using a graph/nodes/flowchart-based interface. Provides full control over the pipeline of diffuser models like CLIP, UNets, ControlNets, VAE.
Diffusers, ComfyUI
•Art and Design Applications that enable artists to create complex visual outputs by chaining together different diffusion models in customizable workflows.
•Personalized Content Creation tools for influencers and marketers that let users create unique visuals tailored to their branding by manipulating diffusion parameters.
•Photography Enhancement Software using advanced diffusion techniques to transform and stylize images based on user-defined parameters and models.
•Product Visualization Tools for e-commerce platforms that allow users to create and view product representations in different styles and contexts using diffusion models.
•Comfy UI clone on our local server.
•Node-Based Interface: Visual workflow design with nodes/graphs.
•Model Support: Compatible with SD1.x, SD2.x, SDXL, Stable Video Diffusion, Stable Cascade, SD3, and Stable Audio.
•Efficient Execution: Re-executes only changed workflow parts.
•Customizable Pipeline: Full control over CLIP, UNets, ControlNets, and VAE components.
24
UI for Training Diffuser Adaptors
A Gradio UI for training diffuser adaptors like LoRA, DreamBooth, and textual inversion. It includes tabs for specifying training parameters, captioning, and testing trained models.
Diffusers, LoRA, DreamBooth, Textual inversion
•Custom AI Model Training for artists and designers to create personalized diffusion models that reflect their unique styles and preferences.
•Interactive Game Development where developers can train models to generate characters or assets specific to their game’s style or narrative.
•Fashion Design Software enabling designers to train models on specific clothing styles or trends, generating realistic visual representations for collections.
•Virtual Photography tools that allow users to create and test models that generate photorealistic images based on their specified training parameters.
•Kohya_ss Gradio UI for diffuser models.
•Training Methods: Supports Lora, textual inversion, and DreamBooth techniques for model training.
•Parameter Configuration: Includes all parameters necessary for configuring and fine-tuning training.
•Dataset Management: Allows users to create and manage image datasets directly within the UI.
•Image Captioning: Provides functionality for image captioning as part of the training process.
•End-to-End Training: Facilitates a complete training workflow from dataset creation to model evaluation.
25
Form Extractor Prototype
Develops a user interface (UI) for uploaded forms in JPG or PDF format. This system replicates the form structure in JSON and generates UI based on that structure.
Claude or OpenAI, JavaScript
•Digital Form Automation that converts paper forms into interactive online forms, streamlining data collection and processing for businesses.
•Survey and Feedback Tools allowing organizations to digitize paper-based surveys and gather responses more efficiently through a user-friendly interface.
•Document Management Systems that enable users to upload forms, automatically extracting and structuring data for easy retrieval and analysis.
•Human Resources Platforms enabling the digitization of employee onboarding forms, making it easier to gather and store necessary information.
•Upload Forms: Users upload image or PDF forms to the system.
•Extract Data: Claude AI processes the forms to extract details and structure them.
•Save as JSON: Extracted details are saved in a structured JSON format.
•Generate UI Code: Claude AI generates JavaScript code to create a user interface based on the JSON data.
•Interactive Form: The generated UI allows users to view and edit the extracted form details.
26
Comprehensive Image Annotation UI
It is an user interface (UI) for image annotation, which is essential for creating datasets for image detection models. This tool effectively manages annotation tasks for large datasets and serves as an alternative to Roboflow.
CVAT.ai
•Machine Learning Projects that require labeled datasets for training image detection models, helping researchers and developers annotate images efficiently.
•Autonomous Vehicle Development where annotated images are crucial for training models to recognize and respond to various road signs and obstacles.
•Medical Imaging Analysis enabling healthcare professionals to annotate images of X-rays or MRIs, facilitating the development of diagnostic models.
•Agriculture Technology Solutions for annotating images of crops or livestock to develop models that monitor health, growth, and yield.
•CVAT AI clone on our local server
•Image Annotation Tool: Use the tool to annotate images for computer vision model training.
•Task Management: Efficiently manage and organize image annotation tasks for large datasets.
•Format Conversion: Export annotated data in the required format for training image detection models.
•Dataset Preparation: Generate datasets that are ready for use in training and validating image detection models.
27
WebUI (automatic-1111)
A GUI for using diffuser models, providing full control over the pipeline of diffuser models like CLIP, UNets, ControlNets, VAE, and adapters like LoRA and textual inversion.
Diffusers
•Interactive Design Platforms enabling graphic designers to apply different styles and techniques to their projects through an intuitive graphical interface.
•Prototyping Tools for AI Art that let users quickly iterate on visual concepts and styles, using various diffusion models to achieve desired effects.
•Virtual Reality Content Creation where developers can create immersive environments by leveraging diffusion models to generate realistic visuals.
•Film and Animation Production enabling animators to generate backgrounds or character designs quickly, streamlining the creative process.
•Automatic-111 WEBUI clone on our local server.
•Model Management: Easily switch between and manage different Stable Diffusion models.
•Image Generation: Configure parameters for generating images based on text prompts, resolution, and seed settings.
•Inpainting: Edit specific areas of generated images for refinement.
•ControlNet Integration: Utilize ControlNet models for additional control over image generation.
•Custom Plugins: Extend functionality with customs adapters like LoRAs.
28
Versatile Framework for Building Voice Conversational Agents
A framework for building voice conversational agents, such as personal coaches, meeting assistants, customer support bots, intake flows, and social companions.
Daily.co, Groq, Text-to-speech, Whisper
•Personal Health Coaches that provide users with fitness tips, meal planning advice, and motivation through interactive voice conversations.
•Meeting Assistants that help users schedule, manage, and summarize meetings by taking voice commands and providing real-time updates.
•Onboarding Solutions for new employees, guiding them through company policies and procedures using a conversational voice interface.
•Telehealth Services where voice agents assist patients in scheduling appointments, accessing medical records, and providing health advice.
•pipecat clone for video call chat with LLM.
•Voice to Voice: Uses Pipecat to manage the end-to-end voice interaction pipeline.
•Video Chat Interface: Implemented with Daily.co for the video call meeting interface.
•Speech-to-Text: Utilizes Whisper for converting speech into text.
•LLM Response: Employs Groq for generating contextual responses.
•Text-to-Speech: Utilizes Eleven Labs for natural speech synthesis.