Deepfake detection faces a critical generalization hurdle, with performance deteriorating when there is a mismatch between the distributions of training and testing data. Code: https://github.com/sclbd/deepfakebench
This paper presents the first conversational agent that supports the full generality of hybrid data access for large knowledge corpora, through a language we developed called SUQL (Structured and Unstructured Query Language). Code: https://github.com/stanford-oval/suql
Finally, we build an end-to-end framework on top of our abstraction to automatically optimize deep learning models for given tensor computation primitives. Code: https://github.com/mlc-ai/web-llm
Visual language models (VLMs) rapidly progressed with the recent success of large language models. Code: https://github.com/efficient-large-model/vila
Large language models (LLMs) have revolutionized AI, but are constrained by limited context windows, hindering their utility in tasks like extended conversations and document analysis. Code: https://github.com/cpacker/memgpt
In this paper, we introduce score-based iterative reconstruction (SIR), an efficient and general algorithm for 3D generation with a multi-view score-based diffusion model. Code: https://github.com/ml-gsai/microdreamer
This collaborative prompting approach empowers a single LM to simultaneously act as a comprehensive orchestrator and a panel of diverse experts, significantly enhancing its performance across a wide array of tasks. Code: https://github.com/suzgunmirac/meta-prompting
We introduce WavCraft, a collective system that leverages large language models (LLMs) to connect diverse task-specific models for audio content creation and editing. Code: https://github.com/jinhualiang/wavcraft
However, this comes with high memory consumption, e.g., a well-trained Gaussian field may utilize three million Gaussian primitives and over 700 MB of memory. Code: https://github.com/runyiyang/sundae
A handful of visual foundation models (VFMs) have recently emerged as the backbones for numerous downstream tasks. Code: https://github.com/nvlabs/radio
Large Language Models (LLMs) have catalyzed significant advancements in Natural Language Processing (NLP), yet they encounter challenges such as hallucination and the need for domain-specific knowledge. Code: https://github.com/2471023025/ralm_survey
Contemporary 3D research, particularly in reconstruction and generation, heavily relies on 2D images for inputs or supervision. Code: https://github.com/facebookresearch/lightplane
Proprietary LMs such as GPT-4 are often employed to assess the quality of responses from various LMs. Code: https://github.com/prometheus-eval/prometheus-eval
This module converts the generated sequence of images into videos with smooth transitions and consistent subjects that are significantly more stable than the modules based on latent spaces only, especially in the context of long video generation. Code: https://github.com/hvision-nku/storydiffusion
Inspired by the Kolmogorov-Arnold representation theorem, we propose Kolmogorov-Arnold Networks (KANs) as promising alternatives to Multi-Layer Perceptrons (MLPs). Code: https://github.com/kindxiaoming/pykan
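The core structural idea behind KANs is that the learnable pieces live on the edges as univariate functions, and nodes merely sum their inputs. A minimal sketch of that idea (illustrative only, not the pykan API; the paper uses B-splines, whereas this toy uses cubic polynomials):

```python
import numpy as np

# Toy KAN layer: each edge (j, i) carries its own learnable univariate
# function, here a cubic polynomial phi(x) = c0 + c1*x + c2*x^2 + c3*x^3.
# A node has no fixed activation and no weight matrix; it just sums the
# edge-transformed inputs. This is a sketch, not the pykan implementation.

def edge_fn(x, coeffs):
    """Evaluate one edge's learnable univariate function (cubic polynomial)."""
    return sum(c * x**k for k, c in enumerate(coeffs))

def kan_layer(x, coeff_grid):
    """x: (n_in,) input vector; coeff_grid: (n_out, n_in, 4) coefficients.
    Output j is the sum over inputs i of phi_{j,i}(x_i)."""
    n_out, n_in, _ = coeff_grid.shape
    return np.array([
        sum(edge_fn(x[i], coeff_grid[j, i]) for i in range(n_in))
        for j in range(n_out)
    ])

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0])
coeffs = rng.normal(size=(3, 2, 4))  # 2 inputs -> 3 outputs
y = kan_layer(x, coeffs)
print(y.shape)  # (3,)
```

Training would fit the per-edge coefficients by gradient descent, exactly where an MLP would instead fit a weight matrix in front of a fixed activation.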
MultiBooth addresses these issues by dividing the multi-concept generation process into two phases: a single-concept learning phase and a multi-concept integration phase. Code: https://github.com/chenyangzhu1/multibooth
We propose Pure and Lightning ID customization (PuLID), a novel tuning-free ID customization method for text-to-image generation. Code: https://github.com/tothebeginning/pulid
We introduce Groma, a Multimodal Large Language Model (MLLM) with grounded and fine-grained visual perception ability. Code: https://github.com/FoundationVision/Groma
Answering real-world user queries, such as product search, often requires accurate retrieval of information from semi-structured knowledge bases or databases that involve a blend of unstructured (e.g., textual descriptions of products) and structured (e.g., entity relations of products) information. Code: https://github.com/snap-stanford/stark
Molecular Relational Learning (MRL), aiming to understand interactions between molecular pairs, plays a pivotal role in advancing biochemical research. Code: https://github.com/MangoKiller/MolTC
PLLaVA achieves new state-of-the-art performance on modern benchmark datasets for both video question-answer and captioning tasks. Code: https://github.com/magic-research/PLLaVA
Set-of-Mark (SoM) Prompting unleashes the visual grounding capability of GPT-4V, by enabling the model to associate visual objects with tags inserted on the image. Code: https://github.com/zzxslp/som-llava
Compared to both open-source and proprietary models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results in 8 of 18 benchmarks. Code: https://github.com/opengvlab/internvl
Two significant challenges exist: the lack of an affordable and accessible teleoperation system suitable for a dual-arm setup with multifingered hands, and the scarcity of multifingered hand hardware equipped with touch sensing. Code: https://github.com/ToruOwO/hato
ConsistentID comprises two key components: a multimodal facial prompt generator that combines facial features, corresponding facial descriptions and the overall facial context to enhance precision in facial details, and an ID-preservation network optimized through the facial attention localization strategy, aimed at preserving ID consistency in facial regions. Code: https://github.com/JackAILab/ConsistentID
While many contemporary large language models (LLMs) can process lengthy input, they still struggle to fully utilize information within the long context, known as the lost-in-the-middle challenge. Code: https://github.com/microsoft/FILM
This paper introduces FlowMap, an end-to-end differentiable method that solves for precise camera poses, camera intrinsics, and per-frame dense depth of a video sequence. Code: https://github.com/dcharatan/flowmap
Starting with a set of pre-trained LoRA adapters, our gating strategy uses the hidden states to dynamically mix adapted layers, allowing the resulting X-LoRA model to draw upon different capabilities and create never-before-used deep layer-wise combinations to solve tasks. Code: https://github.com/ericlbuehler/mistral.rs
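The gating strategy described above can be sketched in a few lines: a small gate maps the hidden state to softmax weights over the pre-trained adapters, and the layer's effective update is the weighted mix of their low-rank deltas. All names and shapes below are illustrative assumptions, not the mistral.rs or X-LoRA API:

```python
import numpy as np

# Sketch of an X-LoRA-style gated mixture of LoRA adapters.
# `gate_W` is a hypothetical gating projection; in practice the gate is a
# trained network and the mixing happens per layer and per token.

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def xlora_forward(h, W, adapters, gate_W):
    """h: (d,) hidden state; W: (d, d) frozen base weight;
    adapters: list of LoRA pairs (A, B) with A (d, r), B (r, d);
    gate_W: (d, n_adapters) gating projection."""
    weights = softmax(h @ gate_W)                       # mixing weights from the hidden state
    delta = sum(w * (A @ B) for w, (A, B) in zip(weights, adapters))
    return h @ (W + delta)                              # base weight plus mixed low-rank update

rng = np.random.default_rng(0)
d, r, n = 8, 2, 3
h = rng.normal(size=d)
W = rng.normal(size=(d, d))
adapters = [(rng.normal(size=(d, r)), rng.normal(size=(r, d))) for _ in range(n)]
gate_W = rng.normal(size=(d, n))
out = xlora_forward(h, W, adapters, gate_W)
print(out.shape)  # (8,)
```

Because the weights depend on the hidden state, different inputs activate different adapter combinations at different layers, which is the "never-before-used deep layer-wise combinations" claim.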
Contrastive learning has emerged as a transformative method for learning effective visual representations through the alignment of image and text embeddings. Code: https://github.com/apple/corenet
In this paper, we explore LLMs as copilots that assist humans in proving theorems. Code: https://github.com/lean-dojo/leancopilot
The backbone is trained end-to-end using a novel differentiable solver for wide-baseline two-view pose. Code: https://github.com/princeton-vl/multislam_diffpose
The chains of nodes can be designed to explicitly enforce a naturally structured "thought process". Code: https://github.com/holmeswww/agentkit
We hope that our work will inspire future research into what can be achieved by versatile multimodal foundation models in real-world applications. Code: https://github.com/ailab-cvc/seed-x
The objective of this paper is motion segmentation -- discovering and segmenting the moving objects in a video. Code: https://github.com/Jyxarthur/flowsam
Specifically, SnapKV achieves a consistent decoding speed with a 3.6x increase in generation speed and an 8.2x enhancement in memory efficiency compared to baseline when processing inputs of 16K tokens. Code: https://github.com/fasterdecoding/snapkv
We propose AutoCrawler, a two-stage framework that leverages the hierarchical structure of HTML for progressive understanding. Code: https://github.com/ez-hwh/autocrawler
We propose a new metric to assess personality generation capability based on this evaluation method. Code: https://github.com/hiyouga/llama-factory
Based on this pipeline, a random face reference training method is further devised to precisely capture the ID-relevant embeddings from reference images, thus improving the fidelity and generalization capacity of our model for ID-specific video generation. Code: https://github.com/id-animator/id-animator
Finally, we present a customization method using a pair of person-garment images, which significantly improves fidelity and authenticity. Code: https://github.com/yisol/IDM-VTON
To this end, we release OpenELM, a state-of-the-art open language model. Code: https://github.com/apple/corenet
However, key-value (KV) cache, which is stored to avoid re-computation, has emerged as a critical bottleneck by growing linearly in size with the sequence length. Code: https://github.com/Infini-AI-Lab/TriForce
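The linear growth is easy to quantify: per token, each layer stores one key and one value vector per KV head. A back-of-the-envelope calculation with illustrative, Llama-2-7B-like dimensions (assumed here, not taken from the TriForce paper):

```python
# KV-cache size grows linearly with sequence length:
# 2 (key + value) * seq_len * n_layers * n_kv_heads * head_dim * bytes/elem.
# Dimensions below are illustrative (roughly Llama-2-7B with fp16 cache).

def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128,
                   bytes_per_elem=2):  # fp16
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

for seq_len in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(seq_len) / 2**30
    print(f"{seq_len:>7} tokens -> {gib:.1f} GiB")
# 4096 -> 2.0 GiB, 32768 -> 16.0 GiB, 131072 -> 64.0 GiB
```

At 128K tokens the cache alone (64 GiB here) dwarfs the ~13 GiB of fp16 model weights, which is why long-context serving work targets the KV cache specifically.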
This paper introduces RecAI, a practical toolkit designed to augment or even revolutionize recommender systems with the advanced capabilities of Large Language Models (LLMs). Code: https://github.com/microsoft/recai
SPSC and SDSC augment live samples into simulated attack samples by simulating spoofing clues of physical and digital attacks, respectively, which significantly improve the capability of the model to detect "unseen" attack types. Code: https://github.com/FaceOnLive/Face-Liveness-Detection-SDK-Linux
To this end, we propose ControlNet++, a novel approach that improves controllable generation by explicitly optimizing pixel-level cycle consistency between generated images and conditional controls. Code: https://github.com/liming-ai/ControlNet_Plus_Plus
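The cycle-consistency objective named above can be illustrated with stand-in functions: generate an image from a control map, re-extract the control from the generated image with a discriminative model, and penalize the pixel-wise disagreement. Both models below are toy placeholders, not the ControlNet++ implementation:

```python
import numpy as np

# Toy cycle-consistency sketch: generator and condition extractor are
# placeholder functions chosen to be (approximate) inverses, so the loss
# is near zero when the cycle closes. In the real method both are
# learned networks and the loss drives generator training.

def generate(control):
    """Stand-in for a controllable generator."""
    return np.tanh(control)

def extract_control(image):
    """Stand-in for a condition extractor (e.g. an edge or depth model)."""
    return np.arctanh(np.clip(image, -0.99, 0.99))

def cycle_consistency_loss(control):
    image = generate(control)
    reconstructed = extract_control(image)
    return float(np.mean((control - reconstructed) ** 2))

c = np.linspace(-0.5, 0.5, 16).reshape(4, 4)
print(round(cycle_consistency_loss(c), 6))  # ~0.0: the toy cycle closes
```

The point is the structure of the objective: the control is compared against itself after a generate-then-extract round trip, at the pixel level.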
In the realm of security applications, biometric authentication systems play a crucial role, yet one often encounters challenges concerning privacy and security while developing one. Code: https://github.com/Recognito-Vision/NIST-FRVT-Top-1-Face-Recognition
We present the LM Transparency Tool (LM-TT), an open-source interactive toolkit for analyzing the internal workings of Transformer-based language models. Code: https://github.com/facebookresearch/llm-transparency-tool
Large-scale recommendation systems are characterized by their reliance on high cardinality, heterogeneous features and the need to handle tens of billions of user actions on a daily basis. Code: https://github.com/facebookresearch/generative-recommenders
This work introduces an efficient method to scale Transformer-based Large Language Models (LLMs) to infinitely long inputs with bounded memory and computation. Code: https://github.com/Beomi/InfiniTransformer
This enables our method - namely LAndmark-based Facial Self-supervised learning (LAFS) - to learn key representations that are more critical for face recognition. Code: https://github.com/Recognito-Vision/Face-SDK-Linux-Demos
In this work, we highlight the following pitfall of prefilling: for batches containing high-varying prompt lengths, significant computation is wasted by the standard practice of padding sequences to the maximum length. Code: https://github.com/siyan-zhao/prepacking
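The waste from padding is easy to see by counting tokens: with padding, every prompt in the batch is processed at the length of the longest one, while packing processes only real tokens. A minimal sketch with made-up prompt lengths (illustrative only, not the repository's code):

```python
# Token counts processed during prefill for a batch of variable-length
# prompts: padding expands every sequence to the batch maximum, whereas
# packing (as in prepacking) keeps only the real tokens.

def padded_tokens(lengths):
    """Tokens processed when every prompt is padded to the batch max."""
    return len(lengths) * max(lengths)

def packed_tokens(lengths):
    """Tokens processed when prompts are packed without padding."""
    return sum(lengths)

lengths = [12, 100, 2048, 7, 512]  # hypothetical prompt lengths in one batch
padded, packed = padded_tokens(lengths), packed_tokens(lengths)
print(f"padded: {padded}, packed: {packed}, wasted: {1 - packed / padded:.0%}")
# padded: 10240, packed: 2679, wasted: 74%
```

One long outlier prompt is enough to make the padded batch mostly pad tokens; the actual method additionally needs a block-diagonal attention mask and per-prompt position IDs so packed prompts do not attend to each other.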