audio/multimodal papers
A Folder from Siddish
LAION-AI/CLAP: Contrastive Language-Audio Pretraining
Sreyan88/GAMA: Code for the paper: GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
THUDM/GLM-4: GLM-4 series: Open Multilingual Multimodal Chat LMs
[2306.12925] AudioPaLM: A Large Language Model That Can Speak and Listen
[2402.13236] Towards audio language modeling -- an overview
[2308.12792] Sparks of Large Audio Models: A Survey and Outlook
BASE TTS: Lessons from building a billion-parameter text-to-speech model on 100K hours of data - Amazon Science
[2309.11000] Towards Joint Modeling of Dialogue Response and Speech Synthesis based on Large Language Model
lzw-lzw/GroundingGPT: [ACL 2024] GroundingGPT: Language-Enhanced Multi-modal Grounding Model
[2405.05945] Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers
[2310.08715] Toward Joint Language Modeling for Speech Units and Text
kyegomez/CM3Leon: An open source implementation of "Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning", an all-new multi modal AI that uses just a decoder to generate both text and images
[2402.12654] OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification
arxiv.org/pdf/2211.16866
arxiv.org/pdf/2407.09732
aask1357/hilcodec: High fidelity, lightweight, end-to-end, streaming, convolution-based neural audio codec
zhenye234/xcodec: X-Codec: Unified Audio Tokenizer for Audio Language Model
X-LANCE/SLAM-LLM: Speech, Language, Audio, Music Processing with Large Language Model
Linear95/SPAG: Self-playing Adversarial Language Game Enhances LLM Reasoning
[2405.04752] HILCodec: High Fidelity and Lightweight Neural Audio Codec
[2405.00233] SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound
[2401.03497] EAT: Self-Supervised Pre-Training with Efficient Audio Transformer
[2310.04673] LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT
[2310.13289] SALMONN: Towards Generic Hearing Abilities for Large Language Models
[2309.07937] Voxtlm: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks
[2310.00704] UniAudio: An Audio Foundation Model Toward Universal Audio Generation
[2310.00230] SLM: Bridge the thin gap between speech and text foundation models
[2402.01831] Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities
[2402.05755] SpiRit-LM: Interleaved Spoken and Written Language Model
[2402.12226] AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
[2402.08846] An Embarrassingly Simple Approach for LLM with Strong ASR Capacity
[2404.00656] WavLLM: Towards Robust and Adaptive Speech Large Language Model
[2305.13009] Textually Pretrained Speech Language Models
[2307.03917] On decoder-only architecture for speech-to-text and large language model integration
AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head | Proceedings of the AAAI Conference on Artificial Intelligence
openaccess.thecvf.com/content/CVPR2023/papers/Girdhar_ImageBind_One_Embedding_Space_To_Bind_Them_All_CVPR_2023_paper.pdf
[2305.11000] SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities
proceedings.neurips.cc/paper_files/paper/2023/file/6dcf277ea32ce3288914faf369fe6de0-Paper-Conference.pdf
Flamingo: a Visual Language Model for Few-Shot Learning
ViTAE-Transformer/QFormer: The official repo for [TPAMI'23] "Vision Transformer with Quadrangle Attention"
AudioLLMs/AudioLLM: Audio Large Language Models
[PDF] BLSP: Bootstrapping Language-Speech Pre-training via Behavior Alignment of Continuation Writing | Semantic Scholar
[PDF] A Full-duplex Speech Dialogue Scheme Based On Large Language Models | Semantic Scholar
[PDF] AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension | Semantic Scholar
[PDF] Audio-visual training for improved grounding in video-text LLMs | Semantic Scholar
[PDF] BLSP-Emo: Towards Empathetic Large Speech-Language Models | Semantic Scholar
[PDF] Boosting Large Language Model for Speech Synthesis: An Empirical Study | Semantic Scholar
[PDF] Connecting Speech Encoder and Large Language Model for ASR | Semantic Scholar
[PDF] DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment | Semantic Scholar
[PDF] Transferable speech-to-text large language model alignment module | Semantic Scholar
[PDF] SpeechVerse: A Large-scale Generalizable Audio Language Model | Semantic Scholar
[PDF] Sparks of Large Audio Models: A Survey and Outlook | Semantic Scholar
[PDF] WavLLM: Towards Robust and Adaptive Speech Large Language Model | Semantic Scholar
[PDF] SALMONN: Towards Generic Hearing Abilities for Large Language Models | Semantic Scholar
[PDF] DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding | Semantic Scholar
[PDF] Language Model Can Listen While Speaking | Semantic Scholar
Language Model Can Listen While Speaking
arxiv.org/pdf/2203.16502
arxiv.org/pdf/2408.02622
[2405.19487] A Full-duplex Speech Dialogue Scheme Based On Large Language Models
[2406.15718] Beyond the Turn-Based Game: Enabling Real-Time Conversations with Duplex Models
[PDF] Beyond the Turn-Based Game: Enabling Real-Time Conversations with Duplex Models | Semantic Scholar
[PDF] BESTOW: Efficient and Streamable Speech Language Model with the Best of Two Worlds in GPT and T5 | Semantic Scholar
[PDF] BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data | Semantic Scholar
[PDF] An Embarrassingly Simple Approach for LLM with Strong ASR Capacity | Semantic Scholar
[PDF] AffectGPT: Dataset and Framework for Explainable Multimodal Emotion Recognition | Semantic Scholar
[PDF] Advancing Large Language Models to Capture Varied Speaking Styles and Respond Properly in Spoken Conversations | Semantic Scholar
[PDF] UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding | Semantic Scholar
[2109.03264] Text-Free Prosody-Aware Generative Spoken Language Modeling
[2406.02430] Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
[2406.09569] Speech ReaLLM -- Real-time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time
[2401.00246] Boosting Large Language Model for Speech Synthesis: An Empirical Study