audio/multimodal papers
A Folder from Siddish
LAION-AI/CLAP: Contrastive Language-Audio Pretraining
Sreyan88/GAMA: Code for the paper: GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
THUDM/GLM-4: GLM-4 series: Open Multilingual Multimodal Chat LMs
[2306.12925] AudioPaLM: A Large Language Model That Can Speak and Listen
[2402.13236] Towards audio language modeling -- an overview
[2308.12792] Sparks of Large Audio Models: A Survey and Outlook
BASE TTS: Lessons from building a billion-parameter text-to-speech model on 100K hours of data - Amazon Science
[2309.11000] Towards Joint Modeling of Dialogue Response and Speech Synthesis based on Large Language Model
lzw-lzw/GroundingGPT: [ACL 2024] GroundingGPT: Language-Enhanced Multi-modal Grounding Model
[2405.05945] Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers
[2310.08715] Toward Joint Language Modeling for Speech Units and Text
kyegomez/CM3Leon: An open source implementation of "Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning", an all-new multi modal AI that uses just a decoder to generate both text and images
[2402.12654] OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification
arxiv.org/pdf/2211.16866
arxiv.org/pdf/2407.09732
aask1357/hilcodec: High fidelity, lightweight, end-to-end, streaming, convolution-based neural audio codec
zhenye234/xcodec: X-Codec: Unified Audio Tokenizer for Audio Language Model
X-LANCE/SLAM-LLM: Speech, Language, Audio, Music Processing with Large Language Model
Linear95/SPAG: Self-playing Adversarial Language Game Enhances LLM Reasoning
[2405.04752] HILCodec: High Fidelity and Lightweight Neural Audio Codec
[2405.00233] SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound
[2401.03497] EAT: Self-Supervised Pre-Training with Efficient Audio Transformer
[2310.04673] LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT
[2310.13289] SALMONN: Towards Generic Hearing Abilities for Large Language Models
[2309.07937] Voxtlm: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks
[2310.00704] UniAudio: An Audio Foundation Model Toward Universal Audio Generation
[2310.00230] SLM: Bridge the thin gap between speech and text foundation models
[2402.01831] Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities
[2402.05755] SpiRit-LM: Interleaved Spoken and Written Language Model
[2402.12226] AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
[2402.08846] An Embarrassingly Simple Approach for LLM with Strong ASR Capacity
[2404.00656] WavLLM: Towards Robust and Adaptive Speech Large Language Model
[2305.13009] Textually Pretrained Speech Language Models
[2307.03917] On decoder-only architecture for speech-to-text and large language model integration
AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head | Proceedings of the AAAI Conference on Artificial Intelligence
openaccess.thecvf.com/content/CVPR2023/papers/Girdhar_ImageBind_One_Embedding_Space_To_Bind_Them_All_CVPR_2023_paper.pdf
[2305.11000] SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities
proceedings.neurips.cc/paper_files/paper/2023/file/6dcf277ea32ce3288914faf369fe6de0-Paper-Conference.pdf
Flamingo: a Visual Language Model for Few-Shot Learning
ViTAE-Transformer/QFormer: The official repo for [TPAMI'23] "Vision Transformer with Quadrangle Attention"
AudioLLMs/AudioLLM: Audio Large Language Models
[PDF] BLSP: Bootstrapping Language-Speech Pre-training via Behavior Alignment of Continuation Writing | Semantic Scholar
[PDF] A Full-duplex Speech Dialogue Scheme Based On Large Language Models | Semantic Scholar
[PDF] AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension | Semantic Scholar
[PDF] Audio-visual training for improved grounding in video-text LLMs | Semantic Scholar
[PDF] BLSP-Emo: Towards Empathetic Large Speech-Language Models | Semantic Scholar
[PDF] Boosting Large Language Model for Speech Synthesis: An Empirical Study | Semantic Scholar
[PDF] Connecting Speech Encoder and Large Language Model for ASR | Semantic Scholar
[PDF] DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment | Semantic Scholar
[PDF] Transferable speech-to-text large language model alignment module | Semantic Scholar
[PDF] SpeechVerse: A Large-scale Generalizable Audio Language Model | Semantic Scholar
[PDF] Sparks of Large Audio Models: A Survey and Outlook | Semantic Scholar
[PDF] WavLLM: Towards Robust and Adaptive Speech Large Language Model | Semantic Scholar
[PDF] SALMONN: Towards Generic Hearing Abilities for Large Language Models | Semantic Scholar
[PDF] DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding | Semantic Scholar
[PDF] Language Model Can Listen While Speaking | Semantic Scholar
Language Model Can Listen While Speaking
arxiv.org/pdf/2203.16502
arxiv.org/pdf/2408.02622
[2405.19487] A Full-duplex Speech Dialogue Scheme Based On Large Language Models
[2406.15718] Beyond the Turn-Based Game: Enabling Real-Time Conversations with Duplex Models
[PDF] Beyond the Turn-Based Game: Enabling Real-Time Conversations with Duplex Models | Semantic Scholar
[PDF] BESTOW: Efficient and Streamable Speech Language Model with the Best of Two Worlds in GPT and T5 | Semantic Scholar
[PDF] BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data | Semantic Scholar
[PDF] An Embarrassingly Simple Approach for LLM with Strong ASR Capacity | Semantic Scholar
[PDF] AffectGPT: Dataset and Framework for Explainable Multimodal Emotion Recognition | Semantic Scholar
[PDF] Advancing Large Language Models to Capture Varied Speaking Styles and Respond Properly in Spoken Conversations | Semantic Scholar
[PDF] UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding | Semantic Scholar
[2109.03264] Text-Free Prosody-Aware Generative Spoken Language Modeling
[2406.02430] Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
[2406.09569] Speech ReaLLM -- Real-time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time
[2401.00246] Boosting Large Language Model for Speech Synthesis: An Empirical Study