Natural Language Inference (NLI) is a cornerstone of Natural Language
Processing (NLP), providing insight into the entailment relationships between
pairs of texts. It is a critical component of Natural Language Understanding
(NLU), as it reflects a system's ability to draw inferences from spoken or
written language.
interactions. NLI is mainly concerned with determining the entailment
relationship between two statements, known as the premise and hypothesis. When
the premise logically implies the hypothesis, the pair is labeled "entailment".
If the hypothesis contradicts the premise, the pair receives the
"contradiction" label. When there is insufficient evidence to establish a
connection, the pair is described as "neutral". Despite the success of Large
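To make the three labels concrete, the following minimal sketch shows one
premise paired with a hypothesis for each label; the sentences are hypothetical
illustrations, not drawn from XNLI:

```python
# Illustrative premise-hypothesis pairs, one per NLI label.
# These sentences are hypothetical examples, not drawn from XNLI.
examples = [
    {
        "premise": "A man is playing a guitar on stage.",
        "hypothesis": "A musician is performing.",
        "label": "entailment",  # the premise logically implies the hypothesis
    },
    {
        "premise": "A man is playing a guitar on stage.",
        "hypothesis": "The stage is empty.",
        "label": "contradiction",  # the hypothesis conflicts with the premise
    },
    {
        "premise": "A man is playing a guitar on stage.",
        "hypothesis": "The concert is sold out.",
        "label": "neutral",  # the premise supports neither relationship
    },
]
```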
Despite the success of Large Language Models (LLMs) across a wide range of
tasks, their effectiveness in NLI remains constrained by issues such as low
accuracy in low-resource domains, model overconfidence, and difficulty in
capturing human judgment disagreements. This study addresses
the underexplored area of evaluating LLMs in low-resource languages such as
Bengali. Through a comprehensive evaluation, we assess the performance of
prominent LLMs and state-of-the-art (SOTA) models in Bengali NLP tasks,
focusing on natural language inference. Utilizing the XNLI dataset, we conduct
zero-shot and few-shot evaluations, comparing LLMs such as GPT-3.5 Turbo and
Gemini 1.5 Pro with fine-tuned models such as BanglaBERT, Bangla BERT Base,
DistilBERT, mBERT, and sahajBERT.
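As a rough illustration of the zero-shot setting, the sketch below shows how a
single premise-hypothesis pair might be posed to GPT-3.5 Turbo. It assumes the
OpenAI Python client (v1 style); the prompt wording, temperature setting, and
label-parsing fallback are illustrative assumptions, not the exact protocol of
this study:

```python
# A minimal zero-shot NLI prompting sketch, assuming the OpenAI Python
# client (v1 style). The prompt wording and label parsing are illustrative,
# not the exact setup used in this study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LABELS = {"entailment", "contradiction", "neutral"}

def classify_nli(premise: str, hypothesis: str) -> str:
    """Ask the model for a single NLI label for one premise-hypothesis pair."""
    prompt = (
        "Given a premise and a hypothesis, answer with exactly one word: "
        "entailment, contradiction, or neutral.\n"
        f"Premise: {premise}\n"
        f"Hypothesis: {hypothesis}\n"
        "Label:"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output for evaluation
    )
    answer = response.choices[0].message.content.strip().lower()
    # Fall back to "neutral" if the reply is not a recognized label.
    return answer if answer in LABELS else "neutral"
```

In the few-shot setting, a handful of labeled premise-hypothesis pairs would be
prepended to the prompt before the test pair.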
Our findings reveal that while LLMs can achieve performance comparable or
superior to fine-tuned SOTA models in few-shot scenarios, further research is
necessary to deepen our understanding of LLM behavior in low-resource languages
such as Bengali. This study underscores the importance of continued efforts to
explore LLM capabilities across diverse linguistic contexts.