arXiv - Artificial Intelligence · 19h ago
MAGELLAN introduces a metacognitive framework for LLM agents to efficiently prioritize learning goals in vast spaces by predicting their own competence and learning progress dynamically.
- MAGELLAN enhances LLM agents' ability to predict learning progress (LP) by capturing semantic relationships between goals, allowing for efficient goal prioritization.
- The framework enables agents to adapt to evolving goal spaces through online reinforcement learning, improving sample efficiency in LP estimation.
- Results demonstrate that MAGELLAN is the only method that allows agents to fully master large and dynamic goal spaces, showcasing its potential for scaling curriculum learning.
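To make the learning-progress (LP) idea concrete, here is a minimal, hypothetical sketch of LP-weighted goal sampling in Python. It is not MAGELLAN's architecture (which predicts LP with the agent's own LLM and exploits semantic relationships between goals); it only illustrates what prioritizing goals by estimated LP looks like.

```python
import random
from collections import defaultdict

class LPGoalSampler:
    """Toy learning-progress-based curriculum: sample goals whose
    estimated competence is changing fastest (in either direction)."""

    def __init__(self, goals, window=20, epsilon=0.2):
        self.goals = list(goals)
        self.epsilon = epsilon              # exploration: sometimes sample uniformly
        self.window = window
        self.history = defaultdict(list)    # goal -> recent success/failure flags (0/1)

    def learning_progress(self, goal):
        h = self.history[goal]
        if len(h) < 2 * self.window:
            return 1.0                      # optimistic init: unexplored goals look promising
        old = sum(h[-2 * self.window:-self.window]) / self.window
        new = sum(h[-self.window:]) / self.window
        return abs(new - old)               # absolute LP: competence moving up or down

    def sample(self):
        if random.random() < self.epsilon:
            return random.choice(self.goals)
        lp = [self.learning_progress(g) for g in self.goals]
        if sum(lp) == 0:                    # no goal shows progress: fall back to uniform
            return random.choice(self.goals)
        return random.choices(self.goals, weights=lp)[0]

    def update(self, goal, success):
        self.history[goal].append(int(success))

sampler = LPGoalSampler(["pick up key", "open door", "cook meal"])
goal = sampler.sample()
sampler.update(goal, success=True)
```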
arXiv - Artificial Intelligence · 19h ago
SymGPT is a novel tool that integrates large language models with symbolic execution to automate the verification of smart contracts against ERC rules, significantly improving the detection of security violations.
- The paper highlights the limitations of current smart contract verification methods, which often rely on manual audits or expert tools that fail to effectively identify ERC rule violations.
- SymGPT translates ERC rules into a defined EBNF grammar and synthesizes constraints to detect potential violations, identifying 5,783 violations in real-world contracts.
- Evaluation results show that SymGPT outperforms existing automated techniques and expert auditing services, demonstrating its effectiveness in enhancing smart contract security.
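As an illustration of the rule-as-constraint idea, the toy Python sketch below checks hand-written contract "facts" against two invented ERC-20 rules. SymGPT itself derives rules from an EBNF grammar with an LLM and establishes the facts via symbolic execution; everything here is a simplified stand-in.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    """An ERC rule reduced to a machine-checkable constraint (illustrative only)."""
    erc: str
    function: str
    requires: str          # name of a fact the analysis must establish
    description: str

RULES = [
    Rule("ERC-20", "transfer", "emits_transfer_event",
         "transfer() must emit a Transfer event"),
    Rule("ERC-20", "transfer", "reverts_on_insufficient_balance",
         "transfer() must revert if the sender balance is too low"),
]

def check(contract_facts: dict) -> list:
    """Return the rules a contract violates, given facts extracted per function.

    In SymGPT the facts would come from symbolic execution of the contract;
    here they are hand-written booleans.
    """
    violations = []
    for rule in RULES:
        facts = contract_facts.get(rule.function, {})
        if not facts.get(rule.requires, False):
            violations.append(rule.description)
    return violations

# A toy contract whose transfer() forgets to emit the event:
facts = {"transfer": {"emits_transfer_event": False,
                      "reverts_on_insufficient_balance": True}}
print(check(facts))   # -> ["transfer() must emit a Transfer event"]
```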
arXiv - Artificial Intelligence · 19h ago
NatureLM is a sequence-based foundation model designed to integrate various scientific domains, enhancing scientific discovery through applications like drug design and material optimization.
- NatureLM is pre-trained on data from multiple scientific fields, allowing it to generate and optimize small molecules, proteins, and materials based on text instructions.
- The model supports cross-domain generation, enabling tasks such as protein-to-molecule and protein-to-RNA transformations, showcasing its versatility.
- Available in sizes from 1 billion to 46.7 billion parameters, NatureLM performs better at larger scales, achieving state-of-the-art results on specific scientific tasks such as SMILES-to-IUPAC translation.
arXiv - Artificial Intelligence · 19h ago
This research paper presents a framework for enhancing strategic reasoning in multi-agent simulations using Large Language Models (LLMs), demonstrating their effectiveness in approximating human behavior in game-theoretic contexts.
- The study implements a role-based multi-agent framework that utilizes LLMs for strategic interactions, allowing for systematic evaluation of reasoning capabilities.
- Experiments with one-shot, 2-player beauty contests reveal that LLM-driven agents can outperform traditional economic models in mimicking human decision-making.
- An alternative semantic measure of reasoning is proposed, expanding on existing k-level theory, which enhances the understanding of strategic reasoning in artificial agents.
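The k-level reasoning the paper builds on can be illustrated with the textbook iterated best response in a p-beauty contest. This is generic game-theory background, not the paper's LLM-agent framework, and the one-shot 2-player variant studied there has its own equilibrium analysis.

```python
def k_level_guess(k: int, p: float = 2 / 3, level0_mean: float = 50.0) -> float:
    """Guess of a level-k player in a p-beauty contest (target = p * average guess).

    Level-0 guesses uniformly at random (expected value 50); a level-k player
    best-responds to level-(k-1) opponents, so each step multiplies by p.
    """
    guess = level0_mean
    for _ in range(k):
        guess *= p
    return guess

for k in range(5):
    print(f"level-{k}: {k_level_guess(k):.2f}")
# 50.00, 33.33, 22.22, ... shrinking toward the Nash equilibrium of 0 as k grows.
```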
arXiv - Artificial Intelligence · 19h ago
This research paper investigates how Large Language Models (LLMs) can learn Long Chain-of-Thought (Long CoT) reasoning effectively through structured training, emphasizing the importance of reasoning structure over content.
- The study reveals that LLMs can achieve significant performance improvements in reasoning tasks with only 17k Long CoT training samples, demonstrating data efficiency.
- Structural integrity of Long CoT is crucial; disruptions in logical consistency severely impact model accuracy, while content variations have minimal effects.
- The findings provide insights into training methodologies for future reasoning models, highlighting the need for structured reasoning approaches in LLM development.
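The structure-versus-content distinction can be illustrated with two toy perturbations of a reasoning chain: shuffling the steps breaks logical consistency, while rewording a step changes only content. This is an illustrative sketch, not the paper's experimental protocol.

```python
import random

cot_steps = [
    "Let x be the number of apples.",
    "The problem says x + 3 = 10.",
    "Subtract 3 from both sides: x = 7.",
    "So there are 7 apples.",
]

def break_structure(steps):
    """Structural perturbation: reorder steps so later steps no longer follow
    from earlier ones (the kind of disruption the paper finds harmful)."""
    shuffled = steps[:]
    random.shuffle(shuffled)
    return shuffled

def vary_content(steps):
    """Content perturbation: keep the order and logic, reword surface details
    (the kind of change the paper finds models tolerate well)."""
    return [s.replace("apples", "oranges") for s in steps]

print(break_structure(cot_steps))
print(vary_content(cot_steps))
```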
arXiv - Artificial Intelligence · 19h ago
This research paper investigates the impact of chain-of-thought (CoT) length on the reasoning capabilities of large language models (LLMs), revealing an optimal length for effective multi-step reasoning.
- The study finds that increasing CoT length initially improves reasoning accuracy but eventually leads to performance decline due to noise susceptibility.
- The authors theoretically establish an optimal CoT length, deriving a scaling law based on model capability and task difficulty.
- Experiments on synthetic and real-world datasets led to the proposal of Length-filtered Vote, a method to optimize CoT length for better reasoning outcomes.
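A rough sketch of the length-filtered voting idea follows, under the assumption that the filter keeps chains whose length falls in a band around the estimated optimum and then majority-votes over the surviving answers (the paper's exact criterion may differ).

```python
from collections import Counter

def length_filtered_vote(chains, low, high):
    """chains: list of (num_reasoning_steps, answer) pairs.

    Keep chains whose length lies in [low, high] (a band around the estimated
    optimal CoT length), then majority-vote over the surviving answers.
    Falls back to voting over all chains if the filter removes everything.
    """
    kept = [ans for length, ans in chains if low <= length <= high]
    if not kept:
        kept = [ans for _, ans in chains]
    return Counter(kept).most_common(1)[0][0]

samples = [(3, "42"), (4, "42"), (12, "17"), (5, "42"), (15, "17")]
print(length_filtered_vote(samples, low=2, high=8))   # -> "42"
```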
arXiv - Artificial Intelligence · 19h ago
This paper investigates inference-time computation methods for enhancing reasoning performance in large language models (LLMs) without altering model parameters. It benchmarks various strategies and establishes a standardized evaluation framework.
- The study explores diverse inference-time computation strategies, such as Best-of-N and beam search, to improve reasoning tasks without additional training.
- Extensive experiments reveal that optimizing candidate generation and reward mechanisms can significantly enhance performance, with temperature tuning improving results by up to 5%.
- A standardized benchmark is established, evaluating six methods across eight reasoning tasks, providing a foundation for future research in LLM reasoning enhancement.
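For reference, Best-of-N is the simplest of the benchmarked strategies: sample several candidates and keep the one a scoring function prefers. The sketch below uses placeholder generate and reward functions rather than any specific model API.

```python
import random

def best_of_n(prompt, generate, reward, n=8, temperature=0.8):
    """Generic Best-of-N inference-time strategy.

    `generate(prompt, temperature)` -> candidate answer string (e.g. an LLM call),
    `reward(prompt, answer)` -> scalar score (e.g. a reward model); both are
    placeholders here, not a specific paper's API.
    """
    candidates = [generate(prompt, temperature) for _ in range(n)]
    return max(candidates, key=lambda ans: reward(prompt, ans))

# Toy stand-ins so the sketch runs end to end:
def toy_generate(prompt, temperature):
    return random.choice(["7", "8", "9"])

def toy_reward(prompt, answer):
    return 1.0 if answer == "8" else 0.0

print(best_of_n("What is 3 + 5?", toy_generate, toy_reward))   # -> "8"
```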
arXiv - Artificial Intelligence · 19h ago
This paper investigates the fluid intelligence deficiencies of Large Language Models (LLMs) through the ARC task, revealing significant limitations in their problem-solving capabilities without prior knowledge.
- LLMs excel in tasks leveraging encoded knowledge but struggle with fluid intelligence, essential for solving new problems independently.
- The study identifies three key limitations: poor skill composition, difficulty with abstract input formats, and issues with left-to-right decoding.
- Controlled experiments using the ARC task highlight these deficiencies, emphasizing the need for improved LLM capabilities in fluid intelligence assessments.
arXiv - Artificial Intelligence · 19h ago
This paper introduces Harmonia, a system that leverages LLM-based reasoning to automate data harmonization, addressing challenges like schema mismatches and varying terminologies in diverse datasets.
- Harmonia combines LLM reasoning with an interactive interface to streamline the creation of reusable data harmonization pipelines, enhancing efficiency for experts.
- The system is demonstrated in a clinical data harmonization scenario, showcasing its ability to map datasets to a standard format interactively.
- The authors discuss ongoing challenges in data harmonization and propose future research directions to advance the field, emphasizing the need for innovative solutions.
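The core schema-mapping step can be sketched as suggesting source-to-standard column mappings and leaving uncertain ones to the human in the loop. Harmonia uses LLM reasoning for the suggestion step; plain string similarity stands in for it here, and the standard schema is invented for illustration.

```python
import difflib

STANDARD_SCHEMA = ["patient_id", "age_years", "sex", "systolic_bp_mmHg"]

def suggest_mapping(source_columns, standard=STANDARD_SCHEMA, cutoff=0.4):
    """Suggest source column -> standard column mappings.

    Harmonia would ask an LLM to reason about names, descriptions and units;
    here simple string similarity stands in for that step, and unmapped columns
    are left for the interactive, human-in-the-loop interface to resolve.
    """
    mapping = {}
    for col in source_columns:
        match = difflib.get_close_matches(col.lower(), standard, n=1, cutoff=cutoff)
        mapping[col] = match[0] if match else None
    return mapping

print(suggest_mapping(["PatientID", "Age (yrs)", "Sex", "SBP"]))
# "SBP" will likely come back unmapped -- exactly where the interactive step kicks in.
```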
arXiv - Artificial Intelligence · 19h ago
This position paper discusses the necessity of integrating episodic memory into Large Language Models (LLMs) to enhance their long-term learning capabilities and adaptive behavior in dynamic environments.
- The authors propose an episodic memory framework for LLM agents, emphasizing five key properties that enable context-sensitive and adaptive behavior.
- The paper argues for a focused research effort on episodic memory, suggesting it is crucial for developing efficient long-term LLM agents.
- A roadmap is outlined to unify various research directions aimed at supporting the five properties of episodic memory, facilitating the evolution of LLMs into more capable agents.
arXiv - Artificial Intelligence · 19h ago
This paper presents a systematic approach to designing efficient optimizers for large language models (LLMs) using structured Fisher information matrix approximation, proposing new optimizers that enhance memory efficiency and convergence speed.
- The study identifies that many efficient optimizers can be derived from structured Fisher information matrix approximations, leading to insights for their design.
- Two novel optimizers, Row and Column Scaled SGD (RACS) and Adaptive low-dimensional subspace estimation (Alice), are introduced, focusing on balancing efficiency and generality.
- Experimental results on LLaMA pre-training demonstrate that Alice achieves over 2x faster convergence than Adam, while RACS shows strong performance with low memory overhead on large models.
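The row-and-column scaling idea can be illustrated with an Adafactor-style update that keeps one second-moment vector per row and one per column of a weight matrix instead of a full matrix. This is an illustrative approximation, not the paper's exact RACS or Alice update.

```python
import numpy as np

class RowColumnScaledSGD:
    """Illustrative row/column-scaled SGD for a 2-D weight matrix.

    Instead of storing a full second-moment matrix (as Adam does), keep one
    vector per row and one per column and scale the step by their rank-1
    combination -- which is where the memory savings come from.
    """

    def __init__(self, shape, lr=1e-2, beta=0.999, eps=1e-8):
        self.lr, self.beta, self.eps = lr, beta, eps
        self.row = np.zeros(shape[0])   # running mean of grad^2 per row
        self.col = np.zeros(shape[1])   # running mean of grad^2 per column

    def step(self, weight, grad):
        g2 = grad ** 2
        self.row = self.beta * self.row + (1 - self.beta) * g2.mean(axis=1)
        self.col = self.beta * self.col + (1 - self.beta) * g2.mean(axis=0)
        # Rank-1 reconstruction of the second moment, as in Adafactor.
        denom = np.sqrt(np.outer(self.row, self.col) / (self.row.mean() + self.eps))
        weight -= self.lr * grad / (denom + self.eps)
        return weight
```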
arXiv - Artificial Intelligence · 19h ago
This paper investigates the use of formal software verification, specifically the SPARK framework for Ada, to validate the correctness of code generated by large language models (LLMs). It introduces Marmaragan, a tool that generates SPARK annotations for existing programs.
- The study highlights the limitations of trusting LLM-generated code and proposes formal verification as a solution to ensure reliability.
- Marmaragan utilizes LLMs to create SPARK annotations, achieving a 50.7% success rate in generating correct annotations on benchmarked SPARK programs.
- The research establishes a foundation for integrating LLM capabilities with formal verification methods, paving the way for future advancements in software reliability.
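The generate-and-check loop behind such a tool can be sketched as: ask an LLM for SPARK annotations, insert them into the source, run the prover, and retry with the prover's feedback. The LLM call and the prover invocation below are placeholders (a GNATprove-style command and project file are assumed), not Marmaragan's actual interface.

```python
import subprocess

def annotate_and_verify(source_path, propose_annotations, max_attempts=3):
    """Iteratively propose SPARK annotations and check them with the prover.

    `propose_annotations(source_text, feedback)` stands in for the LLM call;
    the prover command assumes GNATprove is on PATH and a project file exists,
    so adjust both for a real setup.
    """
    feedback = ""
    for _ in range(max_attempts):
        source = open(source_path).read()
        annotated = propose_annotations(source, feedback)
        open(source_path, "w").write(annotated)
        result = subprocess.run(["gnatprove", "-P", "project.gpr"],
                                capture_output=True, text=True)
        if result.returncode == 0:
            return True                              # all checks proved
        feedback = result.stdout + result.stderr     # feed prover output back to the LLM
    return False
```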
arXiv - Artificial Intelligence · 19h ago
The paper presents LASP-2, a novel sequence parallelism method designed to enhance linear attention transformer models, improving communication and computation efficiency for very-long input sequences.
- LASP-2 reorganizes the communication needed for linear attention, replacing multiple communication rounds with a single AllGather operation and enhancing scalability.
- The method shows significant performance improvements, achieving a 15.2% speed increase over its predecessor LASP and 36.6% over Ring Attention with a sequence length of 2048K.
- LASP-2H extends the approach to hybrid models, efficiently integrating both linear and standard attention layers, demonstrating versatility in model training across distributed systems.
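The single-communication structure can be seen in a single-process simulation of chunked causal linear attention: each chunk's K^T V state is computed locally, and sharing those states (one AllGather in LASP-2) is all the inter-chunk part of the computation needs. The unnormalized attention and small shapes below are simplifications, not the paper's implementation.

```python
import numpy as np

def chunked_linear_attention(Q, K, V, num_chunks):
    """Causal (unnormalized) linear attention computed chunk by chunk, the way a
    sequence-parallel scheme splits work across devices (simulated on one process).

    Each chunk only needs (1) its local K^T V state and (2) the states of the
    chunks before it -- which is why a single AllGather of per-chunk states
    covers the inter-chunk computation.
    """
    Qs = np.array_split(Q, num_chunks)
    Ks = np.array_split(K, num_chunks)
    Vs = np.array_split(V, num_chunks)

    # Step 1 (local): every "device" computes its chunk's KV state.
    states = [k.T @ v for k, v in zip(Ks, Vs)]            # each is (d, d_v)

    # Step 2 (communication): a single AllGather would share `states` across devices.
    outputs = []
    for i, (q, k, v) in enumerate(zip(Qs, Ks, Vs)):
        prefix = sum(states[:i]) if i > 0 else np.zeros_like(states[0])
        inter = q @ prefix                                 # attend to earlier chunks
        intra = np.tril(q @ k.T) @ v                       # causal attention inside the chunk
        outputs.append(inter + intra)
    return np.concatenate(outputs)

d, d_v, T = 4, 4, 16
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(T, d)), rng.normal(size=(T, d)), rng.normal(size=(T, d_v))
full = np.tril(Q @ K.T) @ V                                # reference: single-chunk computation
assert np.allclose(chunked_linear_attention(Q, K, V, 4), full)
```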
arXiv - Artificial Intelligence · 19h ago
The paper introduces PerCul, a dataset aimed at evaluating the cultural sensitivity of large language models (LLMs) in Persian, addressing the cultural bias prevalent in LLMs trained predominantly on English data.
- PerCul features story-driven, multiple-choice questions designed to assess LLMs' understanding of culturally nuanced scenarios in Persian, curated by native annotators.
- The evaluation reveals a significant performance gap: an 11.3% difference between the best closed-source model and a layperson baseline, increasing to 21.3% for the best open-weight model.
- This research highlights the need for culturally competent LLMs and establishes a foundation for future cross-cultural evaluations in natural language processing.
arXiv - Machine Learning · 19h ago
The paper 'DarwinLM' presents an innovative method for structured pruning of Large Language Models (LLMs) that enhances efficiency while maintaining performance, addressing the high computational costs associated with LLMs.
- DarwinLM introduces a training-aware structured pruning method that utilizes an evolutionary search process to optimize model compression and performance.
- The method generates offspring models through mutation, progressively training them and selecting the fittest, which improves efficiency in real-time applications.
- Extensive experiments on models like Llama-2-7B and Qwen-2.5-14B-Instruct demonstrate that DarwinLM achieves state-of-the-art results while requiring significantly less training data compared to existing methods.
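The evolutionary search can be sketched as a mutate-evaluate-select loop over a per-layer sparsity profile. The fitness function below is a toy stand-in; DarwinLM evaluates pruned (and briefly trained) model offspring rather than a closed-form score.

```python
import random

LAYERS = 8
SPARSITY_CHOICES = [0.0, 0.25, 0.5, 0.75]   # per-layer structured-pruning levels

def mutate(profile, n_mutations=1):
    """Offspring = parent profile with a few layers' sparsity levels changed."""
    child = profile[:]
    for _ in range(n_mutations):
        child[random.randrange(LAYERS)] = random.choice(SPARSITY_CHOICES)
    return child

def fitness(profile, target_sparsity=0.5):
    """Stand-in fitness: in DarwinLM this would be the validation quality of the
    pruned and briefly trained model; here we just reward hitting the target
    average sparsity while penalizing pruning any single layer too aggressively."""
    avg = sum(profile) / LAYERS
    return -abs(avg - target_sparsity) - 0.1 * max(profile)

def evolve(generations=50, offspring_per_gen=8):
    parent = [0.0] * LAYERS
    for _ in range(generations):
        children = [mutate(parent) for _ in range(offspring_per_gen)]
        parent = max(children + [parent], key=fitness)   # selection: keep the fittest
    return parent

print(evolve())
```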
arXiv - Machine Learning · 19h ago
Goedel-Prover is an open-source large language model that excels in automated formal proof generation, addressing the challenge of limited formalized mathematical statements through innovative training methods.
- Goedel-Prover utilizes statement formalizers to convert natural language math problems into formal language, creating a dataset of 1.64 million formal statements for training.
- The model improves iteratively by training a series of provers, each proving statements that its predecessors could not.
- Achieving a 57.6% success rate on the miniF2F benchmark and ranking first on PutnamBench, Goedel-Prover significantly outperforms prior open-source models, generating nearly 30K formal proofs for Lean Workbook problems.
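The iterative training recipe can be sketched as an expert-iteration loop: each round, the current prover attempts the still-unsolved statements, and verified proofs become training data for the next prover. The prove and train functions below are placeholders, not Goedel-Prover's actual pipeline.

```python
def expert_iteration(statements, prove, train, rounds=4, samples_per_statement=32):
    """Sketch of an iterative prover-training loop.

    `prove(model, statement, n)` -> a verified proof string or None (e.g. a prover
    LLM whose outputs are checked by the proof assistant), and `train(dataset)`
    -> a new model; both are placeholders supplied by the caller.
    """
    model, proved = None, {}
    for round_idx in range(rounds):
        for stmt in statements:
            if stmt in proved:
                continue                        # already solved in an earlier round
            proof = prove(model, stmt, samples_per_statement)
            if proof is not None:
                proved[stmt] = proof            # newly solved statements grow the data
        model = train(list(proved.items()))     # next prover learns from all proofs so far
        print(f"round {round_idx}: {len(proved)}/{len(statements)} statements proved")
    return model, proved
```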
arXiv - Machine Learning · 19h ago
The paper presents Time2Lang, a novel framework that integrates time-series foundation models (TFMs) with large language models (LLMs) for health sensing, eliminating the need for text conversion and enhancing mental health classification tasks.
- Time2Lang directly maps TFM outputs to LLM representations, addressing inefficiencies in traditional methods that convert sensor data into text prompts.
- The framework was validated on two datasets: daily depression prediction using step count data and flourishing classification based on conversation duration, demonstrating effective integration.
- Time2Lang maintains consistent inference times regardless of input length and preserves essential time-series characteristics, marking a significant advancement in health applications of LLMs.
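The direct TFM-to-LLM mapping can be sketched as a small projection module that turns a time-series embedding into a few soft-prompt vectors in the LLM's embedding space. The dimensions and two-layer architecture below are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class TimeSeriesToLLMAdapter(nn.Module):
    """Project a time-series foundation model's embedding into a handful of
    soft-prompt vectors in the LLM's embedding space, so sensor data can be fed
    to the LLM without first rendering it as text. Dimensions are illustrative."""

    def __init__(self, tfm_dim=256, llm_dim=4096, num_prefix_tokens=8, hidden=512):
        super().__init__()
        self.num_prefix_tokens = num_prefix_tokens
        self.proj = nn.Sequential(
            nn.Linear(tfm_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, llm_dim * num_prefix_tokens),
        )

    def forward(self, tfm_embedding):              # (batch, tfm_dim)
        batch = tfm_embedding.shape[0]
        out = self.proj(tfm_embedding)              # (batch, llm_dim * num_prefix_tokens)
        return out.view(batch, self.num_prefix_tokens, -1)   # prefix embeddings for the LLM

adapter = TimeSeriesToLLMAdapter()
sensor_embedding = torch.randn(2, 256)              # e.g. a TFM's summary of step-count data
prefix = adapter(sensor_embedding)                   # (2, 8, 4096), prepended to LLM inputs
```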