
FUTO Releases Comprehensive Open-Source Dataset of One Million English Swipes for Mobile Input Development

FUTO has announced the release of a significant dataset containing over one million QWERTY English swipes, now available on HuggingFace under the MIT license. The collection process began in August 2024, utilizing a voluntary mobile-based platform where users swiped Wikipedia-sourced sentences word-by-word. After filtering for quality, the final dataset was released in March 2025. This initiative aims to improve swipe typing models and provide a robust benchmark for evaluating different typing systems. FUTO utilized this data extensively to refine its own models, marking a major contribution to open-source mobile input technology and linguistic data accessibility. By providing this data under a permissive license, FUTO enables developers to enhance mobile keyboard accuracy and performance.

Hacker News

Key Takeaways

  • Massive Scale: The dataset contains over 1 million high-quality QWERTY English swipes collected from voluntary users.
  • Open Source Accessibility: Released under the MIT license, the data is freely available on HuggingFace for developers and researchers.
  • Rigorous Methodology: Data was collected word-by-word using Wikipedia sentences and underwent a filtering process to ensure quality.
  • Practical Application: FUTO has already utilized this dataset to train its own models and evaluate various swipe typing systems.
  • Timeline: The project spanned from initial collection in August 2024 to the public release in March 2025.

In-Depth Analysis

The Lifecycle of the FUTO Swipe Dataset

The development of the FUTO Swipe dataset represents a multi-stage effort to improve mobile input technology. The initiative began in August 2024 with the launch of a dedicated collection domain, swipe.futo.org. This platform was specifically designed for mobile users to contribute QWERTY English swipes. The process was built on a foundation of user consent; participants were provided with detailed instructions and information about the dataset before agreeing to contribute. This transparent approach ensured that the data collected was both ethical and focused on the specific needs of swipe typing models.

Between the start of collection and the eventual release, the project focused on gathering a diverse range of inputs. Users were presented with sentences primarily sourced from Wikipedia, which provided a broad vocabulary and varied sentence structures. The specific instruction to swipe "word-by-word" allowed for a more granular and accurate mapping of swipe gestures to specific English words. By March 2025, the effort had resulted in over 1 million swipes, which were then subjected to a filtering process. This quality control phase was essential to remove low-quality or erroneous swipes, ensuring that the final dataset would be a reliable resource for machine learning applications.
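The article does not describe how FUTO's filter works, so the following is only a minimal sketch of what a swipe quality check could look like. The `Swipe` record, the key-centre layout, and the thresholds are all assumptions made for illustration and are not taken from the dataset. The check drops swipes with too few points, and swipes that do not start and end near the first and last letters of the target word.

```python
from dataclasses import dataclass
from math import hypot

# Hypothetical QWERTY layout: key centres in key-width units.
# The row offsets are an assumption for illustration only.
ROWS = [("qwertyuiop", 0.0), ("asdfghjkl", 0.5), ("zxcvbnm", 1.5)]
KEY_CENTERS = {
    ch: (offset + col + 0.5, row + 0.5)
    for row, (letters, offset) in enumerate(ROWS)
    for col, ch in enumerate(letters)
}


@dataclass
class Swipe:
    """Hypothetical record: the target word and the sampled touch path."""
    word: str
    points: list[tuple[float, float]]  # (x, y) in key-width units


def is_plausible(swipe: Swipe, min_points: int = 3, max_dist: float = 1.5) -> bool:
    """Return True if the path is long enough and its endpoints lie
    within max_dist key widths of the word's first and last letters."""
    word = swipe.word.lower()
    if not word or any(ch not in KEY_CENTERS for ch in word):
        return False
    if len(swipe.points) < min_points:
        return False
    start_x, start_y = swipe.points[0]
    end_x, end_y = swipe.points[-1]
    first_x, first_y = KEY_CENTERS[word[0]]
    last_x, last_y = KEY_CENTERS[word[-1]]
    start_ok = hypot(start_x - first_x, start_y - first_y) <= max_dist
    end_ok = hypot(end_x - last_x, end_y - last_y) <= max_dist
    return start_ok and end_ok


swipes = [
    Swipe("hi", [(6.0, 1.5), (6.7, 1.0), (7.5, 0.5)]),  # starts on h, ends on i
    Swipe("hi", [(0.5, 0.5), (1.0, 1.0), (1.5, 0.5)]),  # drawn far from h and i
]
kept = [s for s in swipes if is_plausible(s)]
```

A real filter would likely use more signals, such as timing or the path's distance to the intermediate letters. The endpoint test is just the simplest check that catches swipes drawn for the wrong word.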

Methodology and Data Integrity

The methodology employed by FUTO highlights a commitment to data integrity and practical utility. By using a web-based mobile interface, FUTO was able to capture swipes in a naturalistic environment, on the actual devices where swipe typing is used. The choice of Wikipedia as the primary text source meant that the dataset covered a wide array of common and technical English terms, which should make models trained on it more robust for general-purpose typing.

The decision to release the dataset under the MIT license matters for the open-source community because the license permits reuse, modification, and redistribution with few restrictions. By hosting the more than 1 million swipes on HuggingFace, FUTO has made the data easily accessible to the global research community. This accessibility allows multiple parties to evaluate different swipe typing architectures against the same benchmark. FUTO's own use of the data to train and evaluate its models serves as a proof of concept for the dataset's usefulness in improving gesture-based text entry.

Industry Impact

The release of the FUTO Swipe dataset has several implications for the AI and mobile technology industries. First, it addresses a common bottleneck in the development of mobile keyboards: the lack of large-scale, open-source gesture data. While proprietary datasets exist, the availability of a 1-million-swipe dataset under the MIT license levels the playing field for independent developers and smaller tech firms.

Furthermore, the dataset provides a standardized way to evaluate swipe typing systems. By evaluating different algorithms against the same shared data, ideally on examples held out from training, the industry can compare their performance more accurately. This transparency can lead to faster iteration and improvements in swipe typing accuracy, speed, and user experience. FUTO's contribution reinforces the importance of open data in driving innovation within the niche but essential field of mobile human-computer interaction.
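One way to make such comparisons concrete is to score every decoder on the same held-out examples. The sketch below computes top-1 word accuracy. The `decode` callable and the `(path, word)` example format are assumptions for illustration, not FUTO's evaluation code or the dataset's actual schema.

```python
from typing import Callable, Iterable, Sequence

Point = tuple[float, float]
Example = tuple[Sequence[Point], str]  # (swipe path, intended word)


def top1_accuracy(
    decode: Callable[[Sequence[Point]], str],
    examples: Iterable[Example],
) -> float:
    """Fraction of examples where the decoder's best guess matches the
    intended word (case-insensitive)."""
    total = 0
    correct = 0
    for path, word in examples:
        total += 1
        if decode(path).lower() == word.lower():
            correct += 1
    if total == 0:
        raise ValueError("no examples to evaluate")
    return correct / total
```

Because each system is scored on an identical list of examples, a difference in the result reflects the decoder and not the test data. The examples should be excluded from training so that the score measures generalization.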

Frequently Asked Questions

Question: What is the licensing for the FUTO Swipe dataset?

The dataset is released under the MIT license, which allows for broad use, modification, and distribution in both open-source and commercial projects.

Question: Where can developers access the dataset?

The dataset of over 1 million swipes is currently available for download on HuggingFace, making it easy to integrate into existing machine learning workflows.
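As a rough sketch, a dataset hosted on HuggingFace can usually be loaded with the `datasets` library's `load_dataset` function. The article does not give the repository ID or the split names, so the helper below takes the ID as an argument, and the default `"train"` split is an assumption. `load_swipes` is a hypothetical helper name.

```python
def load_swipes(repo_id: str, split: str = "train"):
    """Load one split of a HuggingFace-hosted dataset.

    repo_id is the dataset's repository ID as shown on its HuggingFace
    page. The "train" default is an assumption, so check the dataset
    card for the splits that actually exist.
    """
    if not repo_id:
        raise ValueError("repo_id must be the dataset's HuggingFace repository ID")
    # Imported lazily so this module can be imported without the
    # library installed (pip install datasets).
    from datasets import load_dataset

    return load_dataset(repo_id, split=split)
```

Calling `load_swipes` with the ID copied from the dataset page downloads and caches the data locally on first use.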

Question: How was the data quality ensured during collection?

FUTO applied a filtering process after the initial collection phase to remove a small set of low-quality swipes, so that the final set of over 1 million swipes is suitable for training and evaluation.
