Wikipedia Articles
About
Access a wealth of information, including article titles, raw text, images, and structured references. Popular use cases include knowledge extraction, trend analysis, and content development.
The Wikipedia Articles dataset covers a vast collection of articles across a wide range of topics, from history and science to culture and current events. It provides structured data on articles, categories, and revision histories, enabling deep analysis of trends, knowledge gaps, and content evolution.
Tailored for researchers, data scientists, and content strategists, the dataset supports in-depth exploration of article evolution, topic popularity, and interlinking patterns. Whether you are studying public knowledge trends, performing sentiment analysis, or developing content strategies, it offers a rich resource for understanding how information is shared and consumed globally.
Dataset Features
- url: Direct URL to the original Wikipedia article.
- title: The title or name of the Wikipedia article.
- table_of_contents: A list or structure outlining the article's sections and hierarchy.
- raw_text: Unprocessed full text content of the article.
- cataloged_text: Cleaned and structured version of the article’s content, optimized for analysis.
- images: Links or data on images embedded in the article.
- see_also: Related articles linked under the article's "See also" section.
- references: Sources cited in the article for credibility.
- external_links: Links to external websites or resources mentioned in the article.
- categories: Tags or groupings classifying the article by topic or domain.
- timestamp: Last edit date or revision time of the article snapshot.
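As a quick sanity check, the schema above can be validated when loading a delivery. Below is a minimal sketch using Python's standard csv module; the inline sample row is made up for illustration and is not part of the product.

```python
import csv
import io

# The 11 documented columns, in order.
EXPECTED_COLUMNS = [
    "url", "title", "table_of_contents", "raw_text", "cataloged_text",
    "images", "see_also", "references", "external_links", "categories",
    "timestamp",
]

# A tiny in-memory sample standing in for a real CSV delivery.
sample = io.StringIO(
    ",".join(EXPECTED_COLUMNS) + "\n"
    "https://en.wikipedia.org/wiki/Data,Data,['History'],Data is...,"
    "Data is...,[],['Information'],['ref 1'],[],['Computing'],2024-01-01\n"
)

reader = csv.DictReader(sample)
rows = list(reader)

assert reader.fieldnames == EXPECTED_COLUMNS  # schema matches the listing
print(rows[0]["title"], rows[0]["timestamp"])  # Data 2024-01-01
```

For real files, replace the in-memory sample with `open("your_delivery.csv", newline="")`.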
Distribution
- Data Volume: 11 columns and 2.19M rows
- Format: CSV
Usage
This dataset supports a wide range of applications:
- Knowledge Extraction: Identify key entities, relationships, or events from Wikipedia content.
- Content Strategy & SEO: Discover trending topics and content gaps.
- Machine Learning: Train NLP models (e.g., summarisation, classification, QA systems).
- Historical Trend Analysis: Study how public interest in topics changes over time.
- Link Graph Modeling: Understand how information is interconnected.
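For instance, the link graph modeling use case can be sketched by turning the see_also column into an adjacency list. The rows below are made up, and the assumption that see_also is serialized as a Python-style list string is mine; adapt the parsing to the actual delivery format.

```python
import ast
from collections import defaultdict

# Hypothetical (title, see_also) pairs as they might appear in the CSV,
# assuming see_also is stored as a Python-style list string.
rows = [
    ("Data science", "['Machine learning', 'Statistics']"),
    ("Machine learning", "['Statistics']"),
    ("Statistics", "[]"),
]

# Build a directed adjacency list: article -> articles it links to.
graph = defaultdict(list)
for title, see_also in rows:
    for target in ast.literal_eval(see_also):
        graph[title].append(target)

# In-degree = how often an article is referenced: a crude popularity signal.
in_degree = defaultdict(int)
for targets in graph.values():
    for target in targets:
        in_degree[target] += 1

print(dict(in_degree))  # {'Machine learning': 1, 'Statistics': 2}
```

From here, the adjacency list can feed a proper graph library (e.g. networkx) for centrality or community analysis.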
Coverage
- Geographic Coverage: Global (multi-language Wikipedia versions also available)
- Time Range: Continuous updates; snapshots available from the early 2000s to the present.
License
CUSTOM
Please review the respective licenses below:
- Data Provider's License
Who Can Use It
- Data Scientists: For training or testing NLP and information retrieval systems.
- Researchers: For computational linguistics, social science, or digital humanities.
- Businesses: To enhance AI-powered content tools or customer insight platforms.
- Educators/Students: For building projects, conducting research, or studying knowledge systems.
Suggested Dataset Names
- Wikipedia Corpus+
- Wikipedia Stream Dataset
- Wikipedia Knowledge Bank
- Open Wikipedia Dataset
Pricing
Based on delivery frequency
Up to ~$0.0025 per record; minimum order $250
Approximately 283 new records are added each month.
Approximately 1.12M records are updated each month.
- Complete dataset: every delivery includes all records.
- Smart Updates: retrieve only the data you need.
- Monthly
New snapshot each month, 12 snapshots/year
Paid monthly
- Quarterly
New snapshot each quarter, 4 snapshots/year
Paid quarterly
- Bi-annual
New snapshot every 6 months, 2 snapshots/year
Paid twice a year
- One-time purchase
New snapshot one-time delivery
Paid once
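Taking the listed figures at face value (up to $0.0025 per record, $250 minimum order, 2.19M rows), a back-of-the-envelope cost sketch:

```python
# Figures from the listing; actual pricing may vary by negotiation and volume.
PRICE_PER_RECORD = 0.0025   # USD, upper bound per record
MIN_ORDER = 250.0           # USD, minimum order value
TOTAL_ROWS = 2_190_000      # listed dataset size

# Smallest order at the cap price, and the cost of the full dataset.
min_records = MIN_ORDER / PRICE_PER_RECORD
full_dataset_cost = TOTAL_ROWS * PRICE_PER_RECORD

print(f"{min_records:,.0f} records")   # 100,000 records
print(f"${full_dataset_cost:,.2f}")    # $5,475.00
```

So the $250 minimum buys roughly 100,000 records at the cap price, and a one-time purchase of all 2.19M rows would cost at most about $5,475 before any volume discount.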