

imdb_reviews

  • Description :

Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.

Additional Documentation : Explore on Papers With Code

Homepage : http://ai.stanford.edu/~amaas/data/sentiment/

Source code : tfds.datasets.imdb_reviews.Builder

  • 1.0.0 (default): New split API ( https://tensorflow.org/datasets/splits )

Download size : 80.23 MiB

Auto-cached ( documentation ): Yes

Supervised keys (See as_supervised doc ): ('text', 'label')

Figure ( tfds.show_examples ): Not supported.

imdb_reviews/plain_text (default config)

Config description : Plain text

Dataset size : 129.83 MiB

Feature structure :

  • Feature documentation :
  • Examples ( tfds.as_dataframe ):

imdb_reviews/bytes

Config description : Uses byte-level text encoding with tfds.deprecated.text.ByteTextEncoder

Dataset size : 129.88 MiB

imdb_reviews/subwords8k

Config description : Uses tfds.deprecated.text.SubwordTextEncoder with 8k vocab size

Dataset size : 54.72 MiB

imdb_reviews/subwords32k

Config description : Uses tfds.deprecated.text.SubwordTextEncoder with 32k vocab size

Dataset size : 50.33 MiB
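The catalog entry above can be exercised with a short loader. This is a minimal sketch, assuming `tensorflow_datasets` is installed and that the four config names listed on this page are still available in the installed version (the `bytes` and `subwords*` configs rely on deprecated encoders, so they may not be). The helper names `dataset_name` and `load_imdb_reviews` are hypothetical, introduced only for illustration.

```python
# Config names copied from the catalog entries on this page.
CONFIGS = ("plain_text", "bytes", "subwords8k", "subwords32k")


def dataset_name(config="plain_text"):
    """Build the 'imdb_reviews/<config>' name that tfds.load accepts."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config: {config!r}")
    return f"imdb_reviews/{config}"


def load_imdb_reviews(split="train", config="plain_text"):
    """Return a tf.data.Dataset of (text, label) pairs.

    The import is local so that dataset_name() works without TensorFlow.
    The first call downloads the data (about 80 MiB per the catalog).
    """
    import tensorflow_datasets as tfds

    # as_supervised=True yields tuples ordered by the supervised keys
    # ('text', 'label') instead of feature dictionaries.
    return tfds.load(dataset_name(config), split=split, as_supervised=True)


if __name__ == "__main__":
    # Print the label and the first 80 bytes of one training review.
    for text, label in load_imdb_reviews().take(1):
        print(label.numpy(), text.numpy()[:80])
```

Passing `split="test"` returns the held-out reviews in the same format.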


Last updated 2022-12-10 UTC.

Movie Review Data

Sentiment polarity datasets.

  • polarity dataset v2.0 (3.0Mb) (includes README v2.0): 1000 positive and 1000 negative processed reviews. Introduced in Pang/Lee ACL 2004. Released June 2004.
  • Pool of 27886 unprocessed html files (81.1Mb) from which the polarity dataset v2.0 was derived. (This file is identical to movie.zip from data release v1.0.)
  • sentence polarity dataset v1.0 (includes sentence polarity dataset README v1.0): 5331 positive and 5331 negative processed sentences / snippets. Introduced in Pang/Lee ACL 2005. Released July 2005.
  • polarity dataset v1.1 (2.2Mb) (includes README.1.1): approximately 700 positive and 700 negative processed reviews. Released November 2002. This alternative version was created by Nathan Treloar, who removed a few non-English/incomplete reviews and changed some of the labels (judging some polarities to be different from the original author's rating). The complete list of changes made to v1.1 can be found in diff.txt.
  • polarity dataset v0.9 (2.8Mb) (includes a README): 700 positive and 700 negative processed reviews. Introduced in Pang/Lee/Vaithyanathan EMNLP 2002. Released July 2002. Please read the "Rating Information - WARNING" section of the README.
  • movie.zip (81.1Mb): all html files we collected from the IMDb archive.

Sentiment scale datasets

  • Sep 30, 2009: Yanir Seroussi points out that due to some misformatting in the raw html files, six reviews are misattributed to Dennis Schwartz (29411 should be Max Messier, 29412 should be Norm Schrager, 29418 should be Steve Rhodes, 29419 should be Blake French, 29420 should be Pete Croatto, 29422 should be Rachel Gordon) and one (23982) is blank.

Subjectivity datasets

  • subjectivity dataset v1.0 (508K) (includes subjectivity README v1.0 ): 5000 subjective and 5000 objective processed sentences. Introduced in Pang/Lee ACL 2004. Released June 2004.
  • Pool of unprocessed source documents (9.3Mb) from which the sentences in the subjectivity dataset v1.0 were extracted. Note : On April 2, 2012, we replaced the original gzipped tarball with one in which the subjective files are now in the correct directory (so that the subjectivity directory is no longer empty; the subjective files were mistakenly placed in the wrong directory, although distinguishable by their different naming scheme).

If you have any questions or comments regarding this site, please send email to Bo Pang or Lillian Lee .


IMDB movie review sentiment classification dataset

load_data function

Loads the IMDB dataset.

This is a dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a list of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".

As a convention, "0" does not stand for a specific word, but instead is used to encode the pad token.

Arguments

  • path : where to cache the data (relative to ~/.keras/datasets).
  • num_words : integer or None. Words are ranked by how often they occur (in the training set) and only the num_words most frequent words are kept. Any less frequent word will appear as the oov_char value in the sequence data. If None, all words are kept. Defaults to None.
  • skip_top : skip the top N most frequently occurring words (which may not be informative). These words will appear as the oov_char value in the dataset. When 0, no words are skipped. Defaults to 0.
  • maxlen : int or None. Maximum sequence length. Any longer sequence will be truncated. None means no truncation. Defaults to None.
  • seed : int. Seed for reproducible data shuffling.
  • start_char : int. The start of a sequence will be marked with this character. 0 is usually the padding character. Defaults to 1.
  • oov_char : int. The out-of-vocabulary character. Words that were cut out because of the num_words or skip_top limits will be replaced with this character.
  • index_from : int. Index actual words with this index and higher.

Returns

  • Tuple of Numpy arrays : (x_train, y_train), (x_test, y_test).

x_train, x_test : lists of sequences, which are lists of indexes (integers). If the num_words argument was specified, the maximum possible index value is num_words - 1. If the maxlen argument was specified, the largest possible sequence length is maxlen.

y_train , y_test : lists of integer labels (1 or 0).

Note : The 'out of vocabulary' character is only used for words that were present in the training set but are not included because they're not making the num_words cut here. Words that were not seen in the training set but are in the test set have simply been skipped.
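The encoding rules described above can be sketched in plain Python. This is an illustration of the documented behavior, not the library's implementation: `encode_review` and `word_rank` are hypothetical names, the defaults `oov_char=2` and `index_from=3` are assumptions (only `start_char=1` is stated above), and comparing the shifted index against `num_words` and `skip_top` is an assumption about how those limits interact with `index_from`.

```python
def encode_review(tokens, word_rank, num_words=None, skip_top=0,
                  start_char=1, oov_char=2, index_from=3):
    """Encode a tokenized review as a list of integer indexes.

    word_rank maps each training-set word to its frequency rank
    (1 = most frequent word).
    """
    encoded = [start_char]  # every sequence begins with the start marker
    for token in tokens:
        rank = word_rank.get(token)
        if rank is None:
            # Word never seen in the training set: skipped, not replaced.
            continue
        index = rank + index_from  # actual words start at index_from + 1
        too_rare = num_words is not None and index >= num_words
        too_common = index < skip_top
        encoded.append(oov_char if too_rare or too_common else index)
    return encoded


# Toy vocabulary for illustration only.
ranks = {"the": 1, "movie": 2, "was": 3, "great": 4}
print(encode_review(["the", "movie", "was", "great"], ranks, num_words=6))
# prints [1, 4, 5, 2, 2]
```

With `num_words=6`, the largest index that survives is 5, which matches the "maximum possible index value is num_words - 1" rule stated above.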

get_word_index function

Retrieves a dict mapping words to their index in the IMDB dataset.

The word index dictionary. Keys are word strings, values are their index.
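A word index of this shape can be inverted to turn an encoded review back into text. The sketch below is illustrative: `decode_review` is a hypothetical helper, the placeholder strings are arbitrary, and it assumes the dictionary stores unshifted frequency ranks while encoded sequences are shifted by `index_from=3`, with 0, 1, and 2 reserved for padding, start, and out-of-vocabulary markers.

```python
def decode_review(sequence, word_index, index_from=3):
    """Map a list of integer indexes back to a space-separated string."""
    # Shift ranks by index_from so they line up with the encoded sequences.
    lookup = {rank + index_from: word for word, rank in word_index.items()}
    # Reserved indexes (assumed defaults) take precedence over words.
    lookup.update({0: "[PAD]", 1: "[START]", 2: "[OOV]"})
    return " ".join(lookup.get(index, "[OOV]") for index in sequence)


# Toy dictionary for illustration only.
toy_index = {"the": 1, "movie": 2}
print(decode_review([1, 4, 5, 2], toy_index))
# prints [START] the movie [OOV]
```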

Datasets: rotten_tomatoes

Dataset Card for "rotten_tomatoes"

Dataset Summary

Movie Review Dataset. This is a dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. This data was first used in Bo Pang and Lillian Lee, "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales", Proceedings of the ACL, 2005.

Supported Tasks and Leaderboards

More Information Needed

Dataset Structure

Data Instances

  • Size of downloaded dataset files: 0.49 MB
  • Size of the generated dataset: 1.34 MB
  • Total amount of disk used: 1.84 MB

An example of 'validation' looks as follows.

Data Fields

The data fields are the same among all splits.

  • text : a string feature.
  • label : a classification label, with possible values including neg (0), pos (1).

Data Splits

Reads Rotten Tomatoes sentences and splits them into 80% train, 10% validation, and 10% test, following the practice set out in Jinfeng Li, "TEXTBUGGER: Generating Adversarial Text Against Real-world Applications".
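An 80/10/10 split of this kind can be sketched as follows. This is not the dataset's own splitting code: `split_80_10_10` is a hypothetical helper, and the shuffle, the seed, and the integer rounding are assumptions made for illustration.

```python
import random


def split_80_10_10(examples, seed=0):
    """Shuffle the examples and return (train, validation, test) lists."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    n_train = len(items) * 8 // 10      # 80% for training
    n_val = len(items) // 10            # 10% for validation
    train = items[:n_train]
    validation = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]      # whatever remains (about 10%)
    return train, validation, test


train, validation, test = split_80_10_10(range(100))
print(len(train), len(validation), len(test))
# prints 80 10 10
```

Because the test split takes the remainder, no example is dropped when the total is not divisible by ten.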

Dataset Creation

Curation rationale, source data, initial data collection and normalization, who are the source language producers, annotations, annotation process, who are the annotators, personal and sensitive information, considerations for using the data, social impact of dataset, discussion of biases, other known limitations, additional information, dataset curators, licensing information, citation information, contributions.

Thanks to @thomwolf , @jxmorris12 for adding this dataset.

Models trained or fine-tuned on rotten_tomatoes

  • sileod/deberta-v3-small-tasksource-nli
  • sileod/deberta-v3-base-tasksource-nli
  • sileod/deberta-v3-large-tasksource-nli
  • pig4431/rtm_distilbert_5e
  • rjzauner/distilbert_rotten_tomatoes_sentiment_classifier
  • Hazqeel/electra-small-finetuned-sst2-rotten_tomatoes-distilled

IMDB Large Movie Review Dataset

The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets. The overall distribution of labels is balanced (25k pos and 25k neg).

http://ai.stanford.edu/~amaas/data/sentiment/

Character, path to directory where data will be stored. If NULL , user_cache_dir will be used to determine path.

Character. Return training ("train") data or testing ("test") data. Defaults to "train".

Logical, set TRUE to delete dataset.

Logical, set TRUE to return the path of the dataset.

Logical, set TRUE to remove intermediate files. This can greatly reduce the size. Defaults to FALSE.

Logical, set TRUE if you have manually downloaded the file and placed it in the folder designated by running this function with return_path = TRUE .

A tibble with 25,000 rows and 2 variables:

Character, denoting the sentiment

Character, text of the review

In the entire collection, no more than 30 reviews are allowed for any given movie because reviews for the same movie tend to have correlated ratings. Further, the train and test sets contain disjoint sets of movies, so no significant performance is obtained by memorizing movie-unique terms and their association with observed labels. In the labeled train/test sets, a negative review has a score <= 4 out of 10, and a positive review has a score >= 7 out of 10. Thus reviews with more neutral ratings are not included in the train/test sets. In the unsupervised set, reviews of any rating are included, and there are an even number of reviews with scores > 5 and <= 5.
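The labeling and per-movie cap described above can be sketched in a few lines. This is an illustration, not the authors' preprocessing code: both function names are hypothetical, and keeping the first 30 reviews encountered for a movie is an assumption, since the text does not say which reviews were retained.

```python
from collections import defaultdict


def label_from_score(score):
    """Return 'neg' for scores <= 4, 'pos' for scores >= 7, else None."""
    if score <= 4:
        return "neg"
    if score >= 7:
        return "pos"
    return None  # neutral ratings (5 and 6) are excluded from train/test


def cap_reviews_per_movie(reviews, limit=30):
    """Keep at most `limit` reviews per movie.

    `reviews` is an iterable of (movie_id, text, score) tuples.
    """
    counts = defaultdict(int)
    kept = []
    for movie_id, text, score in reviews:
        if counts[movie_id] < limit:
            counts[movie_id] += 1
            kept.append((movie_id, text, score))
    return kept


print([label_from_score(s) for s in (1, 4, 5, 6, 7, 10)])
# prints ['neg', 'neg', None, None, 'pos', 'pos']
```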

When using this dataset, please cite the ACL 2011 paper

@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}


MR (MR Movie Reviews)

MR Movie Reviews is a dataset for use in sentiment-analysis experiments. Available are collections of movie-review documents labeled with respect to their overall sentiment polarity (positive or negative) or subjective rating (e.g., "two and a half stars") and sentences labeled with respect to their subjectivity status (subjective or objective) or polarity.



tiangechen / movies.csv


This spreadsheet shows the top-grossing movies between 2007 and 2011. Several missing or incorrect values have been fixed in line with related sources.

Missing Data : "Leading Studio" for the movie No Reservations (2007) has been filled. "Audience score" and "Rotten Tomatoes %" for the movie Something Borrowed (2011) have been filled.

Corrections : "Worldwide Gross" for the movie Tangled (2011) has been corrected.
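A spreadsheet like this can be summarized with the standard library alone. The sketch below is illustrative: the column names are taken from the description above and may not match the actual header of movies.csv, the rows in `SAMPLE` are made-up placeholders rather than data from the file, and `mean_by_group` is a hypothetical helper.

```python
import csv
import io

# Placeholder rows for illustration only; not data from movies.csv.
SAMPLE = """Film,Leading Studio,Audience score,Rotten Tomatoes %,Worldwide Gross,Year
Film A,Studio X,70,50,100.0,2008
Film B,Studio X,80,60,200.0,2009
Film C,Studio Y,60,40,50.0,2010
"""


def mean_by_group(rows, group_col, value_col):
    """Average a numeric column for each distinct value of group_col."""
    totals = {}
    for row in rows:
        total, count = totals.get(row[group_col], (0.0, 0))
        totals[row[group_col]] = (total + float(row[value_col]), count + 1)
    return {group: total / count for group, (total, count) in totals.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
averages = mean_by_group(rows, "Leading Studio", "Audience score")
print(averages)
# prints {'Studio X': 75.0, 'Studio Y': 60.0}
```

To read a downloaded copy instead, replace `io.StringIO(SAMPLE)` with an open file handle. Values that carry currency symbols or percent signs would need to be cleaned before `float()` is applied.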

@efdeel

efdeel commented Dec 6, 2018 • edited

I have written some Python code to visualize and analyze this data set (with numpy, pandas, scipy & matplotlib). Drop me a message to discuss, or if you would like me to share my code...

btw....this is my first post in github :)


@ourekouch

ourekouch commented Apr 21, 2019

hello efdeel , do you have any code for topic modeling abt this dataset

@PascalKu

PascalKu commented Sep 6, 2019

I think, the year of "Youth in Revolt" is wrong. I googled it and it is from Oct. 2009...

@N4877

N4877 commented Nov 15, 2021

I fetch about average rating of movies

@MahmoudMohajer

MahmoudMohajer commented Dec 5, 2021

great for anyone who wants to experiment with csv file formats.

@rorocan

rorocan commented Jan 9, 2023

very interesting

@leandro-moraes85

leandro-moraes85 commented Jan 29, 2023

This file is also useful for Azure training

IMAGES

  1. IMDB 5000+ Movie Dataset Analysis

  2. Implementing a Recommendation System on IMDB Dataset through Machine

  3. python 3.x

  4. Exploring IMDB reviews in TensorFlow Datasets

  5. Build Recommender Systems with Movielens Dataset in Python

  6. IMDB dataset (Sentiment analysis) in CSV format


COMMENTS

  1. IMDB Dataset of 50K Movie Reviews

    Large Movie Review Dataset.

  2. IMDb Movie Reviews Dataset

    The IMDb Movie Reviews dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb) labeled as positive or negative. The dataset contains an even number of positive and negative reviews. Only highly polarizing reviews are considered. A negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10.

  3. Large Movie Review Dataset

    Sentiment Analysis. Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.

  4. imdb_reviews

    imdb_reviews. Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.

  5. Movie Review Data

    Movie Review Data This page is a distribution site for movie-review data for use in sentiment-analysis experiments. Available are collections of movie-review documents labeled with respect to their overall sentiment polarity (positive or negative) or subjective rating (e.g., "two and a half stars") and sentences labeled with respect to their subjectivity status (subjective or objective) or ...

  6. IMDB Dataset of 50K Movie Reviews

    IMDB dataset having 50K movie reviews for natural language processing or text analytics. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training and ...

  7. Preparing IMDB Movie Review Data for NLP Experiments

    A common dataset for natural language processing (NLP) experiments is the IMDB movie review data. The goal of an IMDB dataset problem is to predict if a movie review has positive sentiment ("It was a great movie") or negative sentiment ("The film was a waste of time"). A major challenge when working with the IMDB dataset is preparing the data.

  8. IMDb Movie Reviews Dataset

    This dataset contains nearly 1 Million unique movie reviews from 1150 different IMDb movies spread across 17 IMDb genres - Action, Adventure, Animation, Biography, Comedy, Crime, Drama, Fantasy, History, Horror, Music, Mystery, Romance, Sci-Fi, Sport, Thriller and War. The dataset also contains movie metadata such as date of release of the movie, run length, IMDb rating, movie rating (PG-13, R ...

  9. Use Sentiment Analysis With Python to Classify Movie Reviews

    Explore different ways to pass in new reviews to generate predictions. Parametrize options such as where to save and load trained models, whether to skip training or train a new model, and so on. This project uses the Large Movie Review Dataset, which is maintained by Andrew Maas. Thanks to Andrew for making this curated dataset widely ...

  10. IMDB movie review sentiment classification dataset

    Loads the IMDB dataset. This is a dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a list of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most ...

  11. rotten_tomatoes · Datasets at Hugging Face

    Dataset Summary: Movie Review Dataset. This is a dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. This data was first used in Bo Pang and Lillian Lee, "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales", Proceedings of the ACL ...

  12. Sentiment Analysis on IMDB Movie Reviews

    Notebook to train an XLNet model to perform sentiment analysis. The dataset used is a balanced collection of 50,000 IMDB movie reviews (1:1 train-test ratio) with binary labels, positive or negative, from the paper by Maas et al. (2011). The current state-of-the-art model on this dataset is XLNet by Yang et al. (2019), which has an accuracy of 96.2%. We get an accuracy of 92.2% due to the ...

  13. Movie-Review-Sentiment-Analysis/IMDB-Dataset.csv at master

    Sentiment of a movie review is predicted using three different neural network models - MLP, CNN and LSTM. GloVe embedding is used for vector representation of words. - SK7here/Movie-Review-Sentim...

  14. IMDB Large Movie Review Dataset

    The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets. The overall distribution of labels is balanced (25k pos and 25k neg). ... In the entire collection, no more than 30 reviews are allowed for any given movie because reviews for the same movie tend to have correlated ratings. Further, the train and test sets ...

  15. MR Dataset

    MR Movie Reviews is a dataset for use in sentiment-analysis experiments. Available are collections of movie-review documents labeled with respect to their overall sentiment polarity (positive or negative) or subjective rating (e.g., "two and a half stars") and sentences labeled with respect to their subjectivity status (subjective or objective) or polarity.

  16. IMDB Movie review.ipynb

    The Large Movie Review Dataset (often referred to as the IMDB dataset) contains 25,000 highly-polar movie reviews (good or bad) for training and the same amount again for testing. The problem is to determine whether a given movie review has a positive or negative sentiment. The data was collected by Stanford researchers and was used in a 2011 ...

  17. movies.csv · GitHub

    movies.csv. This spreadsheet shows the top-grossing movies between 2007 and 2011. Several missing or incorrect values have been fixed in line with related sources. Missing Data : "Leading Studio" for the movie No Reservations (2007) has been filled. "Audience score" and "Rotten Tomatoes %" for the movie Something Borrowed (2011) have been filled.

  18. movie-review-datasets · GitHub Topics · GitHub

    Add this topic to your repo. To associate your repository with the movie-review-datasets topic, visit your repo's landing page and select "manage topics." GitHub is where people build software. More than 100 million people use GitHub to discover, fork, and contribute to over 420 million projects.

  19. Rottentomatoes 400k Movie Reviews (Over 9k Movies)

    A Rotten Tomatoes dataset of 400k movie reviews with review texts and scores.