This dataset is probably preferable for sentiment analysis tasks.
aggressively deduplicated data (18 GB)
No duplicates whatsoever (82.83 million reviews). This file removes duplicates more aggressively, dropping duplicate reviews even if they were written by different users. This accounts for users with multiple accounts and for plagiarized reviews.
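As a rough illustration of what deduplicating across users could look like, here is a minimal Python sketch. It is not the procedure used to build the file, which is not described here. It assumes reviews are dictionaries with hypothetical `reviewerID` and `reviewText` fields, and that two reviews count as duplicates when their text matches after lowercasing and collapsing whitespace.

```python
import hashlib


def normalize(text):
    # Assumption: copies that differ only in case or spacing are duplicates.
    return " ".join(text.lower().split())


def dedupe_by_text(reviews, text_key="reviewText"):
    """Keep the first review for each distinct normalized text.

    The reviewer is deliberately ignored, so identical text posted from
    different accounts collapses to a single review.
    """
    seen = set()
    kept = []
    for review in reviews:
        digest = hashlib.sha1(normalize(review[text_key]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(review)
    return kept


# Hypothetical sample records; field names are assumed for illustration.
sample = [
    {"reviewerID": "A1", "reviewText": "Great product, works well."},
    {"reviewerID": "A2", "reviewText": "Great product,  works well."},
    {"reviewerID": "A1", "reviewText": "Stopped working after a week."},
]

unique = dedupe_by_text(sample)
print(len(unique))
```

Hashing the normalized text keeps memory use bounded per review, which matters at the scale of tens of millions of records. A user-aware variant would instead key on the pair of reviewer and text, and would keep the second sample record.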
