The short tutorial below demonstrates Hive using the Google NGrams dataset, which is available in HDFS at /var/ngrams.
    # Open the interactive Hive console (run this from the shell)
    hive --hiveconf tez.queue.name=<your_queue>

    -- Create a table with the Google NGrams data in /var/ngrams
    CREATE EXTERNAL TABLE ngrams_<your_uniqname>(ngram STRING, year INT, count BIGINT, volumes BIGINT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION '/var/ngrams';

    -- Look at the schema of the table
    DESCRIBE ngrams_<your_uniqname>;

    -- Count the total number of rows (should be 1430731493)
    SELECT COUNT(*) FROM ngrams_<your_uniqname>;

    -- Select the number of words, by year, that have only appeared in a single volume
    SELECT year, COUNT(ngram) FROM ngrams_<your_uniqname> WHERE volumes = 1 GROUP BY year;

    -- Optional: delete your ngrams table
    DROP TABLE ngrams_<your_uniqname>;

    -- Exit the Hive console
    QUIT;
The last few lines of output should look something like this:
More information can be found on the Apache Hive website.
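The tutorial's statements can also be run without opening the interactive console, by saving them to a file and passing it to `hive -f`. The following is a minimal sketch, not a site-specific recipe: `your_queue` and `your_uniqname` are placeholders to replace, the script file name is hypothetical, and `IF NOT EXISTS` is added here (it is not part of the tutorial) so that rerunning the script does not fail on an existing table.

```shell
#!/usr/bin/env bash
# Sketch: run the tutorial's single-volume query non-interactively.
# QUEUE and UNIQNAME are placeholders; replace them with your own values.
QUEUE="${QUEUE:-your_queue}"
UNIQNAME="${UNIQNAME:-your_uniqname}"
SCRIPT="ngrams_${UNIQNAME}.hql"   # hypothetical file name

# Write the HiveQL statements to a script file.
# The heredoc expands ${UNIQNAME} but leaves '\t' as a literal backslash-t.
cat > "$SCRIPT" <<EOF
-- Create the external table over /var/ngrams if it is not already defined
CREATE EXTERNAL TABLE IF NOT EXISTS ngrams_${UNIQNAME}(ngram STRING, year INT, count BIGINT, volumes BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/var/ngrams';

-- Number of words, by year, that appear in only a single volume
SELECT year, COUNT(ngram) FROM ngrams_${UNIQNAME} WHERE volumes = 1 GROUP BY year;
EOF

# Run the script only where the hive client is installed.
if command -v hive >/dev/null 2>&1; then
  hive --hiveconf tez.queue.name="$QUEUE" -f "$SCRIPT"
else
  echo "hive not found; wrote $SCRIPT without running it" >&2
fi
```

Redirecting the output of the `hive` command to a file (for example with `> results.tsv`) keeps the per-year counts for later use. Because the table is external, the data in /var/ngrams is untouched whether or not the table is dropped afterwards.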