Hive

The following short tutorial demonstrates Hive using the Google NGrams dataset, which is available in HDFS at /var/ngrams.

# Open the interactive hive console
hive --hiveconf tez.queue.name=<your_queue>

-- Create an external table over the Google NGrams data in /var/ngrams
CREATE EXTERNAL TABLE ngrams_<your_uniqname>(ngram STRING, year INT, count BIGINT, volumes BIGINT)
     ROW FORMAT DELIMITED
     FIELDS TERMINATED BY '\t'
     STORED AS TEXTFILE
     LOCATION '/var/ngrams';

-- Look at the schema of the table
DESCRIBE ngrams_<your_uniqname>;
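
-- Optional sketch, not part of the original tutorial: preview a few rows to
-- check that the tab-delimited columns line up with the schema
SELECT * FROM ngrams_<your_uniqname> LIMIT 10;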

-- Count the total number of rows (should be 1430731493)
SELECT COUNT(*) FROM ngrams_<your_uniqname>;

-- Count, per year, the words that appeared in only a single volume
SELECT year, COUNT(ngram)
FROM ngrams_<your_uniqname>
WHERE volumes = 1
GROUP BY year;
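
-- Optional sketch, not part of the original tutorial: GROUP BY alone does not
-- guarantee row order, so adding ORDER BY lists the per-year counts chronologically.
-- The alias single_volume_words is an illustrative name.
SELECT year, COUNT(ngram) AS single_volume_words
FROM ngrams_<your_uniqname>
WHERE volumes = 1
GROUP BY year
ORDER BY year;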

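-- Optional: delete your ngrams table. Because the table is EXTERNAL, this removes
-- only the table definition; the data in /var/ngrams is left in place.
DROP TABLE ngrams_<your_uniqname>;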

-- Exit the Hive console
QUIT;

The last few lines of output should look something like this:

More information can be found on the Apache Hive website.
