What is Pig
Pig is a program that reads scripts written in Pig Latin and automatically generates MapReduce programs from them. It is generally much easier to use Pig than to write your own MapReduce programs. Grunt is the interactive shell for Pig.
Ways to start Pig
Pig in interactive local mode
Runs in a single Java virtual machine (JVM)
All files are in the local file system
>pig -x local
grunt>
Pig in interactive MapReduce mode
Runs in a Hadoop cluster
Is the default mode
>pig -x mapreduce
grunt>
Pig executing Pig script file in local mode
Script is written in Pig Latin
>pig -x local myscript.pig
Pig executing Pig script file in MapReduce mode
Script is written in Pig Latin
>pig myscript.pig
Pig Data
Tuple - like a row in a file or a row in a table, except that tuples do not all have to have the same number of items. They can contain scalar types such as int, chararray, double, etc., or even bags. Example: (Jeff, {(apple), (orange), (pear)}). Parentheses are used to indicate the tuple datatype.
Bag - a collection of tuples. Curly braces are used to indicate the bag datatype.
Relation - an outer bag. Generally it is what you get back when you filter, group, sort, join, etc. In database terms, it is kind of like a view or result set.
Data Type | Description | Example |
---|---|---|
int | signed 32-bit integer | 300 |
long | signed 64-bit integer | 300L or 300l |
float | 32-bit floating point | 3.2F, 3.2f, 3.2e2f, 3.2E2F |
double | 64-bit floating point | 3.2, 3.2e2, 3.2E2 |
chararray | a string | abcde |
bytearray | a blob | |
tuple | ordered set of fields; kind of like a row in a database | (4, Brent, 388.25) |
bag | collection of tuples; kind of like multiple rows in a database. Could also be thought of as an array, list, collection, etc. | {(4, Brent, 388.25), (20, Amanda, 36.7)} |
map | set of key value pairs | [open#apache] |
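As a rough sketch of how these types appear in practice, the schema below declares one field of each kind. The file name and every field name are made up for illustration; only the type syntax matters here.

```pig
-- Hypothetical file and field names, for illustration only.
students = LOAD 'students.txt' AS (
    id:int,
    name:chararray,
    gpa:double,
    address:tuple(city:chararray, zip:chararray),   -- a nested tuple
    courses:bag{t:tuple(course:chararray)},         -- a bag of one-field tuples
    extra:map[]                                     -- a map with untyped values
);
-- Print the schema Pig has associated with the relation.
DESCRIBE students;
```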
Pig Latin Basics
Statements terminate with a semicolon
/*...*/ comment block
-- single line comment
Names of relations and fields are case sensitive
Function names are case sensitive
Keywords such as LOAD, USING, AS, GROUP, BY, etc. are NOT case sensitive
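The rules above can be seen together in a short sketch. The path is the placeholder used elsewhere in these notes and the field names are assumptions.

```pig
/* Block comment: load a tab-delimited file
   (the path and field names are placeholders). */
A = LOAD '/datadir/datafile' AS (f1:int, f2:chararray);  -- keywords in upper case
B = load '/datadir/datafile' as (f1:int, f2:chararray);  -- same statement, keywords in lower case

-- Relation and field names are case sensitive, so A and a would be
-- two different relations, and f1 and F1 two different fields.

-- Function names are case sensitive too: the built-in is UPPER, not upper.
C = FOREACH A GENERATE UPPER(f2);
```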
Loading Data
A = load '/datadir/datafile' using PigStorage('\t');
NOTE: tab is the default delimiter
NOTE: If the path to a file is a directory then all files in the directory will be loaded
The default is PigStorage, but there is also BinStorage, TextLoader, JsonLoader, and you can code your own loader as well.
You can also define a schema so that you can refer to fields by name (f1, f2, f3, etc.)
A = load '/datadir/datafile' using PigStorage('\t') as (f1:int, f2:chararray, f3:float);
If you don't specify the schema you need to use the position. For example, $0 is the first position.
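A minimal sketch of the two ways to reference fields, by position and by name. It reuses the placeholder path from above; FOREACH ... GENERATE is used only to pick out fields.

```pig
-- No schema: fields are referenced by position.
A = LOAD '/datadir/datafile' USING PigStorage('\t');
B = FOREACH A GENERATE $0, $2;   -- first and third fields

-- With a schema: the same projection by name.
C = LOAD '/datadir/datafile' USING PigStorage('\t') AS (f1:int, f2:chararray, f3:float);
D = FOREACH C GENERATE f1, f3;
```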
OUTPUT
STORE is the opposite of LOAD: it writes a relation to the file system.
Formats: PigStorage(), BinStorage(), PigDump(), JsonStorage()
DUMP writes the results to the screen.
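As a sketch of both kinds of output: Pig's STORE operator writes a relation to the file system and DUMP prints it. The output directory below is a hypothetical name.

```pig
A = LOAD '/datadir/datafile' USING PigStorage('\t') AS (f1:int, f2:chararray, f3:float);

-- Write the relation as comma-delimited text ('/datadir/output' is a made-up directory).
STORE A INTO '/datadir/output' USING PigStorage(',');

-- Print the relation to the screen instead.
DUMP A;
```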
Operators
Arithmetic: +, -, *, /, %, ?: (bincond)
Boolean: and, or, not
Comparison: ==, !=, <, >, <=, >=, is null, is not null
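The operators can be combined as in the following sketch. The relation, file, and field names are assumptions chosen for illustration.

```pig
-- Hypothetical relation and field names.
books = LOAD 'books.txt' AS (title:chararray, author:chararray, pubyear:int, price:double, pages:int);

-- Comparison and Boolean operators in a FILTER condition.
recent = FILTER books BY (pubyear > 2010 AND price < 50.0) OR NOT (author IS NULL);

-- Arithmetic and the ?: (bincond) operator in a projection.
priced = FOREACH recent GENERATE title,
    price / pages AS price_per_page,
    pages % 2 AS odd_pages,
    (pages > 500 ? 'long' : 'short') AS length_class;
```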
Parameters
Parameters can be passed into a Pig script via a parameter file or the command line.
Parameters are referenced using $
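A small sketch of parameter substitution, assuming a parameter named year and a parameter file named myparams.txt (both hypothetical). The command lines are shown as comments because they run in the operating system shell, not in Pig.

```pig
-- Invocation from the operating system shell, either of:
--   pig -param year=2014 myscript.pig
--   pig -param_file myparams.txt myscript.pig
-- where myparams.txt contains the line: year=2014

-- Inside myscript.pig, $year is substituted before the script runs.
data = LOAD '/datadir/datafile' AS (title:chararray, author:chararray, pubyear:int);
b = FILTER data BY pubyear == $year;
DUMP b;
```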
Relational Operators
FILTER
Selects tuples from a relation based on some criteria
b = filter data by pubyear == 2014;
ORDER BY
Sorts a relation on one or more fields
b = order data by author ASC;
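Put together, FILTER and ORDER BY might look like the sketch below. The schema is an assumption based on the field names used in the one-line examples.

```pig
data = LOAD '/datadir/datafile' AS (title:chararray, author:chararray, pubyear:int);

-- Keep only tuples that satisfy both conditions.
recent = FILTER data BY pubyear >= 2010 AND author IS NOT NULL;

-- Sort on two fields: newest first, then by author within a year.
sorted = ORDER recent BY pubyear DESC, author ASC;
DUMP sorted;
```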
FOREACH
Projects fields into a new relation. Under the hood this is just a loop over each tuple in the data. Use it, for example, to return only a subset of the fields or to compute a new value, such as arithmetic between fields.
GROUP
Groups together tuples that have the same group key; the group key can be a single field or multiple fields (enclose multiple fields in parentheses). The result of a GROUP is a relation that contains one tuple per group. Each tuple has two fields: group (the key) and a bag, named after the original relation, that holds every tuple with that key.
Example:
data =
(1,2,3)
(4,5,6)
(7,8,9)
(4,3,2)
myGroup = group data by f1;
Result:
(1,{(1,2,3)})
(4,{(4,5,6),(4,3,2)})
(7,{(7,8,9)})
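GROUP is usually followed by a FOREACH that turns each bag into summary values. The sketch below continues the example, assuming the four tuples sit in a hypothetical comma-delimited file data.txt.

```pig
-- data.txt is a made-up file holding the four example tuples, one per line (1,2,3 etc.).
data = LOAD 'data.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
myGroup = GROUP data BY f1;

-- Each tuple of myGroup is (group, data): the key and a bag of the matching tuples.
-- FOREACH projects one summary tuple per group.
sums = FOREACH myGroup GENERATE group AS f1, COUNT(data) AS n, SUM(data.f3) AS total_f3;
DUMP sums;
-- For the example data, sums contains the tuples (1,1,3), (4,2,8) and (7,1,9).
```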
COGROUP
Same as the GROUP operator, but by convention used when grouping multiple (up to 127) relations at the same time. The results are similar to GROUP, except that each resulting tuple has one group field followed by one field for each relation being cogrouped. So if we cogroup two relations, each resulting tuple is (group, bag for relation1, bag for relation2), where the bags hold tuples just like with the GROUP operator.
Dereference
Allows us to reference a field inside a tuple or bag (tuple.field or bag.field) that is outside the scope of the current operator. This can be used with the FOREACH operator.
DISTINCT
Removes duplicate tuples found in a relation
UNION
Merges the contents of two or more relations.
NOTE: Unlike in SQL, the relations do NOT have to have the same number of fields.
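COGROUP, dereferencing, DISTINCT, and UNION can be sketched together on two small relations. The file names and field names are invented for the example.

```pig
-- Hypothetical inputs: who owns which pet, and which friend likes which pet.
owners = LOAD 'owners.txt' AS (person:chararray, pet:chararray);
friends = LOAD 'friends.txt' AS (name:chararray, pet:chararray);

-- COGROUP: one tuple per pet, shaped (group, {owners tuples}, {friends tuples}).
by_pet = COGROUP owners BY pet, friends BY pet;

-- Dereference: bag.field pulls a single field out of each bag.
names = FOREACH by_pet GENERATE group AS pet, owners.person, friends.name;

-- DISTINCT: remove duplicate tuples.
pets = FOREACH owners GENERATE pet;
unique_pets = DISTINCT pets;

-- UNION: merge the contents of the two relations.
everyone = UNION owners, friends;
```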
SPLIT
Partitions a relation into two or more relations based on some condition
CROSS
Computes the cross product of two or more relations
JOIN / INNER
Performs a join (equijoin) on two or more relations using one or more common field values. Like a SQL join.
JOIN / OUTER (full, right, left)
Performs a join on two relations using one or more common fields. Works like you would expect if you are familiar with SQL outer joins.
Evaluation Functions
Requires GROUP ALL or GROUP BY
- COUNT - Counts the number of elements in a bag, ignoring nulls
- COUNT_STAR - Counts the number of elements in a bag, including nulls
- MAX - Computes the maximum value in a single-column bag
- MIN - Computes the minimum value in a single-column bag
- SUM - Computes the sum of the numeric values in a single-column bag
- AVG - Computes the average of the numeric values in a single-column bag
Do NOT require GROUP ALL or GROUP BY
- CONCAT - Concatenates two columns
- DIFF - Compares two fields in a tuple
- IsEmpty - Checks if a bag or map is empty
- SIZE - Computes the number of elements based on any Pig data type
- TOKENIZE - splits a string and outputs a bag of words
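The difference between the two groups of functions shows up in a word-count style sketch. The input file is hypothetical, and FLATTEN (not covered above) is Pig's operator for un-nesting the bag that TOKENIZE returns.

```pig
-- Hypothetical input: one line of text per record.
lines = LOAD 'input.txt' AS (line:chararray);

-- No grouping needed: TOKENIZE and SIZE are applied to each tuple directly.
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
sized = FOREACH words GENERATE word, SIZE(word) AS len;

-- Aggregates need a bag, which GROUP BY provides per key...
by_word = GROUP sized BY word;
counts = FOREACH by_word GENERATE group AS word, COUNT(sized) AS occurrences;

-- ...and GROUP ALL provides for the whole relation.
everything = GROUP sized ALL;
stats = FOREACH everything GENERATE COUNT(sized) AS total_words, MAX(sized.len) AS longest, AVG(sized.len) AS avg_len;
```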
Math Functions
Based on the Java Math class
Samples: ABS, CEIL, etc.
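A brief sketch of a few math functions applied per tuple. The input file and field name are assumptions; FLOOR, ROUND, and SQRT are further built-ins beyond the two samples named above.

```pig
-- Hypothetical input: one number per line.
nums = LOAD 'numbers.txt' AS (x:double);

-- Each function is evaluated once per tuple.
results = FOREACH nums GENERATE x, ABS(x), CEIL(x), FLOOR(x), ROUND(x), SQRT(ABS(x));
```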
String Functions
Based on the Java String class
Samples: STRSPLIT, SUBSTRING, REPLACE, REGEX_EXTRACT, REGEX_EXTRACT_ALL, etc.
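The string functions follow the same per-tuple pattern, as in this sketch. The file and field names are invented for illustration.

```pig
-- Hypothetical input with a name and an email address per record.
people = LOAD 'people.txt' AS (name:chararray, email:chararray);

cleaned = FOREACH people GENERATE
    SUBSTRING(name, 0, 3) AS prefix,                  -- characters at index 0, 1 and 2
    REPLACE(name, ' ', '_') AS underscored,           -- second argument is a regular expression
    STRSPLIT(email, '@', 2) AS parts,                 -- tuple of at most two pieces
    REGEX_EXTRACT(email, '(.*)@(.*)', 2) AS domain;   -- second capture group
```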
Tuple, Bag, and Map Functions
- TOTUPLE - converts one or more expressions to tuple
- TOBAG - converts one or more expressions to type bag
- TOMAP - converts pairs of expressions into a map
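A minimal sketch of the three conversion functions, reusing the placeholder path and schema from the loading section.

```pig
A = LOAD '/datadir/datafile' AS (f1:int, f2:chararray, f3:float);

B = FOREACH A GENERATE
    TOTUPLE(f1, f2) AS pair,                   -- (f1, f2)
    TOBAG(f1, f1 * 2) AS doubled,              -- {(f1), (f1 * 2)}
    TOMAP('name', f2, 'score', f3) AS props;   -- [name#f2, score#f3]
```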
External Type Operators
- MAPREDUCE - Executes native MapReduce jobs inside a Pig script
- STREAM - Sends data to an external script or program
- REGISTER - Registers a JAR file so that the UDFs in the file can be used.
- fs - invokes any FsShell command from within a script or the Grunt shell
- exec - runs a Pig script from within the Grunt shell, e.g. grunt> exec myscript.pig
- EXPLAIN - displays the execution plan. Used to review the logical, physical, and MapReduce execution plans
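Several of these can be sketched in one script. The JAR name myudfs.jar and the UDF myudfs.Clean are hypothetical, and the STREAM line assumes the Unix cut program is available on the nodes that run the job.

```pig
-- REGISTER makes the UDFs in a JAR available (hypothetical JAR and UDF names).
REGISTER myudfs.jar;
A = LOAD '/datadir/datafile' AS (f1:int, f2:chararray, f3:float);
B = FOREACH A GENERATE myudfs.Clean(f2);

-- STREAM sends each tuple through an external program, here keeping the second field.
C = STREAM A THROUGH `cut -f 2`;

-- fs runs a file system shell command.
fs -ls /datadir

-- EXPLAIN prints the logical, physical, and MapReduce plans for a relation.
EXPLAIN B;
```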
References:
Pig Latin Basics - great reference
Introduction to PIG at the Big Data University - nice free training class. Nearly all the information above is from this class; in some cases it is copied.