Pig, often referred to as Pig Latin, was originally conceived as a high-level abstraction for writing MapReduce scripts in a streamlined way. Its core intent is to offer a user-friendly way to develop powerful analytical queries in a structured format.
Polypheny implements an adapted version of Apache Pig that substitutes database entities for the files used traditionally. As a consequence of this modification, Pig’s capabilities within Polypheny are confined to querying; there is no provision for materializing the retrieved values into files.
Fundamentals
Pig can be used directly through the Polypheny-UI console. In addition, Polypheny also offers an HTTP Interface capable of handling Pig queries. After activating this interface, you can submit a Pig query using an HTTP POST request to the designated port (default: 13137) and the associated route (/pig). The query is embedded in the body of the POST request, and the output is returned as JSON.
Pig in Polypheny operates on database entities instead of physical files. As such, a standard Pig query like
A = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING PigStorage(',') as (id:int,name:chararray,city:chararray);
is simplified in Polypheny to
A = LOAD 'student';
(given that the ‘student’ entity contains the same data as the student.txt file).
Strings and Identifiers
Pig in Polypheny employs either single (') or double (") quotation marks to denote strings and identifiers.
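For instance, the following two FILTER statements are equivalent. This is a minimal sketch that reuses the ‘student’ entity and its ‘name’ field from the examples below; ‘Alice’ is a placeholder value:
A = LOAD 'student';
B = FILTER A BY name == 'Alice';
C = FILTER A BY name == "Alice";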
Query Execution Order
Polypheny’s implementation of Pig processes a single query or a chain of linked queries at once; the result set associated with the last variable is what is retrieved.
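For example, in the following chain the content of C, the last variable defined, is returned (a small sketch assuming the ‘student’ entity and ‘age’ field used throughout this page):
A = LOAD 'student';
B = FILTER A BY age >= 18;
C = LIMIT B 10;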
Default Schema
Currently, Pig only supports querying the default schema, typically named “public”. Support for other schemas might be added in future updates.
Using Pig Latin in Polypheny
Pig Latin is a high-level language that is designed for processing and analyzing large datasets. In Polypheny, we have adapted Pig Latin to provide a powerful toolset for manipulating your data in a more intuitive and less cumbersome way than raw SQL.
Basic Structure of a Pig Latin Query
The essential structure of a Pig Latin query is a series of data transformations applied on structured input data to produce the desired output. Each step in a Pig Latin script involves a transformation that consumes one or more inputs and generates an output which is used in the next step of the pipeline.
A = LOAD 'student';
B = FILTER A BY age >= 18;
DUMP B;
In the above script, the LOAD operation reads data from the ‘student’ entity and assigns it to the relation A. The FILTER operation then filters this data to only include students who are 18 years or older and stores the result in relation B. Finally, DUMP prints the contents of relation B.
Variables and Relations
In Pig Latin, variables are used to store relations, where a relation is a bag (collection) of tuples (rows) and each tuple can contain multiple fields (values).
A = LOAD 'student';
In this example, A is a variable that holds the result of the LOAD operation.
Variables in Pig Latin have several important characteristics:
- Immutability: Variables are immutable. Once assigned a relation, it cannot be re-assigned. However, you can use the variable in further operations to create new relations and variables.
A = LOAD 'student';
B = FILTER A BY age >= 18;
In this example, the variable A remains unchanged by the FILTER operation. Instead, a new variable B is created to hold the result.
- Scoping: Variables in Pig Latin are global and can be used anywhere in the script following their declaration.
- Order of Declaration: The order of variable declaration is important. A variable needs to be declared before it can be used in a script, as illustrated below.
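A short sketch of both points, reusing the ‘student’ entity and fields from the surrounding examples (assuming Pig Latin’s usual -- comment syntax is accepted):
A = LOAD 'student';        -- A must be defined before it is referenced
B = FILTER A BY age >= 18; -- creates a new relation; A itself is unchanged
C = ORDER A BY name ASC;   -- A is global, so it can be reused here
DUMP C;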
Execution and Result Retrieval
In Polypheny’s implementation of Pig Latin, a script is completed either by a DUMP operation or by returning the most recently defined variable as the result set.
A = LOAD 'student';
DUMP A;
In the above script, DUMP A prints the content of variable A to the console. If there is no DUMP in the script, the content of the most recently defined variable is returned as the result set.
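A script without a DUMP behaves the same way; in the following sketch the content of B is returned because B is the most recently defined variable:
A = LOAD 'student';
B = FILTER A BY age >= 18;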
Supported Operations
The version of Pig Latin supported in Polypheny contains a subset of all Pig operations. The following operations and transformations are supported:
LOAD
The LOAD operation is used to read a database entity into a relation. A relation, represented by a variable in Pig Latin, stores the contents of the database entity.
Example:
A = LOAD 'student';
In this example, A is a relation that holds the contents of the ‘student’ entity.
DISTINCT
The DISTINCT operation is used to remove duplicate rows from a relation. It outputs a new relation containing only the unique tuples from the input relation.
Example:
A = LOAD 'student';
B = DISTINCT A;
In this example, B is a new relation containing the unique tuples from relation A.
LIMIT
The LIMIT operation is used to restrict the number of output rows. This can be useful when dealing with large data sets where only a certain number of records are needed.
Example:
A = LOAD 'student';
B = LIMIT A 10;
In this example, B is a new relation containing the first 10 tuples from relation A.
ORDER BY
The ORDER BY operation is used to sort the data in a relation based on a specific field, in either ascending (ASC) or descending (DSC) order.
Example:
A = LOAD 'student';
B = ORDER A BY name ASC;
In this example, B is a new relation containing the tuples from relation A, sorted by the ‘name’ field in ascending order.
FOREACH GENERATE
The FOREACH GENERATE operation applies transformations to each tuple of a relation and generates a new relation. It can be used to select specific fields from a relation.
Example:
A = LOAD 'student';
B = FOREACH A GENERATE name;
In this example, B is a new relation containing only the ‘name’ field from each tuple in relation A.
FILTER
The FILTER operation is used to filter the data in a relation based on a specified condition. The condition can test for equality (==), less than (<), greater than (>), less than or equal to (<=), greater than or equal to (>=), or not equal to (!=).
Example:
A = LOAD 'student';
B = FILTER A BY age > 18;
In this example, B is a new relation containing only the tuples from relation A where the ‘age’ field is greater than 18.
GROUP BY
The GROUP BY operation is used to group the data in a relation based on a specific field. This can be useful for subsequent aggregate operations.
Example:
A = LOAD 'student';
B = GROUP A BY age;
In this example, B is a new relation where the tuples from relation A are grouped by the ‘age’ field.
Example Queries
Assume we have a “public” schema with the entities “employee” and “dept” defined as follows:
CREATE TABLE public.dept(
deptno TINYINT NOT NULL,
deptname VARCHAR(30) NOT NULL,
PRIMARY KEY (deptno)
);
CREATE TABLE public.employee(
empno INTEGER NOT NULL,
empname VARCHAR(20) NOT NULL,
salary REAL NOT NULL,
deptno TINYINT NOT NULL,
married BOOLEAN NOT NULL,
dob DATE NOT NULL,
joining_date DATE NOT NULL,
PRIMARY KEY (empno)
);
To find an employee named “Loki”:
A = LOAD 'employee';
B = FILTER A BY empname == 'Loki';
To find the departments named HR or IT:
A = LOAD 'dept';
B = FILTER A BY deptname == 'HR' OR deptname == "IT";
To find all departments other than HR:
A = LOAD 'dept';
B = FILTER A BY NOT(deptname == 'HR');
To get all employees sorted by date of birth and grouped by their marital status:
A = LOAD 'employee';
B = ORDER A BY dob ASC;
C = GROUP B BY married;
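Combining the operations above, a sketch of a query that returns the names of up to five employees earning more than 50000 (the salary threshold is an arbitrary placeholder):
A = LOAD 'employee';
B = FILTER A BY salary > 50000;
C = FOREACH B GENERATE empname;
D = LIMIT C 5;
DUMP D;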