Difference between a theta join, equijoin and natural join

Solution 1:

A theta join allows for arbitrary comparison relationships (such as ≥).

An equijoin is a theta join using the equality operator.

A natural join is an equijoin on attributes that have the same name in each relationship.

Additionally, a natural join removes the duplicate columns involved in the equality comparison so only 1 of each compared column remains; in rough relational algebraic terms: ⋈ = πR,S-as ○ ⋈aR=aS

Solution 2:

While the answers explaining the exact differences are fine, I want to show how the relational algebra is transformed to SQL and what the actual value of the 3 concepts is.

The key concept in your question is the idea of a join. To understand a join you need to understand a Cartesian Product (the example is based on SQL where the equivalent is called a cross join as onedaywhen points out);

This isn't very useful in practice. Consider this example.

Product(PName, Price)
====================
Laptop,   1500
Car,      20000
Airplane, 3000000


Component(PName, CName, Cost)
=============================
Laptop, CPU,    500
Laptop, hdd,    300
Laptop, case,   700
Car,    wheels, 1000

The Cartesian product Product x Component will be - bellow or sql fiddle. You can see there are 12 rows = 3 x 4. Obviously, rows like "Laptop" with "wheels" have no meaning, this is why in practice the Cartesian product is rarely used.

|    PNAME |   PRICE |  CNAME | COST |
--------------------------------------
|   Laptop |    1500 |    CPU |  500 |
|   Laptop |    1500 |    hdd |  300 |
|   Laptop |    1500 |   case |  700 |
|   Laptop |    1500 | wheels | 1000 |
|      Car |   20000 |    CPU |  500 |
|      Car |   20000 |    hdd |  300 |
|      Car |   20000 |   case |  700 |
|      Car |   20000 | wheels | 1000 |
| Airplane | 3000000 |    CPU |  500 |
| Airplane | 3000000 |    hdd |  300 |
| Airplane | 3000000 |   case |  700 |
| Airplane | 3000000 | wheels | 1000 |

JOINs are here to add more value to these products. What we really want is to "join" the product with its associated components, because each component belongs to a product. The way to do this is with a join:

Product JOIN Component ON Pname

The associated SQL query would be like this (you can play with all the examples here)

SELECT *
FROM Product
JOIN Component
  ON Product.Pname = Component.Pname

and the result:

|  PNAME | PRICE |  CNAME | COST |
----------------------------------
| Laptop |  1500 |    CPU |  500 |
| Laptop |  1500 |    hdd |  300 |
| Laptop |  1500 |   case |  700 |
|    Car | 20000 | wheels | 1000 |

Notice that the result has only 4 rows, because the Laptop has 3 components, the Car has 1 and the Airplane none. This is much more useful.

Getting back to your questions, all the joins you ask about are variations of the JOIN I just showed:

Natural Join = the join (the ON clause) is made on all columns with the same name; it removes duplicate columns from the result, as opposed to all other joins; most DBMS (database systems created by various vendors such as Microsoft's SQL Server, Oracle's MySQL etc. ) don't even bother supporting this, it is just bad practice (or purposely chose not to implement it). Imagine that a developer comes and changes the name of the second column in Product from Price to Cost. Then all the natural joins would be done on PName AND on Cost, resulting in 0 rows since no numbers match.

Theta Join = this is the general join everybody uses because it allows you to specify the condition (the ON clause in SQL). You can join on pretty much any condition you like, for example on Products that have the first 2 letters similar, or that have a different price. In practice, this is rarely the case - in 95% of the cases you will join on an equality condition, which leads us to:

Equi Join = the most common one used in practice. The example above is an equi join. Databases are optimized for this type of joins! The oposite of an equi join is a non-equi join, i.e. when you join on a condition other than "=". Databases are not optimized for this! Both of them are subsets of the general theta join. The natural join is also a theta join but the condition (the theta) is implicit.

Source of information: university + certified SQL Server developer + recently completed the MOO "Introduction to databases" from Stanford so I dare say I have relational algebra fresh in mind.