Advanced SQL Interview Questions and Answers
Here are some complex SQL interview problems that are for people who are looking for more advanced and challenging questions, along with the answers and complete explanations. Try to figure out the answer to the questions yourself before reading the answers.
Suppose we have 2 tables called Orders and Salesperson shown below:
Salesperson | Orders | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
Now suppose that we want to write SQL that must conform to the SQL standard.
We want to retrieve the names of all salespeople that have more than 1 order from the tables above. You can assume that each salesperson only has one ID.
If that is the case, then what (if anything) is wrong with the following SQL?:
SELECT Name FROM Orders, Salesperson WHERE Orders.salesperson_id = Salesperson.ID GROUP BY salesperson_id HAVING COUNT( salesperson_id ) >1
The answer and explanation to advanced SQL question 1
There is definitely something wrong with the SQL above, and it is probably something that most beginner SQL programmers may not notice. The problem is that the SQL Standard says that we can not select a column that is not part of the group by clause unless it is also contained within an aggregate function. If we try to run the SQL above in SQL Server, we would get an error that looks like this:
Column 'Name' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
You might be confused now, so let’s explain what that error means in plain English and through some simple examples. The most important thing you should take out of this discussion is understanding exactly why we get that error, and how to avoid it. There is a good reason for the error – read on to understand why.
You can see in the bad SQL above that the “Name” column is clearly not also a part of the group by statement, nor is it contained within an aggregate function (like SUM, MAX, etc).
As the error above suggests, we can fix the error by either wrapping the Name column inside an aggregate function or adding it to the Group By clause.
So if we want to write SQL that complies with the standard, then we could write something like this by adding the Name column to the Group By:
SELECT Name
FROM Orders, Salesperson
WHERE Orders.salesperson_id = Salesperson.ID
GROUP BY salesperson_id, Name
-- we added the name column to the group by, and now it works!
HAVING COUNT( salesperson_id ) >1
The SQL above will run just fine without giving any error.
We could also fix the problem by putting the Name column in any aggregate function, and then simply make that a part of our select statement. So, we could just write this SQL instead, and it would be perfectly legal according to the SQL standard. We chose to use the MAX aggregate function, but any other aggregate would work just fine as well:
SELECT MAX(Name) --put name in an aggregate function
FROM Orders, Salesperson
WHERE Orders.salesperson_id = Salesperson.ID
GROUP BY salesperson_id
HAVING COUNT( salesperson_id ) >1
Adding the Name column to the group by, or wrapping the Name column in an aggregate will certainly fix the error – but it’s very important to note that both of those things will change the data that is returned to a state that you may not want.
So, now you understand how to fix the error – but do you understand why it is a problem in the first place? Well, you should – because that is the most important thing to understand! So, let’s explain some more about why SQL gives that error shown above .
First off, let’s talk a little bit more about aggregate functions. You probably know what aggregate functions in SQL are – we used one in the example above. In case you forgot, aggregate functions are used to perform a mathematical function on the values inside a given column, which is passed into the aggregate function. Here are some of the commonly used aggregate functions:
AVG() - Returns the average value COUNT() - Returns the number of rows FIRST() - Returns the first value LAST() - Returns the last value MAX() - Returns the largest value MIN() - Returns the smallest value SUM() - Returns the sum
To illustrate why the SQL standard says that a selected column has to be in the group by clause or part of an aggregate function, let’s use another example. Suppose we have some tables called Starbucks_Stores and Starbucks_Employees. In case you don’t already know, Starbucks is a popular coffee shop/cafe in the USA:
Starbucks_Employees | Starbucks_Stores | |||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
Now, given the tables above let’s say that we write some SQL like this:
SELECT count(*) as num_employees, HourlyRate FROM Starbucks_Employees JOIN Starbucks_Stores ON Starbucks_Employees.StoreID = Starbucks_Stores.store_id GROUP BY city
It looks like the SQL above would just return the number of Starbucks employees in each city, along with the HourlyRate – because it will group the employees based on whatever city they work in (thanks to the “group by city” statement).
The problem with selecting a non-aggregate column that is not in the group by
But the real question here is what exactly would be returned for the HourlyRate in the SQL above? Would it return every employee’s hourly rate separated by commas? Since we group by city, will it return the highest hourly rate for each city? Will it return the hourly rate as a distinct list, so those 2 guys making 11 dollars an hour will have the 11 returned only once?
The problem here is that we do not know what will be returned because we are not specific enough with what we are asking for in the SQL! If what we are asking for is not specific enough, then the SQL processor will not know what to return.
This is why almost all database implementations return an error when the SQL above is run (with the notable exception of MySQL) – and this is why the SQL does not conform to the Standard. In SQL Server running the SQL above will return the same error that we showed earlier.
Let’s explain even further in case the problem with that SQL is not crystal clear. The order of operations in which things will happen with the SQL above is:
1. The 2 tables are joined on the condition that the Starbucks_Employees.StoreID column value is equal to the Starbucks_Stores.store_id column values. 2. Groups are then created for each city - which means that each distinct city will have it's own "group". So, there will be a total of 3 groups one each for San Francisco, New York, and Los Angeles. 3. The data we are interested in is selected from each group that is created in step 2.
Because we end up with different groups based on the city, when we select a count(*), that will find the total count of rows in each and every group. But, the problem is that when we select HourlyRate, there will be multiple values for the HourlyRate within each group. For example, for the group created by the city of San Francisco there will be 4 different values for the HourlyRate – 14, 10, 11, and 13.
So the question is which value of the HourlyRate should be selected from each group? Well, it could be any one of those values – which is why that SQL results in an error. This is because what we are asking for is NOT specific enough – hopefully this is crystal clear now to you.
If the same HourlyRate were part of an aggregate function like MAX then it would simply return the highest HourlyRate within each group. And that is why having an aggregate function would fix the SQL error – because only one value will be selected from any given group.
So, this SQL is perfectly fine because we are more specific in what we ask for – but this SQL would only work for you if you actually want the highest HourlyRate for each city:
SELECT count(*) as num_employees, MAX(HourlyRate) FROM Starbucks_Employees JOIN Starbucks_Stores ON Starbucks_Employees.StoreID = Starbucks_Stores.store_id GROUP BY city
Fix the error by adding column to the group clause
Another way to fix the error is to simply add the HourlyRate column to the group by clause. This also means that having the HourlyRate column wrapped in aggregate function is no longer necessary. So you could write some SQL like this and it would fix the error:
SELECT count(*) as num_employees, HourlyRate FROM Starbucks_Employees JOIN Starbucks_Stores ON Starbucks_Employees.StoreID = Starbucks_Stores.store_id GROUP BY city, HourlyRate
This would then create groups based on the unique combination of the values in the HourlyRate and City columns. This means that there will be a different group for each HourlyRate and City combination – so $11, San Francisco and $11, Los Angeles will be 2 different groups. If you need to read up more on this topic then you can go here:
With the SQL above, each group will only have one value for the HourlyRate, which also means that there will be no ambiguity or confusion when selecting the HourlyRate since there is only possible value to select. It is now very clear that one and only one HourlyRate value can be returned for each group.
Adding the column to the group by clause fixes the error but will alter the data that is returned
But, one very important thing to note is that even though adding the column to the group by will fix the error, it will also change the groups that are created. This means that the data returned will be completely different from what was returned before. So, the count(*) function will no longer return the count of employees in a given city, and will instead return the number of rows in each group created by the unique combination of the HourlyRate and city columns.
MySQL – selecting non-aggregate columns not in the group by
One very important thing that you should know is that MySQL actually allows you to have non-aggregated columns in the select list even if they are not a part of the group by clause (a quick side note: a non-aggregated column is simply a column that is not wrapped within an aggregate function). What this means is that you will not receive an error if you try to run any of the “bad” SQL above in MySQL. The reason it is allowed in MySQL is because MySQL assumes that you know what you are doing – and it does actually make sense in some scenarios. For instance, let’s refer back to the SQL that we started with:
SELECT Name FROM Orders, Salesperson WHERE Orders.salesperson_id = Salesperson.ID GROUP BY salesperson_id HAVING COUNT( salesperson_id ) >1
The reason the original SQL code (presented above) works just fine in MySQL is because there is a 1 to 1 mapping of salesperson name to ID – meaning that for every unique salesperson ID there is only one possible name. Another way of saying that is that each salesperson can only have one name. So when we create groups (which is done in the “GROUP BY salesperson_id”) based on the salesperson ID, each group will only have one and only one name.
This SQL will also run just fine in MySQL without returning an error:
SELECT count(*) as num_employees, HourlyRate FROM Starbucks_Employees JOIN Starbucks_Stores ON Starbucks_Employees.StoreID = Starbucks_Stores.store_id GROUP BY city
But, even though the code above will not return an error, the HourlyRate that is returned by MySQL will be some arbitrary (random) value within each group. This is because when we create each group based on the city, each group can have different values for the HourlyRate.
In other words, there is no one to one mapping between the HourlyRate and the city like we had before with the salesperson ID and the name. So, because we are not being specific as to which HourlyRate we want, MySQL will return an arbitrary value . For instance, in the group created by the city of San Francisco, MySQL could return the HourlyRate for any employee who works in San Francisco – whether it is 14, 10, 11, or 13 we don’t really know since it is arbitrary/random in MySQL.
That concludes part 1 of our more difficult and complex SQL questions. Click on next to check out the next question that’s a part of our advanced SQL interview questions list.
The problem is based on the tables presented below where salespeople have orders with certain customers that are in the Customers table.
Salesperson | Customer | ||||||||||||||||||||||||||||||||||||||||||||||||
|
|
Orders | ||||||||||||||||||||||||||||||||||||||||
|
In the tables above, each order in the Orders table is associated with a given Customer through the cust_id foreign key column that references the ID column in the Customer table.
Here is the problem: find the largest order amount for each salesperson and the associated order number, along with the customer to whom that order belongs to. You can present your answer in any database’s SQL – MySQL, Microsoft SQL Server, Oracle, etc.
The answer to the problem and explanation
This question seems to be quite simple – but as you will soon find out it is deceptively complex. For each salesperson, all we need to retrieve is the largest order amount, and the associated order number. In order to retrieve that information we should be able to simply do a join between the Orders and Salesperson tables wherever the Salesperson.ID is equal to the Orders.salesperson_id (this would be our join predicate). Then, we could group the results of that join by the Orders.salesperson_id column and retrieve both the highest valued order (by using max(Amount)), and the associated Order number.
Let’s say that we choose to write our answer in MySQL. In MySQL we could legally write some code that looks like this:
SELECT Orders.Number, max(Amount) FROM Orders JOIN Salesperson ON Salesperson.ID = Orders.salesperson_id GROUP BY Orders.salesperson_id
And, if we run that code above it will return this as a result:
Number max(Amount) 30 460 10 2400 50 720 20 1800
The problem with the data returned in MySQL
But, there’s a problem with the results returned from running the SQL above, and it should be fairly obvious once you actually look at the data in the tables we have above. Here is the problem: Order number 50 does not have an amount of “720? – that amount actually belongs to order number 60. So, what is going on here? Why do the results return an order number that is not even in the same row as the max(Amount) of 720? And why are all of the other results correct?
Understanding the group by statement is critical
Well, we will have to explain a bit more about what’s going on with the group by. You should understand exactly what the problem is with the SQL above, and you can safely skip down to the section that says “New approach to the problem – start with a subquery”. Unless you want to reinforce the concepts presented in part 1, in which case we highly recommend that you read this entire explanation to this rather difficult interview question.
When we group by the salesperson ID, there will be one group created for each and every salesperson ID. So, there will be 6 groups created – 1 for ID of 1, another for ID of 2, and others for ID’s 5, 7, 8, and 11. Inside those groups will be any rows that share the same salesperson ID values.
When we select the max(Amount), MySQL will simply look for the highest value for Amount within each group and return that value. And when we select Orders.Number, MySQL is not going to return every Orders.Number value from each group – it is only going to select one value from each group.
Our SQL is not specific enough
But, the question is which order number should be returned from each group? Each group can potentially have more than just one order number as long as there are more than one rows belonging to the group. And that is the exact problem – the SQL that we wrote is not specific enough, and MySQL will just arbitrarily/randomly return one of the values of the Orders.Number within each group. In this case, because order number 50 is part of the group created by the salesperson_id’s of 7, it will return 50. MySQL could just as well have returned order numbers 60 or 70 – the point is that it just randomly chooses one order number from each group. For the group created by salesperson ID of 2, the fact that the order number 10 is chosen (order number 10 corresponds to the largest order amount of 2400) is just pure coincidence – MySQL could have returned us order number 40, which is also a part of the same group as salesperson ID of 2.
Most relational database implementations would have thrown an error if we tried to run the SQL above because the results are potentially arbitrary, as we just illustrated. MySQL is the exception, because it allows us to run the SQL above error-free, but as we illustrated the data returned could potentially not make any sense.
Well, now we know that there is definitely an issue with the SQL above, so how can we write a good query that would give us exactly what we want – along with the correct order number?
New approach to the problem – start with a subquery
Now let’s instead just try to break the problem down into more manageable pieces – starting with a simple subquery. Here is a subquery to get the highest valued order for each salesperson:
SELECT salesperson_id, MAX(Amount) AS MaxOrder FROM Orders GROUP BY salesperson_id
Running the query above will return this:
salesperson_id | MaxOrder |
---|---|
1 | 460 |
2 | 2400 |
7 | 720 |
8 | 1800 |
The query above gives us the salesperson_id and that salesperson’s associated highest order amount – but it still does not give us the order number associated with the highest order amount. So, how can we find the order number as well?
Clearly we need to do something else with the subquery we have above that will also give us the correct order number. What are our options? Try to come up with an answer on your own before reading on.
Well, we can do a join with the results of the subquery above. But, on what condition should our join be done and what exactly should we be joining the subquery above with?
What if we join our subquery above with data from the Orders table, where the join is done on the basis that the salesperson_id matches, AND that the value in the Order table’s Amount column is equal to the amount (MaxOrder) returned from the subquery? This way, we can match up the correct Order Number with the correct corresponding value for the maximum Order Amount for a given salesperson_id.
With that in mind, we can write this query:
select salesperson_id, Number as OrderNum, Amount from Orders JOIN ( -- this is our subquery from above: SELECT salesperson_id, MAX(Amount) AS MaxOrder FROM Orders GROUP BY salesperson_id ) as TopOrderAmountsPerSalesperson USING (salesperson_id) where Amount = MaxOrder
Running the query above returns us this:
salesperson_id | OrderNum | Amount |
---|---|---|
8 | 20 | 1800 |
1 | 30 | 460 |
2 | 10 | 2400 |
7 | 60 | 720 |
How does the query work exactly?
How does the query above work exactly? It’s actually pretty simple. First, the subquery (which is basically a derived table here, named TopOrderAmountsPerSalesperson) returns the orders with the highest dollar amounts per salesperson, and the associated salesperson ID. So, now we have each salesperson’s highest valued order and his/her ID in a derived table. That derived table (the results from the subquery) is then joined with the entire Orders table on the condition that the salesperson ID matches and that the Amount from the Orders table matches the MaxOrder amount returned from the derived table. What’s the point of this? Well, that join will give us the correct OrderNumber since it is matching on both the salesperson ID and the amount. Even if there are 2 rows with the same exact salesperson ID and amount it will not even matter because no matter which ordernumber is associated with that row, the result set will be exactly the same.
And remember that the whole reason we are doing this is to avoid the original problem with not being able to select a non-aggregated column with a group by.
Now, retrieving the salesperson name is simple. Try to figure it out on your own.
Here is how we retrieve the salesperson name – we just use another join with the Salesperson table and select the Name:
SELECT salesperson_id, Name, Orders.Number AS OrderNumber, Orders.Amount FROM Orders JOIN Salesperson ON Salesperson.ID = Orders.salesperson_id JOIN ( SELECT salesperson_id, MAX( Amount ) AS MaxOrder FROM Orders GROUP BY salesperson_id ) AS TopOrderAmountsPerSalesperson USING ( salesperson_id ) WHERE Amount = MaxOrder
Running the query above returns this:
salesperson_id | Name | OrderNumber | Amount |
---|---|---|---|
1 | Abe | 30 | 460 |
2 | Bob | 10 | 2400 |
7 | Dan | 60 | 720 |
8 | Ken | 20 | 1800 |
And, finally we have our answer! But one last thing – let’s check for corner cases. What would happen if we add one more row to the table where a given salesperson has 2 or more orders that have the same value for the highest amount? For example, let’s add this row to the Orders table:
Number | order_date | cust_id | salesperson_id | Amount |
---|---|---|---|---|
80 | 02/19/94 | 7 | 2 | 2400 |
This now means that the salesperson with an ID of 2 has 2 orders with an amount of 2400 in the Orders table. And, if we run the SQL above again, we will get this as a result (note the extra row for Bob):
salesperson_id | Name | OrderNumber | Amount |
---|---|---|---|
1 | Abe | 30 | 460 |
2 | Bob | 40 | 2400 |
7 | Dan | 60 | 720 |
8 | Ken | 20 | 1800 |
2 | Bob | 80 | 2400 |
Now, the question is if we only want one of Bob’s orders to show up, how can we eliminate the duplicate?
Again, try to figure this out on your own before reading our answer.
Again, try to figure this out on your own before reading our answer.
Well, we could add a GROUP BY salesperson_id, Amount to the end of the query, which would create separate groups for each unique combination of the salesperson ID and the Amount. This would give us a query that looks like this:
SELECT salesperson_id, Salesperson.Name, Number AS OrderNumber, Amount FROM Orders JOIN Salesperson ON Salesperson.ID = Orders.salesperson_id JOIN ( SELECT salesperson_id, MAX( Amount ) AS MaxOrder FROM Orders GROUP BY salesperson_id ) AS TopOrderAmountsPerSalesperson USING ( salesperson_id ) WHERE Amount = MaxOrder GROUP BY salesperson_id, Amount
Now, running this query even with the duplicate row in the Orders table would return us this:
salesperson_id | Name | OrderNumber | Amount |
---|---|---|---|
1 | Abe | 30 | 460 |
2 | Bob | 40 | 2400 |
7 | Dan | 60 | 720 |
8 | Ken | 20 | 1800 |