Optimising your LINQ usage

One of the great things about the .NET Framework is the ability to use LINQ on generic collections, but it is important to understand what the runtime is actually doing when you use it. One of the problems with LINQ is that it looks very similar to SQL, so it is easy for inexperienced developers to assume it is a black box that will be optimised somewhere along the way. This can be further confused when using LINQ to SQL, where SQL really does underlie the operations.

Unfortunately there is no query optimiser behind LINQ to Objects; it is essentially syntactic sugar over iterators, which amounts to foreach loops in the compiled CIL. I'm going to provide a few examples of optimisations that can be made very easily and dramatically improve your system performance.
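To make that concrete, here is a rough sketch of what enumerating source.Where(predicate) amounts to at runtime. This is not the actual library implementation (which uses dedicated iterator classes), and WhereLike is a made-up name for illustration:

```csharp
using System;
using System.Collections.Generic;

static class WhereSketch
{
    // Sketch of what LINQ's Where amounts to: a lazy iterator that makes
    // a single pass over the source, yielding only matching elements.
    public static IEnumerable<T> WhereLike<T>(this IEnumerable<T> source, Func<T, bool> predicate)
    {
        foreach (var item in source)
        {
            if (predicate(item))
                yield return item;
        }
    }
}

class Demo
{
    static void Main()
    {
        var evens = new[] { 1, 2, 3, 4 }.WhereLike(x => x % 2 == 0);
        Console.WriteLine(string.Join(",", evens)); // prints "2,4"
    }
}
```

The important point is that the result is lazy: nothing executes until the sequence is enumerated, and enumeration is a single pass.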

This post is about LINQ to Objects, or more specifically LINQ over in-memory enumerables. LINQ to SQL has its own optimisations, and there are plenty of articles on the internet about LINQ to SQL optimisation, so I'm not going to discuss it here.

Many sequential Where operations

See the following example of some inefficient LINQ usage.

List<Employee> employees = GetSomeEmployeesOrSomething();
var employeesToPay = employees.Where(employee => employee.IsPayrollToday())
                              .Where(employee => employee.PayrollEnabled)
                              .Where(employee => !employee.IsTerminated);

What actually happens here is subtler than it looks. Thanks to deferred execution, the three Where calls do not each make a complete pass over the list: each one wraps the previous result in a lazy iterator, and nothing runs until something enumerates employeesToPay. The filtering is therefore still a single O(n) pass, but every element now travels through three chained iterator objects and up to three separate delegate invocations, which is avoidable overhead on large collections.

An improvement to this is quite obvious.

List<Employee> employees = GetSomeEmployeesOrSomething();
var employeesToPay = employees.Where(employee => employee.IsPayrollToday() && employee.PayrollEnabled && !employee.IsTerminated);

Now there is a single iterator and a single delegate invocation per element, but we have committed possibly the worst crime of all - we have improved performance at the price of readability.

So a neater way to improve this might be to move it from a lambda expression to a named private method.

public void DoPayroll()
{
    var employees = GetSomeEmployeesOrSomething();
    var employeesToPay = employees.Where(EmployeePayrollIsDue);
}

private bool EmployeePayrollIsDue(Employee employee)
{
    return employee.IsPayrollToday() && 
          employee.PayrollEnabled && 
          !employee.IsTerminated;
}

Nested lookups

The operation above is quite straightforward and shouldn't come up too often. The main problem usages with LINQ come in the form of nested operations. The way to get around the redundant iterations depends on which LINQ method you have nested.

First

Take this basic foreach loop below.

List<Payroll> payrolls = GetAvailablePayrolls();
List<Employee> employees = GetEmployees();

foreach(var employee in employees){
    var thisEmployeesPayroll = payrolls.First(payroll => payroll.Type == employee.PayrollType);
    //Do something with the payroll
}

The payrolls will be iterated for each employee until a matching payroll type is found, so First is O(n) in the worst case. This is a huge overhead and very wasteful when dealing with large data sets. If you saw a nested foreach loop here you would identify the inefficiency immediately and scold the developer who wrote it, but in this form it is easy to miss.

The solution to this sort of problem is to create a hash map and use it to find the relevant payroll type for this employee. There is a handy LINQ method that even does this for us.

var payrollsByType = payrolls.ToDictionary(payroll => payroll.Type);

foreach(var employee in employees){
    var thisEmployeesPayroll = payrollsByType[employee.PayrollType];
    //Do something with the payroll
}

Here a hash map (called a Dictionary in .NET) is constructed once and used to find the payroll for each employee. Looking up an entry in the hash table is an O(1) operation on average, so each lookup costs the same regardless of how many payrolls there are.
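One caveat worth knowing: the dictionary indexer throws a KeyNotFoundException if an employee references a payroll type that has no entry. A minimal self-contained sketch (the Payroll class and PayrollType enum here are stand-ins for whatever your real types look like) using TryGetValue to handle the miss explicitly:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

enum PayrollType { Weekly, Monthly }

class Payroll
{
    public PayrollType Type { get; set; }
}

class Demo
{
    static void Main()
    {
        var payrolls = new List<Payroll> { new Payroll { Type = PayrollType.Weekly } };
        var payrollsByType = payrolls.ToDictionary(payroll => payroll.Type);

        // The indexer (payrollsByType[key]) throws KeyNotFoundException for a
        // missing key; TryGetValue lets you handle the miss explicitly.
        if (payrollsByType.TryGetValue(PayrollType.Monthly, out var payroll))
        {
            Console.WriteLine($"Found payroll of type {payroll.Type}");
        }
        else
        {
            Console.WriteLine("No payroll configured for this type"); // this branch runs
        }
    }
}
```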

Any/Contains

Another variant of this problem is when you are checking for the existence of an element in another collection.

List<Employee> employees = GetEmployees();
List<TerminationRecord> terminations = GetTerminations();

foreach(var employee in employees){
    var employeeIsTerminated = terminations.Any(termination => termination.EmployeeID == employee.ID); 
    if(employeeIsTerminated) continue;

    //do something...
}

Again, the list of termination records will be iterated (at least partially) for every employee.

This can be improved using a HashSet.

List<Employee> employees = GetEmployees();
List<TerminationRecord> terminations = GetTerminations();
HashSet<int> terminatedEmployeeIds = new HashSet<int>(terminations.Select(t => t.EmployeeID));

foreach(var employee in employees){
    var employeeIsTerminated = terminatedEmployeeIds.Contains(employee.ID); 
    if(employeeIsTerminated) continue;

    //do something...
}

You can initialise a HashSet<T> from any IEnumerable<T> (in this case T is int). There is no helper method quite like ToDictionary for hash sets, although newer versions of .NET do provide a ToHashSet extension method. You can also create an empty hash set and populate it with the Add method.
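For completeness, the manual Add approach looks like this (a trivial self-contained sketch using plain ints as stand-ins for the termination records):

```csharp
using System;
using System.Collections.Generic;

var terminationEmployeeIds = new List<int> { 1, 2, 2, 3 }; // stand-in for termination records
var terminatedEmployeeIds = new HashSet<int>();
foreach (var id in terminationEmployeeIds)
{
    // Add returns false (and changes nothing) when the ID is already present,
    // so duplicate termination records are handled for free.
    terminatedEmployeeIds.Add(id);
}
Console.WriteLine(terminatedEmployeeIds.Count); // prints 3
```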

Sub grouping using Where

This is probably the most commonly used and the worst performing of all the nested LINQ scenarios. It occurs when you want to select a subset of one list for each element of another (or the same) list. In the situation below we're sending all employees all of their payslips on record.

List<Payslip> allPayslips = GetAllPayslips();
List<Employee> allEmployees = GetAllEmployees();

foreach(var employee in allEmployees){
    var payslipsForThisEmployee = allPayslips.Where(payslip => payslip.EmployeeID == employee.ID);
    SendPayslipsToEmployee(employee, payslipsForThisEmployee);
}

This performs really badly: for each employee, every payslip is evaluated. Consider 1,000 employees and 10,000 payslips in the system - that's ten million iterations.

You can reduce this down to 1 iteration over all payslips and 1 iteration over all employees by using the ToLookup method.

List<Payslip> allPayslips = GetAllPayslips();
List<Employee> allEmployees = GetAllEmployees();
ILookup<int, Payslip> payslipsByEmployeeId = allPayslips.ToLookup(payslip => payslip.EmployeeID);

foreach(var employee in allEmployees){
    var payslipsForThisEmployee = payslipsByEmployeeId[employee.ID];
    SendPayslipsToEmployee(employee, payslipsForThisEmployee);
}

In the above example we've gone from ten million total iterations to roughly 11,000.
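One pleasant property of the lookup, worth knowing when you use this pattern: indexing an ILookup with a key that has no entries returns an empty sequence rather than throwing, so employees with no payslips need no special handling. A small self-contained sketch with ints standing in for payslip records:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var payslipEmployeeIds = new List<int> { 1, 1, 2 }; // stand-in for payslip records
var payslipsByEmployeeId = payslipEmployeeIds.ToLookup(id => id);

// A key with entries returns its group; an unknown key returns an empty
// sequence, unlike a Dictionary indexer, which would throw.
Console.WriteLine(payslipsByEmployeeId[1].Count()); // prints 2
Console.WriteLine(payslipsByEmployeeId[9].Count()); // prints 0 - no exception
```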

Collections keyed by reference types

In the above examples I have used integer IDs to keep the relationships between the collections easy to follow, but that isn't required. You can just as easily use reference types as the key types for a HashSet, Dictionary or Lookup, provided the key type has sensible Equals and GetHashCode implementations (or you supply an IEqualityComparer<T>). This is worth remembering because in SQL, where it is available, you will most likely be filtering by primitive columns; in memory you are free to key on whole objects.
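As an illustration (PayrollType here is a hypothetical reference type, not from the examples above): keying works as long as Equals and GetHashCode agree on what counts as the same key.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical reference type used as a key. Overriding Equals and
// GetHashCode means two distinct instances with the same Name are
// treated as the same key by Dictionary, HashSet and Lookup.
class PayrollType
{
    public string Name { get; }
    public PayrollType(string name) => Name = name;

    public override bool Equals(object obj) =>
        obj is PayrollType other && other.Name == Name;

    public override int GetHashCode() => Name.GetHashCode();
}

class Demo
{
    static void Main()
    {
        var set = new HashSet<PayrollType> { new PayrollType("Monthly") };
        // A different instance, but an equal key as far as the set is concerned.
        Console.WriteLine(set.Contains(new PayrollType("Monthly"))); // prints True
    }
}
```

Without the overrides, reference equality applies and the second instance would not be found.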

Notes on optimisation

I am a firm believer that you should profile before optimising, to avoid doing it prematurely. However, there are situations where you just know that thousands or millions of elements will be an issue and can reach for one of these techniques up front. The key is to not let any change made for optimisation (or anything else, for that matter) negatively affect the maintainability of the code.