Resume Ranking using Machine Learning - Implementation

In an earlier post we saw how ranking resumes can save a lot of the time recruiters and hiring managers spend in the recruitment process. We also saw that it lends itself well to lean hiring by enabling selection in small batches.

Experiment – Manually Ranking Resumes

We had developed a game for ranking resumes by comparing pairs, with some reward for the winner. The game didn’t find the level of acceptance we were expecting, so we thought of getting the ranking done by a human expert. It took half a day for an experienced recruiter to rank 35 resumes. Very often the recruiter asked which attribute was to be given higher weightage - experience, location, communication, or compensation?

These questions indicate that every time we judge a candidate by their resume, we assign some weightage to various profile attributes like experience, expected compensation, possible start date etc. Every job opening has its own set of weightages, which are implicitly assigned as we compare the attributes of a resume with the requirements of the job opening.

So the resume ranking problem essentially reduces to finding the weightages for each of the attributes.

Challenge – Training Set for standard ranking algorithms.

There are many algorithms to solve the ranking problem. Most ranking algorithms fall under the class of “Supervised Learning”, which needs a training set consisting of resumes graded by an expert. As we saw earlier, this task is quite difficult, as the grade depends not only on the candidate profile but also on the job requirements. Moreover, we can’t afford the luxury of a human expert training the algorithm for every job opening. We have to use data that is easily available without additional effort. We do have some data for every job opening, as hiring managers screen resumes and select some for interview. It’s easy to extract this data from any ATS (Applicant Tracking System). Hence we decided to use “Logistic Regression”, which predicts the probability of a candidate being shortlisted based on the available data.

We have seen that “Logistic Regression” predicts this probability based on weightages for the various attributes, learned from which resumes were shortlisted or rejected in the past. This probability in our case indicates whether the candidate is suitable or not. We use this number to rank candidates in descending order of suitability.
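To make this concrete, here is a minimal sketch of how a learned model turns each resume into a probability and ranks by it. The weightages and feature values below are made up for illustration; they are not the actual numbers from our experiment.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def suitability(theta, features):
    # theta[0] is the intercept; the remaining weightages align with the features
    z = theta[0] + sum(t * x for t, x in zip(theta[1:], features))
    return sigmoid(z)

# Hypothetical weightages "learned" from past shortlisting decisions:
# intercept, relevant experience (years), expected pay, communication score
theta = [-1.0, 0.8, -0.3, 0.5]

resumes = {
    "A": [5.0, 1.2, 0.9],
    "B": [2.0, 0.8, 0.7],
    "C": [8.0, 2.5, 0.6],
}

# Rank candidates in descending order of predicted suitability
ranked = sorted(resumes, key=lambda r: suitability(theta, resumes[r]), reverse=True)
print(ranked)  # ['C', 'A', 'B']
```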

Available Data

In our company we had access to data of the following 13 attributes for about 3000 candidates that were screened for about 100 openings over the last 6 months.

1) Current Compensation, 2) Expected Compensation, 3) Education, 4) Specialization, 5) Location, 6) Earliest Start Date, 7) Total Experience, 8) Relevant Experience, 9) Communication, 10) Current Employer, 11) Stability, 12) Education Gap and 13) Work Gap.

We needed to quantify some of these attributes, like education, stability and communication. We applied our own judgment and converted the textual data to numbers.
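For illustration, the conversion might look like the sketch below. The ordinal scales here are made up; the real mappings were based on our own judgment and are not part of the article.

```python
# Made-up ordinal scales for textual attributes
EDUCATION = {"diploma": 1, "bachelors": 2, "masters": 3, "doctorate": 4}
COMMUNICATION = {"poor": 1, "average": 2, "good": 3, "excellent": 4}

def encode(resume):
    # Unknown values fall back to 0 rather than raising an error
    return [
        EDUCATION.get(resume["education"].lower(), 0),
        COMMUNICATION.get(resume["communication"].lower(), 0),
    ]

print(encode({"education": "Masters", "communication": "Good"}))  # [3, 3]
```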

Data Cleaning

We were unsure whether we would get consistent results, as we were falling short of historical resume data. We ignored openings that had 10 or fewer resumes screened. On the other hand, we also discovered a problem with large training sets - particularly in the case of job openings that drag on and remain open for long. These job openings are likely to have had a change of requirements. As we learned later, consistent accuracy was obtained for job openings with training sets of 40 to 80 resumes.

Running Logistic Regression

We had listed 22 openings for which several hundred resumes were presented to the hiring managers in the last 6 months. We had records of interviews scheduled based on the suitability of the resumes. We decided to use 75% of the available data to train (Training Set) and 25% to test (Test Set) our model. The program was written to produce the following output-

  • Vector of weightages for each one of the 13 attributes
  • Prediction whether the set of test cases would be “Suitable” or “Unsuitable”

The result was evaluated based on how accurate the prediction was. Accuracy is defined as-

Accuracy = (True Positives + True Negatives) / (Total # of resumes in the Test Set)

Here “True Positives” is the number of suitable resumes correctly predicted to be suitable. Similarly, “True Negatives” is the number of unsuitable resumes predicted as such. We achieved an average accuracy of 80%, ranging from 67% to 95%.
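As a quick sketch (the labels below are made up, not our actual test set), the accuracy calculation is straightforward:

```python
def accuracy(actual, predicted):
    # actual / predicted: 1 = suitable, 0 = unsuitable
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)  # True Positives
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)  # True Negatives
    return (tp + tn) / len(actual)

actual    = [1, 1, 0, 0, 1]   # what the hiring managers actually decided
predicted = [1, 0, 0, 0, 1]   # what the model predicted
print(accuracy(actual, predicted))  # 0.8
```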

Efforts to improve accuracy

Pay vs Experience - Plot of Suitable Candidates

A plot of suitable and unsuitable resumes on Experience vs Pay didn’t show any consistent pattern. The suitable resumes tended to be highly paid individuals with lower experience, which is somewhat counterintuitive. Other than this, the suitable resumes tended to cluster closer to the center of the graph compared to the unsuitable ones.

Given the nature of the plot, the decision boundary would be non-linear - probably a quadratic or higher degree polynomial. We decided to test using a 6th degree polynomial, thus creating 28 attributes from the 2 main attributes, viz. experience and pay. We ran the program again, this time with these 28 polynomial attributes and the remaining 11 attributes, a total of 39 attributes. This improved the accuracy from 80% to 88%. We achieved 100% accuracy for 4 job openings.
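The feature mapping can be sketched as below: all monomials of experience and pay up to total degree 6, which (counting the bias term) gives the 28 attributes mentioned above.

```python
def map_features(x1, x2, degree=6):
    # All monomials x1^i * x2^j with i + j <= degree.
    # For degree 6 this yields 28 terms, the first being the bias term 1.
    feats = []
    for total in range(degree + 1):
        for j in range(total + 1):
            feats.append((x1 ** (total - j)) * (x2 ** j))
    return feats

feats = map_features(2.0, 3.0)
print(len(feats))  # 28
```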

Regularization had no impact on accuracy, hence we didn’t use a cross-validation set for testing various values of the regularization parameter.

Values of the weightages or parameters varied slightly every time we ran the minimization of the cost function, which indicates that the model found a slightly different minimum in the same vicinity on every run, even with no changes to the training set data.

Some Observations

Varying Weightages for Candidate Profile Attributes

Taking a close look at the chart above, we observe the following-

  • One job opening gives an extremely negative weightage to “Current Compensation” - this means that candidates earning well are not suitable, while it’s just the opposite for most other job openings.
  • The C++ Developer position assigns a positive weightage to “Total Experience” but a negative weightage to “Relevant Experience”. The requirement was for a broader skillset beyond just C++.

We can go on verifying the reasons for what turns out to be a fairly distributed set of weightage values across the various attributes. Each job opening has a pretty much independent assessment of the resumes and candidates.

As expected, we observed that accuracy generally increases with the sample size, i.e. the size of the training set. As mentioned earlier, accuracy was observed to be low for job openings that remained open for long and whose selection criteria underwent change.

Ranking Resumes using Machine Learning

In a recent article we saw how ranking resumes can help us keep the WIP within limits to improve efficiency. We also saw an interesting way of achieving this by playing a mobile game. In this article we will see how machine learning can be applied to rank resumes.


This article covers a “Quick and Dirty” way to get started. This is in no way the ultimate machine learning solution to the resume ranking problem. What I did here took me less than a day of programming. It could serve as an example for students of machine learning.

Problem Formulation

We train the machine learning program using a “training set” of resumes which are pre-screened by a human expert. The resume ranking problem can be seen as a simple classification problem: we are classifying resumes into suitable (y=1) or unsuitable (y=0). We know from machine learning theory that such classification problems can be solved using the logistic regression algorithm.

Sigmoid Function Showing Probability of a Resume being Suitable

We know the predictor function yields a value that lies between 0 and 1, as shown in the diagram above. The predictor or hypothesis function hθ(X) is expressed as-

hθ(X) = 1/(1 + e^(-z)), where z = θᵀX

where X is a vector of various features like experience (x1), education (x2), skills (x3), expected compensation (x4) etc., which decide if a resume is suitable or not. The first feature x0 is always equal to 1.
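Numerically, evaluating the hypothesis looks like the sketch below. The parameter and feature values are hypothetical, chosen only to show the mechanics:

```python
import math

# Hypothetical parameter and feature vectors; x0 is always 1
theta = [-3.0, 1.0, 1.0]   # theta0, theta1, theta2
X = [1.0, 2.5, 1.8]        # x0 = 1, x1 = experience, x2 = expected pay

z = sum(t * x for t, x in zip(theta, X))   # z = theta' * X
h = 1.0 / (1.0 + math.exp(-z))             # probability the resume is suitable
print(round(h, 3))  # 0.786
```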

Features & Parameters


hθ(X) can also be interpreted as the probability of the resume being suitable for given X and θ. So the resume ranking problem is essentially solved by evaluating the function hθ(X), with the resume yielding the highest value of hθ(X) getting the top rank.

With this prior knowledge of machine learning and logistic regression, we have to find θ by studying a training set of resumes, some of which were selected as suitable, the remaining ones being unsuitable.

Simplification of the problem

To further simplify the problem, let us not bother about all the attributes like experience, education, skills, expected compensation, notice period etc. while ranking the resumes. As we saw in an earlier post, we need to worry only about the top constraints. We selected the top constraints as those which address “must have” features that are “hard to find”. Another benefit of limiting ourselves to these top constraints is that they can be quickly and easily evaluated by recruiters in short telephonic conversations with the candidates. This makes the process more efficient, as it precedes and serves as a filter before the preliminary interview by the technical panel.

Decision Boundary

The training set is a set of resumes that are already known to be suitable or unsuitable based on past decisions taken by the recruiters or hiring managers. Let us plot the training set for a particular opening based on past records. For the purpose of this article, let us say that resumes are ranked only on the basis of 2 top constraints, viz. relevant experience (x1) expressed in number of years and expected gross compensation per month (x2). The plot would look somewhat like what we see below.

Decision Boundary



If you draw a 45° line cutting the X1 axis at X1=3, it can be seen dividing the training set so that every point below the line represents a suitable resume and every point above it represents an unsuitable one. This line, in machine learning terms, is called the decision boundary. We can say that all the points on this line represent resumes whose probability of being suitable is 0.5. This is also where z=0, as we have seen in the diagram above showing the sigmoid function-


This corresponds to the point on the sigmoid function where

z = 0

Replacing z with θᵀX gives

θᵀX = 0

-3 + x1 + x2 = 0   - which represents the Decision Boundary

Gradient Descent 

Though we have visually plotted the decision boundary, it may not be the best fit for the training set data. To get the best fit, we can use gradient descent to minimize the error represented by the following equation-

J(θ) = -(1/m) Σ(i=1 to m) [ y(i)*log(hθ(x(i))) + (1 - y(i))*log(1 - hθ(x(i))) ]

- where m is the number of instances in the training set and x(i) is a vector representing x0, x1, x2 for the ith instance in the training set of resumes. y(i) takes the value 1 if the ith instance was suitable and 0 otherwise. We are trying to minimize the function J(θ) by finding the value of θ that minimizes this error function. Here θ is a vector of θ0, θ1 and θ2.

We can minimize J(θ) by iteratively replacing θ with new values as follows. Each iteration takes a step of length α down the slope, till we reach the minimum where the slope is zero.

θj := θj - (α/m) Σ(i=1 to m) (hθ(x(i)) - y(i)) * xj(i)   - simultaneously for every j


We wrote the code to execute this in Octave, as it offers well-tested, readily available implementations of the vector algebra these machine learning algorithms need. There are libraries available in Python and Java to build a more robust “production grade” implementation.
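For readers who want to follow along in Python, here is a rough, self-contained sketch of the same gradient descent on a made-up toy training set (this is not the actual Octave code or data from our experiment):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_descent(X, y, alpha=0.1, iterations=5000):
    # X: feature vectors, each starting with x0 = 1; y: 1 = suitable, 0 = unsuitable
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iterations):
        grad = [0.0] * n
        for xi, yi in zip(X, y):
            err = sigmoid(sum(t * x for t, x in zip(theta, xi))) - yi
            for j in range(n):
                grad[j] += err * xi[j]
        # Simultaneous update: theta_j := theta_j - (alpha/m) * sum((h - y) * x_j)
        theta = [t - alpha * g / m for t, g in zip(theta, grad)]
    return theta

# Toy data: [x0, experience, pay]; suitable (1) roughly when experience + pay is low
X = [[1, 1, 1], [1, 2, 0.5], [1, 0.5, 1], [1, 2, 2], [1, 3, 1], [1, 2.5, 2]]
y = [1, 1, 1, 0, 0, 0]

theta = gradient_descent(X, y)
predicted = [1 if sigmoid(sum(t * x for t, x in zip(theta, xi))) >= 0.5 else 0 for xi in X]
print(predicted)
```

On this linearly separable toy set the learned θ reproduces the labels; real resume data would of course be noisier.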

Limitations and roadmap for further work

The logistic regression algorithm is useful only if you have a reasonably large training set - at least 25 to 30 resumes. We also need the same selection criteria throughout for the algorithm to work - hence you can’t reuse training sets across different job positions. There are some “niche” positions where it’s impossible to find enough resumes; it’s both difficult and unnecessary to implement machine learning in such cases.

There are many “to-dos” before this program can be made truly useful. We need to use more features - particularly those which are “Must Have” types. We also need more iterations of gradient descent with different values of α. Lastly, we need more resumes in the learning set to be able to further break it down into a training set, validation set and test set.


It’s particularly challenging to rank 20 or more resumes, even when the ranking is based on only 2 or 3 attributes. Recruiters often skip this step, as it tends to be tedious, and end up wasting a lot of hiring managers’ time. It’s an error-prone process if a junior recruiter is assigned the task. By automating resume ranking, we hope to avoid human error. We also hope to get early feedback and an improved understanding of the important attributes or top constraints by limiting the shortlist to the top 3 resumes. Lastly, it takes a few seconds for this crude machine learning program to rank 20 resumes - something that would take 10 minutes for an experienced recruiter.


Takeaways from Agile 2013 - Part 4 of 4: Data Migration Issues Addressed by NoSQL Databases

This session by Rebecca Parsons was very insightful. Data models change as products evolve through iterations. Scott Ambler says that relational databases evolve in an agile manner through refactoring/migration pairs in small steps. Like everything else, data changes. Our understanding and access patterns change, requiring database migration.

Code changes can be easily managed in version control repositories. Data is not version controlled - but data models are. Developers need to meticulously create and maintain data migrations so that data can be rolled forward or backward in sync with the code. Developers also need to provide default values for columns in records created by older versions where those columns didn’t exist. Overall, data migration is a hairy problem.

One would tend to think that data migrations for Big Data would be a bigger problem. NoSQL databases are characterized as non-relational, schemaless, cluster-friendly, open source and built for the 21st century web. You have RavenDB, CouchDB and MongoDB, which work with documents; HBase and Cassandra, which work with columnar data; Riak and Redis, which work with key-value pairs; and Neo4j, which works with graphs. Each of these has a different way of dealing with migrations.

NoSQL databases like MongoDB provide a clean way to address this problem. The loose structure of MongoDB allows data from multiple versions to co-exist in the same database. All data doesn’t have to look the same. This makes evolution easier- e.g. no change in data is needed to add a field. Thus there is no need to run large scale migration to roll a version forward or backward.
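A tiny sketch of the idea, with plain Python dicts standing in for MongoDB documents (the field names are made up):

```python
# Two schema versions co-existing in the same collection, as MongoDB allows
docs = [
    {"_id": 1, "name": "Alice"},                     # old version: no "status" field
    {"_id": 2, "name": "Bob", "status": "active"},   # new version
]

def read_user(doc):
    # Apply a default lazily at read time instead of running a bulk migration
    return {"name": doc["name"], "status": doc.get("status", "unknown")}

users = [read_user(d) for d in docs]
print(users)
```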

This does not mean that migrations are never needed. You do require a migration to add a non-sparse index in MongoDB. Migrations in graph databases like Neo4j are a bit more complex. However, we can conclude that NoSQL databases can be modeled for easier migration.

Tips on Emergent Design - Takeaways from Agile 2013 - Part 3 of 4

I always attend some technically heavy sessions. Often some of what I hear goes over my head - but whatever I learn from the rest is worth the time. Neal Ford’s session at this conference was one of these.

I am going to highlight some of the learnings in the form of tips, listed below in no particular order-

Emergent Design - waiting till the last responsible moment to take design decisions. Often trying to predict leads to over-engineering. Prefer proactive to predictive or reactive. Over time a component starts taking on more responsibility - that is when you should start tracking its last responsible moment. Another case of over-engineering happens when objects start looking too much like real-life objects - Neal calls them anti-objects. Emergent design also needs developers to find abstractions defined by idiomatic patterns - both domain and technical. Developers should realize that up-front design can address knowns and known unknowns; it can’t address unknown unknowns. Waiting till the unknowns become known is the only solution. Emergent design also requires documentation to remain in sync with the code. Neal says the best way is to write code that reads like a design document.

Importance of Ruby - It’s easy to write Java unit tests in Ruby. Prefer language-based build tools like Buildr over XML-based build tools like Maven. Generally, Ruby is a good language for build and test engineers. Ruby has much simpler monkey patching, which works much like point-cuts in AOP, allowing testers a sneak peek at what is going on at the entry and exit points of important methods or code blocks.
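As a rough illustration of that entry/exit interception idea (the session’s context was Ruby; this sketch uses Python, and the class is made up):

```python
# Monkey-patch a method to observe its entry and exit points,
# similar in spirit to AOP point-cuts.
class PaymentService:
    def charge(self, amount):
        return f"charged {amount}"

calls = []
_original_charge = PaymentService.charge

def traced_charge(self, amount):
    calls.append(("enter", amount))          # sneak peek at the entry point
    result = _original_charge(self, amount)
    calls.append(("exit", result))           # ...and at the exit point
    return result

PaymentService.charge = traced_charge        # the monkey patch

print(PaymentService().charge(10))           # charged 10
print(calls)
```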

Cyclomatic Complexity (CC) and Afferent Coupling (AC) - CC and AC are good metrics to sniff out refactoring opportunities. Once you see these opportunities, you can harvest them by abstracting out dependencies and exposing APIs for the same. The CKJM tool measures and reports CC and AC at the class level. Events such as merging code bases or a reduction in test coverage introduce CC. CC per line peaks near release time as everyone rushes to ship.

Importance of TDD - We need to write tests for all code, including private methods. TDD helps you zero in on your debugging efforts at the brick level instead of the building level. It has been observed that code written using TDD has a Cyclomatic Complexity of 2, vs 10 for code written without TDD.

Technical Debt- The longer the time taken to put to use a feature that is ready more the Technical Debt incurred. Developers often try to introduce genericness in their code-which increases software entropy leading to accidental complexity. Sonar is a build and automation product- which has a technical debt calculator. Estimating technical debt helps in clearing it. But the real trick it does it that the team starts negotiating repayment with customer or management. Runaway technical debt and management’s insistence on sticking to a particular technology hinders emergent design.