My Data-Powered Skills-Based Job Search — Part 1
Dodging the mindless search and finding tech jobs that fit, with data doing the heavy lifting
I’ve spent much of the last decade working as a Data Scientist in Recruitment Tech. Data mining, building taxonomies, and visualisations have been my speciality. I’ve also done creative work with Recruiting Brainfood, including extracting a community (Hall of Fame) and archive (Larder) from the industry-leading channel.
After taking some time out to travel recently, I found myself with the bandwidth to look for a new gig. I checked out job boards and was disappointed that most of them are still about filling in a couple of text fields, making a large coffee, and bracing yourself for another mindless scanning session.
So, I decided to turn this into a data project and use techniques I’d learned from the other side, some of which didn’t surface in client products because they just didn’t fit the standard job board model. Also, writing it up here in case it’s interesting to others. In parts, since it’s ongoing.
As well as surfacing great matches, I hoped to learn how the whole tech landscape is shifting over time. Which job titles or skills were up and which were down (a little like the skillsheat project I’d worked on for a while). Should I steer my career more towards GenAI?
Talking of AI, this felt like a good place to get familiar with using LLMs on real data, for a real user with a mission (me).
1. Getting stuck in
The data-driven approach obviously relies on a volume of data that I can quality control and add to continuously for recurring value.
So I went to the trouble of setting up a crawling pipeline. Fortunately, job posts have to be public for maximum exposure. You don’t want to put ads behind walls.
Crawling the raw data
The first step was to build up a large database of job postings. It takes time to get enough for really solid data mining. Currently, I have a MongoDB database with 1.6M posts, each with a headline, description, source, and crawl timestamp.
From this, I wanted to extract clean job titles and job skills. Pulling clean, standardised job titles from job post headlines is a hairy but essential task if you want to get anywhere with them.
Job post headlines are messy. They contain job titles but often a load of other crap. So, I needed a clean list of job titles and their aliases. Rather than take a standard list, I wanted a highly-granular list of titles specific to my field — tech — actually observed in the wild right now.
Those informative key phrases in the titles and job descriptions, usually called skills, also needed to be extracted. At its simplest, you can just exploit the conventional structure of job descriptions for this: bullet points, lists and leading phrases (‘Experience with…’).
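As a minimal sketch of exploiting that structure (not the actual pipeline code), the snippet below keeps only bulleted or numbered lines, strips a leading phrase, and splits what remains into candidate phrases. The bullet markers, the lead-in phrases, and the name candidate_phrases are all illustrative assumptions; real postings would need a longer list and a filtering pass afterwards.

```python
import re

# Assumed bullet styles: "-", "*", "•", or a number followed by "." or ")".
BULLET = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+(.*\S)\s*$")
# Assumed lead-in phrases; a real list would be longer.
LEAD_IN = re.compile(
    r"^(?:experience (?:with|in)|knowledge of|proficiency in|familiarity with)\s+",
    re.IGNORECASE,
)
# Split list-like text on commas and on "and" / "or".
SPLIT = re.compile(r",\s*|\s+and\s+|\s+or\s+")


def candidate_phrases(description: str) -> list[str]:
    """Return candidate skill phrases found on bulleted lines of a description."""
    phrases = []
    for line in description.splitlines():
        match = BULLET.match(line)
        if not match:
            continue  # prose lines are ignored in this sketch
        text = LEAD_IN.sub("", match.group(1)).rstrip(".;")
        for part in SPLIT.split(text):
            part = part.strip()
            if part:
                phrases.append(part)
    return phrases


sample = """About us: we love synergy.
- Experience with Python, SQL and Airflow
* Knowledge of AWS or GCP.
2) Strong communication
"""
print(candidate_phrases(sample))
```

The output is deliberately noisy (soft skills come through too), which is why the candidates still need filtering against a curated list.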
I selected a few boards focused on the tech field without hanging on one in particular. Since job boards are not so different in their structure, I can reuse much of the code for them.
By scheduling Playwright scripts to crawl daily, I can apply promptly for anything that is interesting. Job ads can also be pulled quickly, so in any case I want to capture the description before it disappears. It’s straightforward engineering, although I still got to have fun with rate limits, promoted posts, reposts, subtle site changes, and quirky listings traversals.
Extracting and filtering the entities
Then came the crunchy part: shelling the job titles from the job headlines and picking out the key phrases, aka skills, from the job descriptions.
Also known as entity extraction, this is where engineers start to squirm since it’s fuzzy, hacky, and case-specific. It can feel like a dark art, drawing on domain knowledge, previous experience, and a dash of slightly obsessive focus.
However, if you don’t do it then it bites you later. You’ll be treating GCP separately from Google Cloud Platform. Fuzzily mixing up AI, BI, and CI. Listing Scikit-learn, Sci-kit learn, sklearn as cousins.
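One way to picture the fix is a curated alias table with exact lookup on a lightly normalised key. This is only a sketch under assumptions: the table contents, the _key helper, and normalise are hypothetical names, and the real lists were built by hand over several passes.

```python
import re

# Hypothetical alias table: canonical name -> observed variants.
ALIASES = {
    "Google Cloud Platform": ["gcp", "google cloud"],
    "scikit-learn": ["sklearn", "sci-kit learn", "scikit learn"],
    "AI": ["artificial intelligence"],
    "BI": ["business intelligence"],
    "CI": ["continuous integration"],
}


def _key(text: str) -> str:
    """Lowercase and collapse hyphens, underscores and whitespace to one space."""
    return re.sub(r"[\s\-_]+", " ", text.strip().lower())


# Flatten the table so every variant (and the canonical name) maps to the canonical name.
LOOKUP = {}
for canonical, variants in ALIASES.items():
    for name in [canonical, *variants]:
        LOOKUP[_key(name)] = canonical


def normalise(skill: str) -> str:
    """Map a raw skill string to its canonical form; unknown skills pass through."""
    return LOOKUP.get(_key(skill), skill.strip())


for raw in ["GCP", "Sci-kit learn", "sklearn", "AI", "BI", "CI"]:
    print(raw, "->", normalise(raw))
```

Exact lookup is the design choice here: a fuzzy matcher based on edit distance would happily merge AI, BI, and CI, since they differ by a single character.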
Algorithms cannot bypass this step, but they can make it scalable for a human with a well-crafted workflow. From a couple of passes, I now have 557 job titles and 1105 skills. I actually want to get those numbers down since the lists probably still contain some duplicates which should be merged.
Some duplicates are obvious, but others can be subjective. Software Engineer is probably the same as Software Developer, but is Data Architect the same as Data Engineer? In those cases, it turns out we can actually use the data to help settle things.
Using the data to help normalise the data
The general principle I’m applying here is that two job titles are likely to mean the same thing if they are associated with the same balance of skills.
A picture can help. Here is a simple example of Front End Developer compared to Front End Engineer. Each title has the skills it appears with below it. The skills are ordered by their count, shown by the width of their bars. To help compare the ordering, the bars are also matched with a link.
We see at a glance that the same skills come up near the top for both titles, so we are confident they can be merged. Since Front End Developer is more than twice as common as a title, we should make Front End Engineer an alias for it.
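The "same balance of skills" check can also be put into a number. The sketch below is one possible measure, not necessarily the one behind the chart: it converts each title's skill counts into shares and sums the overlapping share (histogram intersection). The counts are made up for illustration, and balance_overlap is a hypothetical name.

```python
from collections import Counter


def skill_shares(counts):
    """Convert raw skill counts into shares that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {skill: n / total for skill, n in counts.items()}


def balance_overlap(a, b):
    """Histogram intersection: 1.0 for an identical skill balance, 0.0 for no shared skills."""
    sa, sb = skill_shares(a), skill_shares(b)
    return sum(min(sa[s], sb[s]) for s in sa.keys() & sb.keys())


# Made-up counts, for illustration only.
front_end_developer = Counter({"JavaScript": 50, "React": 30, "CSS": 20})
front_end_engineer = Counter({"JavaScript": 20, "React": 15, "CSS": 10, "TypeScript": 5})
data_engineer = Counter({"SQL": 40, "Python": 40, "JavaScript": 20})

print(balance_overlap(front_end_developer, front_end_engineer))  # high: merge candidates
print(balance_overlap(front_end_developer, data_engineer))       # low: keep separate
```

Any threshold for "close enough to merge" would still be a judgement call, so a score like this is best used to queue up candidate pairs for a human to confirm.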
Why do we do this? It makes the stats stronger and averts a UI that asks a user to identify as one or the other.
We could also order the skills by how strongly they are statistically associated with the titles. If we do this then the really specific ones jump out (association also shown by shading).
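The post does not name the association measure, so the following sketch assumes pointwise mutual information (PMI) as a stand-in: it compares how often a skill appears with a title against how often that would happen by chance. All counts are invented, and pmi and rank_by_association are hypothetical names.

```python
import math


def pmi(pair_count, title_count, skill_count, total_posts):
    """PMI = log2( P(title, skill) / (P(title) * P(skill)) ), computed from raw counts."""
    if pair_count == 0:
        return float("-inf")  # never seen together
    return math.log2(pair_count * total_posts / (title_count * skill_count))


def rank_by_association(title_count, total_posts, skills):
    """Rank skills for one title; skills maps name -> (count with title, count overall)."""
    scores = {
        name: pmi(pair, title_count, overall, total_posts)
        for name, (pair, overall) in skills.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Invented counts: 1000 posts in total, 100 of them carrying the title.
skills = {"Communication": (50, 500), "React": (60, 150)}
ranked = rank_by_association(title_count=100, total_posts=1000, skills=skills)
print(ranked)
```

With these numbers a generic skill such as Communication scores zero (it appears with the title exactly as often as chance predicts), while React scores well above it. PMI tends to overrate rare skills, so a minimum count cut-off is a common companion.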
I soon realised this has several other valuable uses as a byproduct, e.g.
- Scanning your own job title and checking you have good coverage over the in-demand skills.
- Scoping out possible adjacent roles for a career move, seeing which skills were portable and which might need acquiring.
So, I spent a little time polishing this up and pushed it out: Tech Title Breakdown.
Going from job titles to skills-based profiles
Job titles are a good starting point, but to make it really personalised, I need to hand-pick the skills for a tailored skill profile (there are even different flavours of Data Scientist).
I can select skills from the bars in the job title breakdown, but to go all-in with a skills-based profile I needed a totally skill-centric view. One that shows you your selection plus other skills in the neighbourhood. Fortunately, I have some experience with such a view.
The Skills Graph carries your selection over and shows you the skills related to it for further expansion. The size of the skills shows their demand, and the width of the links shows how strongly they are associated.
With an updated model, I can track emerging skills — or spot the ones falling out of favour — keeping my profile relevant.
2. Better matching
Circling back to the original motivation, I now have a skill profile which I can use for highly-granular matching to job posts in the database. The more tags that match, as a percentage and an absolute number, the better the fit.
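A minimal sketch of that scoring follows. It assumes the percentage means the share of a post's tags that my profile covers, and that posts are ranked by share first and absolute count second; both are assumptions, as are the names match_score and rank_posts and the sample data.

```python
def match_score(profile, post_skills):
    """Return (matched count, share of the post's skills covered by the profile)."""
    matched = profile & post_skills
    share = len(matched) / len(post_skills) if post_skills else 0.0
    return len(matched), share


def rank_posts(profile, posts):
    """Order post ids by share matched, then by absolute matches, best first."""
    def sort_key(post_id):
        count, share = match_score(profile, posts[post_id])
        return (share, count)
    return sorted(posts, key=sort_key, reverse=True)


# Illustrative profile and posts only.
profile = {"Python", "SQL", "NLP", "MongoDB"}
posts = {
    "post-a": {"Python", "SQL", "Airflow", "Spark"},
    "post-b": {"Python", "NLP", "MongoDB"},
    "post-c": {"Java"},
}
print(rank_posts(profile, posts))
```

Keeping both numbers visible matters: a post with one tag that happens to match scores 100% on share alone, which is why the absolute count is kept as a tie-breaker and shown next to it.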
How can I use AI appropriately to get the best benefit from it?
The most compelling use is to summarise the job description, distilling it down to just the info I care about. Highlighting the tech stack and ditching the boilerplate and waffle.
If I maintain a textual brief of my skills, I can generate an embedding of my ideal role and then find job posts that are close to it in the embedding space, i.e. posts that are talking about the same sort of thing (according to the LLM; I used OpenAI).
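To make "close in the embedding space" concrete, here is a sketch that ranks posts by cosine similarity to the brief. The tiny three-dimensional vectors are placeholders standing in for real embeddings, which would come from an embedding model and have far more dimensions; no API call is made here, and nearest_posts is a hypothetical name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_posts(brief_vec, post_vecs, top_k=2):
    """Return the ids of the top_k posts whose vectors are most similar to the brief."""
    scored = sorted(
        post_vecs.items(),
        key=lambda item: cosine(brief_vec, item[1]),
        reverse=True,
    )
    return [post_id for post_id, _ in scored[:top_k]]


# Placeholder vectors; real ones would be produced by the embedding model.
brief_vec = [1.0, 0.0, 1.0]
post_vecs = {
    "post-a": [0.9, 0.1, 0.8],
    "post-b": [0.0, 1.0, 0.0],
    "post-c": [1.0, 0.0, 0.0],
}
print(nearest_posts(brief_vec, post_vecs))
```

At 1.6M posts a linear scan like this would get slow, so a vector index would be the natural next step; for a PoC with a filtered candidate set it is enough.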
With that, I start to converge on an interface which looks like this:
The matching is transparent about which skills drive it, and the extracted skills and summarised description cut down on that tedious scrolling.
Coming up in Part 2
It’s still just in the PoC stage. It doesn’t cover all the job boards, and the AI use is still being explored. However, I already like the interaction more than a regular job site. It optimises for my user experience by respecting my time and actively helping me refine my search profile.
In the next part, I’ll be leveraging the AI more. Exploring what it can do to make my life as a candidate even easier, e.g.
- Identifying possible deal breakers or selling points. Using them to filter and pitch opportunities to me.
- Refining my profile via a chat interface based on the feedback I give it.
- Achieving the above reliably, without hallucinations or other nonsense.
This moves me away from being a purely passive candidate to one who is always aware of an outstanding opportunity — even if I’m fully engaged in a project.
I’m posting updates on LinkedIn; feel free to connect with me there. Or just drop me an email at si@shellsi.com for a chat. As I say, I have some time right now.
Meanwhile, go ahead and bookmark OnVocation where I’ll be making stable updates.