‘Dark’ Recommendation Engines: Algorithmic curation as part of a ‘healthy’ information diet.

In an ever-growing digital landscape filled with more content than a person can consume in a lifetime, recommendation engines are a blessing but can also be a curse. Understanding their strengths and weaknesses is a vital skill for maintaining a balanced media diet.

9 min read · Sep 4, 2020

If you remember when connecting to the internet involved a squawking modem and images that took 5 minutes to load, then you probably discovered your favourite musician after hearing them on the radio, reading about them in NME or being told about them by a friend. Likewise, you probably discovered your favourite TV show by watching live terrestrial TV, your favourite book by taking a chance at your local library and your favourite movie at a cinema. You only saw the movies that had cool TV ads or rave reviews: you couldn’t afford to take a chance on a dud when one ticket, plus bus fare, plus popcorn and a drink cost more than two weeks’ pocket money.

In the year 2020 you can plug your phone into your car, load up Spotify and instantly access over 40 million songs at the touch of a button. You can watch almost any TV show or movie from the last 60 years from your couch. You can read almost any book ever written for free or next to nothing online (especially if your library has free ebook access like mine). In the space of a few years, our media consumption habits have COMPLETELY changed and that is wonderful and amazing in a utopian, Star Trek “land of plenty” kind of way.

Unfortunately there’s a downside to having access to the entirety of humanity’s collective knowledge at the click of a button. With so much choice, and 3 weeks of video content being added to YouTube every minute, it is easy to become overwhelmed. Humans aren’t good at choices that have too many options. We are overcome with analysis paralysis and, left unchecked, we can waste hours of our lives scrolling Netflix, reading show synopses but never watching any shows. After all, time is precious and a 90 minute movie is a sizeable, non-refundable investment. What if you don’t like it, when there are thousands of hours of other movies that you could be watching instead that could be better? Solving this problem across all sorts of media (news articles, movies, songs, video games) was the original motivation behind recommendation systems.

Recommender Systems 101

Recommendation engines are all about driving people towards a certain type of content — in the use case above, it’s about driving people towards stuff they’ll like so that they feel like they’re getting value out of the platform they’re paying for and they continue to use the platform. There are a few different ways that recommender systems work but here are the basics:

Collaborative Recommendation

If Bob buys nappies (diapers) and Fred buys nappies AND powdered milk, then maybe we should recommend powdered milk to Bob.

The above sentence summarises the underlying theory behind collaborative recommenders. We can build a big table of all of our customers and the products that they bought (or movies that they watched). We can then use a technique called matrix factorization to find sets of products that commonly get consumed together, find users who have already consumed a subset of those products and recommend the missing piece.
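To make the Bob-and-Fred idea concrete, here is a minimal Python sketch. Note that it is not matrix factorization: for readability it swaps in simple basket-overlap counting, which captures the same “people like you also bought” intuition at toy scale. The `purchases` table, Janet’s basket and the `recommend_collaborative` function are all made up for illustration.

```python
from collections import Counter

# Hypothetical purchase histories (users and baskets are illustrative only).
purchases = {
    "Bob": {"diapers"},
    "Fred": {"diapers", "powdered milk"},
    "Janet": {"coffee machine"},
}


def recommend_collaborative(user, purchases, top_n=3):
    """Suggest items bought by users whose baskets overlap with `user`'s."""
    own = purchases[user]
    scores = Counter()
    for other, basket in purchases.items():
        if other == user:
            continue
        overlap = len(own & basket)  # number of items both users bought
        if overlap == 0:
            continue  # no shared taste, so this user tells us nothing
        for item in basket - own:
            # Weight each unseen item by how similar its buyer is to `user`.
            scores[item] += overlap
    return [item for item, _ in scores.most_common(top_n)]


print(recommend_collaborative("Bob", purchases))   # ['powdered milk']
print(recommend_collaborative("Fred", purchases))  # []
```

Janet shares nothing with Bob, so her coffee machine never reaches him here. A real system with millions of users would surface that kind of surprise through weaker, indirect overlaps.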

Collaborative filtering has a neat little surprise up its sleeve: emergent novelty. The chances are that someone you don’t know who has similar taste to you is in a good position to introduce you to new content that you didn’t know you liked. If Bob buys a coffee machine and we recommend it to Fred, the latter user might go “oh wow, I am pretty tired, I hadn’t considered a coffee machine — neat!” Of course this can have the opposite effect too.

Content-based Recommendation

Bob likes The Terminator, which has the properties ‘science fiction’, ‘80s movie’ and ‘directed-by-James-Cameron’, so he might also like “Aliens”.

Content-based recommenders, as the summary above suggests, are all about taking properties of the content and using them to draw similarities with other content that might interest the user. Content-based recommendation is more computationally expensive than collaborative filtering since you need to extract ‘features’ of the things you’re recommending at scale (e.g. you might build an algorithm that looks at every frame of every movie in your collection and checks for cyborgs). It’s also very hard to do feature extraction on physical products, so e-commerce sites tend to stick to collaborative approaches.
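Here is a rough Python sketch of that idea, assuming the features have already been extracted and written down as tags. The `catalogue` and its tags are hand-made for illustration, and Jaccard similarity (shared tags divided by all tags) is just one simple choice of similarity measure, not necessarily what any real platform uses.

```python
# Hypothetical hand-written tags; a real system would have to extract these at scale.
catalogue = {
    "The Terminator": {"science fiction", "80s movie", "directed-by-James-Cameron"},
    "Aliens": {"science fiction", "80s movie", "directed-by-James-Cameron"},
    "Ghostbusters": {"science fiction", "80s movie", "comedy"},
    "Notting Hill": {"romance", "90s movie", "comedy"},
}


def jaccard(a, b):
    """Share of tags two items have in common (0.0 = none, 1.0 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_similar(title, catalogue, top_n=2):
    """Rank every other title by tag overlap with `title`."""
    target = catalogue[title]
    scored = [
        (jaccard(target, tags), other)
        for other, tags in catalogue.items()
        if other != title
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop anything with no overlap at all.
    return [other for score, other in scored[:top_n] if score > 0]


print(recommend_similar("The Terminator", catalogue))  # ['Aliens', 'Ghostbusters']
```

Nothing outside the tags Bob already likes can ever score well, which is why this approach tends towards very similar recommendations.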

Content-based recommenders can sometimes get stuck in an echo-chamber mode of recommending very ‘samey’ stuff all the time — there’s no element of surprise or novelty like you’d get with collaborative filtering.

Hybrid Content-Collaborative Recommendation

Bob likes Terminator, an 80s sci-fi movie; Fred likes Terminator and Aliens, both 80s sci-fi movies; Janet likes Ghostbusters, an 80s sci-fi comedy. Recommend Aliens and Terminator to Janet and Ghostbusters to Bob and Fred.

In this mode of operating, we get the best of both worlds. Terminator and Aliens have a very different tone to Ghostbusters, but there’s a decent chance that Bob and Fred would like it because there’s some ‘feature’ overlap between the three movies (80s, sci-fi).
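One way to picture the blend is as a weighted average of a collaborative score and a content score. The Python sketch below is an assumption-laden toy: the `likes` and `features` tables, the `hybrid_scores` function and the 50/50 `weight` are all invented for illustration, and production systems combine the two signals in far more sophisticated ways.

```python
# Hypothetical data mirroring the Bob / Fred / Janet example.
likes = {
    "Bob": {"Terminator"},
    "Fred": {"Terminator", "Aliens"},
    "Janet": {"Ghostbusters"},
}
features = {
    "Terminator": {"80s", "sci-fi"},
    "Aliens": {"80s", "sci-fi"},
    "Ghostbusters": {"80s", "sci-fi", "comedy"},
}


def jaccard(a, b):
    """Shared tags divided by all tags across both items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hybrid_scores(user, likes, features, weight=0.5):
    """Score unseen items as weight * collaborative + (1 - weight) * content."""
    liked = likes[user]
    # Peers are other users who share at least one liked item with `user`.
    peers = [u for u in likes if u != user and likes[u] & liked]
    scores = {}
    for item, tags in features.items():
        if item in liked:
            continue
        # Content signal: best tag overlap with anything the user already likes.
        content = max((jaccard(tags, features[seen]) for seen in liked), default=0.0)
        # Collaborative signal: fraction of peers who like this item.
        collab = sum(item in likes[p] for p in peers) / len(peers) if peers else 0.0
        scores[item] = weight * collab + (1 - weight) * content
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


for name in likes:
    print(name, hybrid_scores(name, likes, features))
```

With this toy data Janet has no peers at all, so her Terminator and Aliens suggestions come entirely from the content signal. Bob gets Aliens ranked first (Fred vouches for it and the tags match) with Ghostbusters further down on tag overlap alone.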

Hybrid recommendation is also pretty useful when you have limited information about your users because they only just joined or they haven’t used your system very much yet (this is known as the cold start problem). For example, if a new user, Rachael, comes along we can’t use collaborative filtering because we don’t know what films she likes and what other users with her taste have watched. However, we could give her an on-boarding questionnaire and if she tells us she likes 80s sci-fi but not comedy then we can recommend Aliens and Terminator but not Ghostbusters. The more we learn about her, the better these recommendations will get.
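Rachael’s questionnaire can be sketched in a few lines of Python. This assumes her answers arrive as two sets of tags; the `features` table and the `cold_start_recommend` function are hypothetical names used only for this example.

```python
# Hypothetical tags for the three films in the running example.
features = {
    "Terminator": {"80s", "sci-fi"},
    "Aliens": {"80s", "sci-fi"},
    "Ghostbusters": {"80s", "sci-fi", "comedy"},
}


def cold_start_recommend(liked_tags, disliked_tags, features):
    """Rank items for a brand-new user from questionnaire answers only."""
    ranked = []
    for item, tags in features.items():
        if tags & disliked_tags:
            continue  # skip anything carrying a tag the user rejected
        matches = len(tags & liked_tags)
        if matches:
            ranked.append((matches, item))
    # Most matching tags first; ties broken alphabetically for stable output.
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item for _, item in ranked]


# Rachael likes 80s sci-fi but not comedy.
print(cold_start_recommend({"80s", "sci-fi"}, {"comedy"}, features))
# ['Aliens', 'Terminator']
```

Once Rachael has watched a few films, her history can start feeding the collaborative side as well.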

Manipulation and ulterior motive: the dark side of recommendation engines

Recommendation engines are a great way to introduce people to movies, songs, news articles and even physical products that they might be interested in. But what if the motivation behind your recommendation system is no longer to make the user happy? As long as we have a large, consistent set of data relating products (movies/songs/books etc) to users, we can train a recommendation engine to optimise towards whatever goal that data encodes. We could train a recommender that always makes terrible recommendations by flipping the dataset we collected about what users like. That’s not a particularly useful exercise, but it could be fun.

What if the recommendation engine serving up your news articles isn’t optimised to show you what you like but in fact is optimised to show you more of what keeps you engaged? There may be some overlap here but the distinction is key. All the system’s owner would need to do is collect a table of content that the user likes or comments on or shares.
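The distinction is easy to see in a toy Python example. Below, the same ranking function is pointed at two different objectives. The articles, their counts and both scoring formulas (`enjoyment` and `engagement`) are invented purely for illustration and are not taken from any real platform.

```python
# Hypothetical articles with made-up interaction counts.
articles = [
    {"title": "Cats in boxes", "likes": 120, "comments": 5, "shares": 10, "angry_reacts": 0},
    {"title": "Library extends opening hours", "likes": 40, "comments": 2, "shares": 3, "angry_reacts": 0},
    {"title": "Columnist brands millennials lazy", "likes": 15, "comments": 300, "shares": 80, "angry_reacts": 240},
]


def enjoyment(article):
    """A crude proxy for 'did people like this?'."""
    return article["likes"] - article["angry_reacts"]


def engagement(article):
    """A crude proxy for 'did people interact with this?', whatever the mood."""
    return (
        article["likes"]
        + article["comments"]
        + article["shares"]
        + article["angry_reacts"]
    )


def rank(articles, objective):
    """Order titles from highest to lowest score under the chosen objective."""
    return [a["title"] for a in sorted(articles, key=objective, reverse=True)]


print(rank(articles, enjoyment)[0])   # Cats in boxes
print(rank(articles, engagement)[0])  # Columnist brands millennials lazy
```

The ranking code is identical in both cases. Only the number it is asked to maximise has changed.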

The phrase “there’s no such thing as bad press” is a lot older than social media but has never been more relevant. For decades, traditional print media outlets have used bad news and emotive content to sell more papers. Journalists have become experts at politicising and polarising everything from avocados to Gen Z. Online news outlets use a similar mechanism.

Online news outlets don’t make money from selling print media but from selling space on their websites for showing adverts and they get paid for every person who clicks on an advert. It’s probably only 1 in 1000 people that clicks on an ad but if 100,000 people read your article then maybe you’ll get 100 clicks. This has given rise to “clickbait” headlines that use misleading exaggeration to pull users in to what is more often than not an article of dubious or no interest. Clickbait, at least, is usually fairly easy to detect since the headlines are pretty formulaic and open ended (that’s my one neat trick that journalists hate me for).

Social networks, like online news outlets, also make money from driving users towards adverts. Most people would read a news article once and close the page, and 1 in 1000 of them might click a relevant advert while they’re at it. However, users typically spend a lot more time on a social network site: liking their neighbour’s cat picture, wishing their great aunt a happy birthday, getting into arguments and, crucially, clicking adverts. The longer you spend on the social network site, the more adverts you’re exposed to and maybe, just maybe, if you see the picture of the new coffee machine enough times you’ll finally click buy.

So how can social networks keep users clicking around for as long as possible? Maybe by showing them content that piques their interest, that they respond emotionally to, that they want to share with their friends and comment on. How can they make sure that the content that they show is relevant and engaging? Well they can use recommendation engines!

A recipe for a “dark” recommendation engine

In order to train a pretty good hybrid recommendation engine that can combine social recommendations with “features” of the content to get relevant data we need:

Information about users: what you like, what you dislike, what you had for breakfast (they know it was a muffin and a latte from that cute selfie you uploaded at Starbucks this morning), what your political alignment is (from when you joined the “Socialist memes for marxist teens” Facebook group). CHECK
Information about the content: what’s the topic? Does it use emotive words or swearing? Does it have a strong political alignment either way? Using Natural Language Processing they can automatically find all of this information for millions of articles at a time. CHECK
Information about users who interact with certain content: they know who commented on what. They know that the photo of your breakfast got 25 likes and 2 comments and that the news article in the Washington Post about Trump got 1500 likes, 240 angry reacts and 300 comments. They also know that 250 of the 300 comments were left by people from the left wing of politics. CHECK

That’s all they need to optimise for “engagement”. A hybrid recommendation engine can learn that putting pro-Trump articles in front of people who like “Bernie 2020” is going to drive a lot of “engagement”, and it can learn that displaying articles branding millennials as lazy and workshy in front of 20-to-30-somethings is going to drive a lot of “engagement” too.
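As a rough illustration of how that learning could happen, the Python sketch below averages past interactions for each pairing of user leaning and article leaning, then ranks a feed by the learned averages. Everything here is hypothetical: the `log` numbers are invented, and `learn_engagement` and `rank_feed` are stand-ins for what would really be a far larger statistical model.

```python
from collections import defaultdict

# Hypothetical log rows: (user's leaning, article's leaning, interactions recorded).
log = [
    ("left", "right", 9),
    ("left", "right", 7),
    ("left", "left", 3),
    ("left", "neutral", 1),
    ("right", "left", 8),
    ("right", "right", 4),
    ("right", "neutral", 1),
]


def learn_engagement(log):
    """Average interactions seen for each (user leaning, article leaning) pair."""
    totals = defaultdict(lambda: [0, 0])  # pair -> [sum of interactions, row count]
    for user_lean, article_lean, interactions in log:
        totals[(user_lean, article_lean)][0] += interactions
        totals[(user_lean, article_lean)][1] += 1
    return {pair: total / count for pair, (total, count) in totals.items()}


def rank_feed(user_lean, articles, model):
    """Order articles by the engagement the model expects from this user."""
    return sorted(
        articles,
        key=lambda a: model.get((user_lean, a["lean"]), 0.0),
        reverse=True,
    )


model = learn_engagement(log)
feed = [
    {"title": "Council fixes potholes", "lean": "neutral"},
    {"title": "Left-leaning op-ed", "lean": "left"},
    {"title": "Right-leaning op-ed", "lean": "right"},
]
print([a["title"] for a in rank_feed("left", feed, model)])
# ['Right-leaning op-ed', 'Left-leaning op-ed', 'Council fixes potholes']
```

Nobody told the model to provoke anyone. In this made-up log the provoking pairings simply scored highest, so they float to the top.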

Recommendation engines can also learn to only ever share left-wing content with left-wing people, and likewise for right-wingers, creating an echo-chamber effect. Even worse, articles containing misinformation can be promoted to the top of everyone’s “to read” list because of the controversial response they will receive.

These effects contribute to the often depressing and exhausting experience of spending time on a social media site in 2020. You might come away miserable but the algorithm has done its job — it’s kept a large number of people engaged with the site and exposed them to lots of adverts.

Good news everyone!

Let’s face it, it’s not all bad: I love pictures of cats sat in boxes and the algorithms have learned this. Spotify has exposed me to a number of bands that I absolutely love and that would never get played on the local terrestrial radio station I periodically listen to in the car. I’ve found shows and books I adore on Netflix and Kindle. I’ve found loads of scientific papers that were very relevant for my research into NLP using sites like Semantic Scholar.

I guess it’s also worth noting that the motivation of media platforms like Netflix and Spotify is to help you enjoy yourself so that you keep paying your subscription, as opposed to ‘free’ social sites that are happy to make you miserable if it means that you’ll use them for longer.

The aim of this article was to show you how recommendation engines work and why the motivation for building them is SO IMPORTANT. I also wanted to show you that it’s important for us to diversify our information intake beyond what the big social media platforms spoon-feed us.

You can use sites like Reddit where content is aggregated by human votes rather than machines (although fair warning, controversial material can still be disproportionately represented and certain subreddits might depress you more than your social media feed).

You can use chronological social media systems like Mastodon that don’t shuffle content around hoping to get you to bite on something juicy. I can also recommend RSS readers like Feedly, which aggregate content from blog sites in chronological order with minimal interference.

Finally I want to issue a rallying cry to fellow machine learning engineers and data scientists to really think about the recommendation systems that you’re building and the optimisation mission you’ve been set. Would you let your family use it or would it make them miserable? Be responsible and be kind.

Originally published at Brainsteam.