In 2016, I led a small team at Instagram that designed and built one of the largest content distribution experiments in history: the introduction of a personalized ranking algorithm to the platform’s (then) 500 million users. Anticipating controversy, we spent the next several years scientifically measuring differences between people receiving this evolving “recommendation algorithm” (as it’s sometimes called) and a small randomly chosen group receiving the reverse-chronological feed employed since Instagram’s inception. 

Those differences suggested that the new algorithm overwhelmingly improved the experience across every aspect of the app.

While I remain confident that algorithmic ranking is the best choice for social media platforms, it is not without downsides. To name a few: increased platform control over content distribution, opaque operating criteria, risks of promoting harmful content, and general user frustration. Those downsides recently led Twitter’s potential future owner, Elon Musk, to call for “open sourcing the algorithm.”

As an engineer, I find this idea overly simplistic, given how little open sourcing a machine-learning model tells us about its effects. But the call for transparency is valid, and it can begin with disclosure of experiments similar to the one I led at Instagram. Useful transparency, I’d argue, lies in open-source experimentation rather than open-source algorithms.

I am not proposing what should be done with the information that comes from open-source experimentation; rather, this article is a starting point for thinking about transparency in the context of modern ranking systems. In it, I discuss why experimentation is both essential in algorithmic ranking and a better focus in future efforts to demystify content distribution on social media. 

Modern algorithms prioritize the “most interesting” content

Most social platforms have much more content than anyone could reasonably consume.

Instagram launched in 2010 with a reverse-chronological feed, which displays the newest “connected” content (meaning content from people you choose to follow) at the top of a user’s feed. After six years, the average user was seeing only 30% of their connected content. Attention spans are fixed, so we reasoned this amount represented the natural limit of what an average person wanted to consume. The purpose of introducing algorithmic ranking was to make that 30% the most interesting content rather than the most recent. Other platforms like TikTok, YouTube, and Twitter have their own ratios (i.e., they make different amounts of content available), but the approach of selecting the most interesting content given a fixed attention span is the same.

The choice of exactly how a ranking algorithm distributes content dictates the meaning of “most interesting.” One option is to keep ranking unpersonalized: every user who is eligible to see the same set of content sees it in the same order. Algorithms built to show the most-liked content first, or choose the most beautiful photos, or even highlight “editor’s picks” all fall into that category. But taste itself is highly personalized; two different users who follow the same people will nonetheless prefer different content. Unpersonalized ranking fails to capture “most interesting” at the scale of billions of users.

Modern ranking algorithms, by contrast, are personalized: The algorithm makes different content selections depending on who’s browsing. It’s impossible to read a user’s mind and know their precise preferences, but a machine-learning model can draw on past behavior to predict answers to questions like, “If you were to see this content, what’s the chance you would like it, comment on it, share it, watch it, skip it, or report it?”

Algorithmic ranking combines these predictions with extensive business logic (e.g., diversifying content, biasing against hateful content, promoting content from lesser-known accounts) to form the basis for determining the most interesting content for a given user.
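
For the technically inclined, here is a rough sketch of how predictions and business logic might combine into a single ranking score. The prediction names, weights, and adjustment rules are illustrative assumptions on my part, not any platform’s actual formula:

```python
# Illustrative sketch only: the prediction names, weights, and adjustment
# rules below are invented, not any platform's real ranking formula.
from dataclasses import dataclass


@dataclass
class Predictions:
    p_like: float     # P(user likes this post if shown)
    p_comment: float  # P(user comments)
    p_share: float    # P(user shares)
    p_report: float   # P(user reports it as harmful)


def rank_score(preds: Predictions, author_followers: int, author_shown_recently: bool) -> float:
    """Blend engagement predictions, then apply business-logic adjustments."""
    value = 1.0 * preds.p_like + 2.0 * preds.p_comment + 3.0 * preds.p_share
    value -= 50.0 * preds.p_report      # bias strongly against likely-harmful content
    if author_followers < 10_000:
        value *= 1.1                    # promote lesser-known accounts
    if author_shown_recently:
        value *= 0.8                    # diversify: demote accounts shown recently
    return value

# Candidate posts are sorted by rank_score, highest first, to build the feed.
```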

Why “open sourcing” the algorithm doesn’t work

Here’s my understanding of what people calling for open-source algorithms envision: If we publish the internal source code and weights of machine-learning models involved in ranking, then engineers, analysts, and others will be able to understand why certain content is promoted or demoted. The truth is that even complete transparency into models still tells us little about their effects.

Predictions from machine-learning models vary based on the user, the content, and the circumstances. Those variations are broken into “features” that a machine-learning model can consume to make a prediction. Examples of features include: recent content a user’s consumed, how many of a user’s friends liked something, how often a user engaged with a certain person in the past, and the engagement per view of people in a user’s city.
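
Concretely, the features for a single (user, post) pair might be gathered into something like the following before being handed to a model; the feature names and values here are hypothetical:

```python
# Hypothetical feature set for one (user, post) pair; names and values are invented.
features = {
    "recent_content_categories": ["cooking", "travel", "pets"],  # what the user consumed lately
    "friends_who_liked": 4,               # how many of the user's friends liked this post
    "past_engagements_with_author": 12,   # how often the user engaged with this account before
    "city_engagement_per_view": 0.031,    # engagement per view among people in the user's city
}
# A trained model maps features like these to predictions such as
# P(like), P(comment), P(skip), and P(report).
```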

Modern algorithmic ranking models take into account millions of these features to spit out each prediction. Some models depend on numerous sub-models to aid them; some will be retrained in real time to adapt to shifting behavior. These algorithms are difficult to make sense of, even for the engineers working on them.

The size and sophistication of these models make it impossible to fully understand how they make predictions. They have billions of weights that interact in subtle ways to produce a final prediction; looking at them is like hoping to understand psychology by examining individual brain cells. Even in academic settings with well-established models, the science of interpretable models is still nascent. The few existing methods for interpreting them require access to the privacy-sensitive datasets used in training. Open sourcing algorithmic ranking models wouldn’t change that.

When does an experiment cause a “net-good” change?

Engineers like me measure predictive ability. Instead of seeking to understand the inner workings of algorithms, we experiment and observe their effects. Ranking teams (typically a mix of data scientists, engineers, product managers, and researchers) might have thousands of concurrent experiments (A/B tests) that each expose groups of people to variants of ranking algorithms and machine-learning models.
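
How do thousands of experiments run at once? A common pattern (simplified here; the experiment name and test fraction are made up) is to hash each user into a test or control group per experiment, so assignments stay stable across sessions and independent across experiments:

```python
# Simplified sketch of deterministic A/B assignment; the experiment name and
# test fraction are illustrative, not any platform's real configuration.
import hashlib


def assign_group(user_id: int, experiment: str, test_fraction: float = 0.01) -> str:
    """Place a user in 'test' or 'control' for one experiment, deterministically."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000   # a stable number in [0, 10000) per (user, experiment)
    return "test" if bucket < test_fraction * 10_000 else "control"


# The same user always lands in the same group for a given experiment:
print(assign_group(42, "feed_ranking_v2"))  # hypothetical experiment name
```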

The biggest question driving an experiment is whether a change is — to use a term I came up with — “net good” for the ecosystem. During the introduction of algorithmic ranking to Instagram users, we observed significant improvements in product interaction and insignificant changes in reported quality of experience. After a team decides an experiment causes a net-good change, as we did, it becomes the platform’s default user experience and subtly changes the content that hundreds of millions of people see every day.

Determining net good entails analyzing the effects of experiments through summary statistics about shifting user behavior and content distribution (i.e., which types of content get promoted and demoted). For example, a team can look at how often users check an app or “like” content, how much time they spend on the app per day or per session, how often someone says they are having a “5 out of 5” experience, whether “small” creators are favored over “large” ones, the prevalence of “political” content, and so on. These summary statistics, which easily number in the thousands, are produced by crunching enormous numbers of individual user actions: you are in the test group, you logged on at 3 p.m., you saw your best friend’s video and then liked it, you missed another post by a celebrity, etc. Teams look for statistically significant changes in those statistics between test and control groups.
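
As a toy example of that last step, here is how one summary statistic (say, the rate at which users “like” content) might be compared between test and control. The counts are invented, and real pipelines evaluate thousands of statistics with corrections for multiple comparisons:

```python
# Toy two-proportion z-test on an invented "like rate"; real analyses cover
# thousands of statistics and correct for multiple comparisons.
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(successes_a: int, n_a: int, successes_b: int, n_b: int):
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    return z, p_value


# Did the test group like content at a different rate than control?
z, p = two_proportion_z_test(successes_a=52_000, n_a=1_000_000,   # test group
                             successes_b=50_000, n_b=1_000_000)   # control group
print(f"z = {z:.2f}, p = {p:.4f}")  # a small p suggests a statistically significant shift
```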

Any well-functioning algorithmic ranking team has a methodology for deciding whether a change is net good compared to an established baseline. The methodology might be codified: Anything that increases the number of active users is net good. Or it might be judgment-based: If person X signs off after seeing summary statistics, it’s net good. Or it might be adversarial: If no team can find a problem, it’s net good. In practice, it might be a mixture of everything. 
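
A fully codified version of that first methodology could, in principle, be as simple as the following sketch; the metric names and thresholds are invented for illustration:

```python
# Deliberately simplified "net good" rule; metric names and thresholds are invented.
def is_net_good(deltas: dict) -> bool:
    """deltas maps summary-statistic names to their relative change vs. control."""
    grew_active_users = deltas.get("daily_active_users", 0.0) > 0.0
    guardrails_held = (
        deltas.get("reported_harmful_content", 0.0) <= 0.0        # must not increase
        and abs(deltas.get("small_creator_reach", 0.0)) < 0.01    # stays within 1% of control
    )
    return grew_active_users and guardrails_held


print(is_net_good({"daily_active_users": 0.004,
                   "reported_harmful_content": -0.01,
                   "small_creator_reach": 0.002}))  # True
```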

The calculus behind net good — not the micro-details of a particular ranking algorithm — determines if an experiment is successful. Experiments guide the success of ranking teams in a company. And the success of ranking teams guides how content is distributed for all platform users.

With net good being such a powerful designation, it makes sense to extend calls for open sourcing to experiments.

What open source means for experiments

The problem with our current system is that the people running experiments are the only ones who can study them. While there are good reasons for this, the people making ranking changes aren’t necessarily incentivized to find certain kinds of issues the way the broader community might be. (Indeed, this is something the open-source movement in software has historically been good at — i.e., relying on a community of engineers to spot problems and contribute improvements, in addition to the core developers working on the project.) By providing the community with more transparency about the experiments, the teams in charge of them can establish best practices for making decisions and reveal effects from experiments beyond what the team is studying. 

In open sourcing experiments, we need to balance two competing interests: keeping enough information proprietary for companies to keep innovating while disclosing enough to allow external understanding. It is not sufficient to say “open source all the data” — that’s an innovation and privacy nightmare. But it is possible to safely disclose more than companies do today. Disclosures could take place in two ways:

  1. Open-source methodology: What is the intent of ranking changes? What team goals and decision-making can safely be disclosed without harming company innovation?
  2. Open-source experimentation: What are the consequences of ranking changes? What information can be shared to allow third parties such as auditing agencies to examine the effects of ranking experiments without sacrificing user privacy? 

Disclosure itself doesn’t solve larger issues of incentives in algorithmic ranking. But it gives the broader community an informed basis to think about them, and it focuses research and attention on where it can have the most impact.

Open-source methodology

It’s important to remember that the big decision in algorithmic ranking is what constitutes a net-good change. Encouraging open-source methodology allows more insight into how such decisions are made and how platforms evaluate their content ecosystems. The data involved would already be summarized, which precludes concerns about violating individual privacy. The risks of disclosure, then, are primarily about competitive advantage and bad actors such as spam farms and coordinated attackers. To start, here are three types of information that would not be risky for a platform to share:

  • The general process for deciding if a new ranking variant is a net-good change.
  • Who, if anyone, has decision-making power on wider algorithm changes.
  • An explanation of summary statistics available in decision-making and evaluated in experiments.

A hypothetical disclosure involving that information might look like this: Each year, a platform’s executive team sets targets for engagement measures, plus secondary targets related to content quality. The ranking teams responsible for hitting the targets are allowed to run up to 1,000 experiments a year, each involving millions of users. A product manager is required to review the experiments before they begin and meets once a week with the responsible ranking teams to review the ongoing impact on the primary and secondary targets, along with any other effects that emerge as statistically significant, such as content shifts toward larger accounts or the prevalence of politically tagged content. The final decision regarding whether or not to ship an experiment lies with the executive team. The ranking teams measure the overall contribution of algorithm updates by maintaining one experiment that “holds back” all changes over the year.

That type of disclosure helps us understand how decisions are made at a company and could be documented in platform transparency centers and annual reports. More specific disclosures, which offer more useful insight into decision-making, are also more likely to run the risk of divulging company secrets. These types of disclosures would include more about the intent of summary statistics, such as:

  • Which summary statistics are desirable, which are undesirable, and which are used as guard-rails (and shouldn’t change).
  • Specific formulas used to evaluate whether a decision is net good.
  • Lists of all experiments with hypotheses, dates, and decisions made.

Whether this is too detailed for a disclosure is up for debate and depends on the particular circumstances and goals of each product. But returning to the Twitter example and the oft-discussed “spam” problem, here’s a hypothetical scenario describing a useful disclosure: Let’s say Twitter ran 10 experiments targeting decreased spam prevalence. Each experiment was intended to measure whether changing the predictor of “clicking into a tweet” would reduce the number of users seeing spam. In those experiments, decreased spam reports were considered a desirable outcome, decreased replies were undesirable, and the number of retweets was used as a guard-rail and expected to remain stable. Experiments one through five used larger, re-trained models predicting if a user would “click into a tweet.” Experiments six through 10 left the model unchanged but decreased the weight of click predictions in final ranking. The current production ranking model was used as the control group. All experiment variants began on May 20, involved experiment groups with 5 million users each, and ran for two weeks. Experiment seven, with a moderate decrease in weight, was approved by the product manager on June 10 and became the baseline experience.
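
Such a disclosure could even be published in machine-readable form. Here is a sketch of what the record for one of those hypothetical experiments might look like; the structure and field names are my own invention, not a real Twitter format:

```python
# Sketch of a machine-readable disclosure for one hypothetical experiment;
# the structure and field names are invented, not a real Twitter format.
experiment_disclosure = {
    "id": "spam-reduction-07",
    "hypothesis": "Lowering the weight of click predictions reduces spam exposure",
    "desirable": ["spam_reports_decrease"],
    "undesirable": ["replies_decrease"],
    "guardrails": ["retweets_stable"],
    "variant": {"click_prediction_weight": "moderately decreased"},
    "control": "current production ranking model",
    "start_date": "May 20",
    "duration_days": 14,
    "group_size": 5_000_000,
    "decision": {"shipped": True, "approved_by": "product manager", "date": "June 10"},
}
```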

A disclosure like that would help outsiders assess if Twitter is both actively trying to solve the spam problem and doing so with a sound approach. Transparency creates a risk of bad actors using information to adjust tactics, but it also holds ranking teams more accountable to their users and inspires more trust in how the user experience unfolds.

Open-source experimentation

While open-source methodology gives insight into a ranking team’s intent, it doesn’t allow external parties to understand the unintended consequences of ranking decisions. For that, we should examine open sourcing the experiment data itself.

Analyzing experiments requires access to confidential information that is available only to employees, such as individual user actions: “User A saw this video, watched it for 10 seconds, and then liked it.” Comparing summary statistics of this information between test and control groups lets the company understand the algorithmic changes it makes. The essential question in experimentation transparency is: How can we share experiment data more widely without sacrificing privacy?

The most transparent version of open-source experimentation entails disclosing the raw information: every single person’s action in every experiment ever run. With that, external parties could draw proper, scientific conclusions about user behavior and content shifts in social media. But that is a naive objective. Individual user actions are sensitive and personally revealing, and in some contexts they even put lives at risk. We should focus instead on achieving a level of transparency that doesn’t reveal sensitive information or violate consent but still enables other parties to study the results of experiments scientifically. There are several ways to approach this:

  • Limit the audience: Share raw experiment data with a small, trusted group outside the company, such as a set of third-party algorithmic auditors that could be bound by professional regulations.
  • Individual disclosure: Allow users to see every experiment they have been exposed to.
  • Individual opt-in: Mitigate some privacy concerns by allowing individuals to choose to disclose their actions to specific groups, such as by allowing opt-in into monitored academic studies through in-app mechanisms. 
  • Summarization: Publish less sensitive information by bucketing experiment data into cohorts (e.g., disclose shifts in content distribution toward larger accounts, videos, specific countries, etc.). 
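
To illustrate the last option, summarization, here is a sketch of how raw per-user actions might be collapsed into cohort-level statistics before disclosure; the data, cohort definitions, and field names are invented:

```python
# Sketch of summarization: raw per-user actions are collapsed into cohort-level
# statistics before disclosure. The data, cohorts, and field names are invented.
from collections import defaultdict

raw_actions = [
    {"group": "test", "author_followers": 250, "action": "like"},
    {"group": "test", "author_followers": 2_000_000, "action": "skip"},
    {"group": "control", "author_followers": 800, "action": "like"},
    # ...millions more rows, which are never disclosed directly
]


def summarize(actions):
    """Report like rates by experiment group and author-size cohort only."""
    likes, views = defaultdict(int), defaultdict(int)
    for a in actions:
        size = "small_account" if a["author_followers"] < 10_000 else "large_account"
        cohort = (a["group"], size)
        views[cohort] += 1
        likes[cohort] += a["action"] == "like"
    return {cohort: likes[cohort] / views[cohort] for cohort in views}


print(summarize(raw_actions))  # e.g., {('test', 'small_account'): 1.0, ...}
```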

These approaches all give the tools of analysis to people who don’t work at social platforms and thus aren’t bound by company incentives. If we revisit the multi-year experiment I led on introducing Instagram’s ranking algorithm, having fresh eyes on the experiment group could have brought new perspectives to concerns such as whether ranking causes a filter bubble, whether introducing ranking causes a shift toward more political accounts, and whether people post more harmful content as a result of ranking. Without access to data, we are all stuck with incorrect reasoning based on headlines and anecdotes.

***

Despite the prevalence of algorithmic ranking models, their inner workings are not well understood — nor is that the point. Companies analyze the effects of algorithms by running experiments to decide if the changes they cause are net good for their content ecosystems.

Today, external parties, including the users who engage with these products every day, have no way of drawing conclusions about what is net good because experiment data is private and decision-making methodology is not disclosed. That doesn’t need to be the case: It’s possible to open up more of the decision-making methodology while preserving the ability for companies to compete. Information about experiments can be disclosed in a way that allows external parties to draw conclusions without sacrificing privacy.

Transparency is a virtue in itself, but meaningful transparency is the better goal. Going forward, let’s focus on opening up experiments, not algorithms.