Scoring

Note that Trusty is an experimental service, and will be evolving rapidly; new features will be added on a continuous basis.

The Trusty Score is an overall measure of the safety and trustworthiness of an open source package.

Trusty uses several factors, broken out into individual scores, which are combined to render an overall Trusty Score. These factors are the ones that most effectively indicate whether a package is benign or malicious, based on statistical analysis of known malicious and “known good” packages.

Today, the individually calculated scores that contribute to a package’s overall Trusty Score include:

Repo Activity Score: A relative ranking of the level of activity within the package’s primary repository.
Author Activity Score: A relative aggregate rank of the top contributors to the repository.
Typosquatting: Indicates whether a package is likely to be a “typosquat.” Typosquatting is the practice in which malicious actors give their packages a name that closely resembles that of a reputable package, with the intention of tricking developers into installing the malicious package.
Repository affiliation (Shared Repositories): Indicates whether a package shares its source repository.
Proof of Origin (Provenance): Indicates the strength of the link between a published package and its source repository.
Malicious packages: If a package is known to be malicious (based on Datadog’s Malicious Software Packages dataset and OpenSSF’s Malicious Packages repo), it is automatically assigned a score of 1.

Scoring details

Trusty Score:

A package’s overall Trusty Score is an amalgamation of all the other metrics, and is presented as a score from 0 to 10. You can drill down into the Trusty Score by clicking on “View scoring details” to find out more about the individual scoring metrics.

The Trusty Score is used to:

Rank packages
Enforce policy decisions in Minder

The aggregated Trusty Score is the minimum of the other scores. We display the minimum rather than an average for the following reason: imagine you have a house with four closed doors and one open door, and you score a closed door as 10 and an open door as 0. It is more useful to know that you have an open door (0) than to know that your doors are mostly closed (an average of 8).
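The door analogy can be checked with a few lines of Python. This is only an illustrative sketch: the door names and the `aggregate_min` helper are hypothetical and are not part of Trusty.

```python
from statistics import mean

# Hypothetical per-door scores: four closed doors (10) and one open door (0).
doors = {"front": 10, "back": 10, "garage": 10, "side": 10, "cellar": 0}


def aggregate_min(scores):
    """Return the lowest individual score, so a single weak spot is never hidden."""
    return min(scores.values())


average_score = mean(doors.values())   # hides the open door
minimum_score = aggregate_min(doors)   # surfaces the open door

print(f"average={average_score}, minimum={minimum_score}")
```

The average suggests the house is mostly secure, while the minimum reports the one open door.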

Repo and Author Activity Scores:

Repo and Author Activity scores are calculated using Principal Component Analysis (PCA).

To calculate these scores, we gather a number of numeric features from the public GitHub repo:

stargazers_count
forks
open_issues
watchers
contributor_count

And from the contributors/authors:

public_repos
followers

We then “cut” these into bins of roughly equal population, and train a PCA model on them, giving us an even distribution of scores. These scores are then normalized into the 0-10 range.
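As a rough illustration of the binning and normalization steps only (the PCA step is omitted here), the sketch below cuts a skewed feature into bins of roughly equal population and rescales a list of values into the 0-10 range. The function names, the choice of ten bins, the sample data, and the use of min-max rescaling are all assumptions made for illustration; the text above does not specify them.

```python
from bisect import bisect_right


def quantile_edges(values, n_bins=10):
    """Return n_bins - 1 cut points that split values into roughly equal-sized bins."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // n_bins] for i in range(1, n_bins)]


def bin_index(value, edges):
    """Return the index (0 .. len(edges)) of the bin that value falls into."""
    return bisect_right(edges, value)


def rescale_0_10(values):
    """Min-max rescale values into the 0-10 range (an assumed normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [10 * (v - lo) / (hi - lo) for v in values]


# Hypothetical, heavily skewed feature (for example, stargazers_count).
stargazers = [i * i for i in range(100)]
edges = quantile_edges(stargazers)
binned = [bin_index(v, edges) for v in stargazers]
```

Although the raw values are skewed, each of the ten bins ends up with the same number of packages, which is what produces an even distribution downstream.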

The final activity score is the average of the contributor and repository scores.

  "activity": {
    "score": 7,
    "description": {
      "repo": 6.5,
      "user": 7.5
    },
    "updated_at": "2023-10-13 08:19:57.621460"
  },

Similar Names (Typosquatting)

Typosquatting is a practice in which malicious actors give their packages a slightly similar name to a reputable package, with the intention of tricking developers into installing a malicious package.

To determine whether a package may be a typosquat, we measure the Levenshtein distance between the name of each package we have indexed and the name of every other package. This gives us a list of packages whose names are very similar to the one being searched for. The aim is to uncover typosquatting attacks, but some names are legitimately similar.
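For readers unfamiliar with Levenshtein distance, here is a minimal sketch of the standard dynamic-programming algorithm, plus a helper that lists nearby names. The `similar_names` helper and its `max_distance` cutoff of 2 are hypothetical; the text above does not say what distance Trusty treats as "very similar".

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similar_names(target, names, max_distance=2):
    """Return the names within max_distance edits of target (assumed cutoff)."""
    return sorted(
        name for name in names
        if name != target and levenshtein(target, name) <= max_distance
    )


candidates = similar_names("requests", ["request", "requestz", "numpy", "requests"])
```

Comparing every indexed name against every other name is quadratic in the number of packages, so a real index would likely need some form of pre-filtering; that is outside the scope of this sketch.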

To help account for this, we score on the same 0-10 range as before:

We first filter the similar names to remove all those with low (< 5) scores, and then take the mean and standard deviation of the scores that remain.
If the target package scores above the mean plus one standard deviation, we give it a high (10) score and don't consider it a typosquatting attack.
If the target package scores below the mean plus one standard deviation, we give it a low (5) score and consider it a possible typosquatting attack.
If there are no similar names, we give it a high (10) score and don't consider it a typosquatting attack.
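The rules above can be sketched as a small function. Several details are assumptions not stated in the text: the function name `typosquat_score`, the use of the population standard deviation, treating a target exactly at the threshold as flagged, and treating "every similar name was filtered out" the same as "no similar names".

```python
from statistics import mean, pstdev


def typosquat_score(target_score, similar_scores, floor=5.0):
    """Return 10 (not flagged) or 5 (flagged) following the rules described above."""
    # Step 1: drop similar packages whose own score is low (< 5).
    kept = [score for score in similar_scores if score >= floor]

    # No similar names left to compare against: not considered a typosquat.
    if not kept:
        return 10

    # Step 2: threshold is the mean plus one (population) standard deviation.
    threshold = mean(kept) + pstdev(kept)

    # Step 3: above the threshold is fine; at or below it is flagged (assumed tie rule).
    return 10 if target_score > threshold else 5


well_known = typosquat_score(9.5, [7.5, 8.0, 2.0])  # kept = [7.5, 8.0], threshold = 8.0
suspicious = typosquat_score(4.0, [7.5, 8.0])
```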

It looks like this:

  "similar_package_names": {
    "score": 7,
    "description": {[
         {
          "package_name": "request",
          "summary": {
            "score": 7.5,
            "description": {},
            "updated_at": "2023-10-13 11:25:38.302740"
          },
          "activity": {
            "score": 7.5,
            "description": {
              "repo": 6.2,
              "user": 8.8
            },
            "updated_at": "2023-10-13 11:22:51.570008"
          }
        },
    …
    ]},
    "updated_at": "2023-10-13 08:19:57.621460"
  },

There are many legitimate reasons why a higher-scoring package with a similar name can exist, so this process can produce false positives. We rely on other factors, such as provenance, to mark these as OK.

Provenance

“Provenance,” or proof of origin, refers to the origin or source of something. In the context of software, it means understanding where code comes from, who wrote it, who built it, and how it has been altered over time. This understanding is essential for maintaining the integrity and security of software systems.

For provenance scoring, we assign the following scores:

A score of 10 indicates that the package was signed and built with Sigstore and GitHub Actions. All Go packages also have a Provenance score of 10, because Go packages are imported via their source code URLs, providing a verifiable link back to the source code.
A score of 8 indicates strong historical provenance mapping from the package to its source repo.
A score of 5 indicates that the source repo does not have any Git tags, so we are unable to determine any link from the source repo to the published package.
A score of 2 indicates that the Git tags in the package’s listed source repo do not match the published versions of that package on the package manager registry.
- This could indicate that the package is malicious, or it could indicate that Git tags are being used for purposes other than denoting new version releases.
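The score levels above could be expressed as a simple decision function. This is a sketch under stated assumptions, not Trusty's implementation: the function and parameter names are hypothetical, the order of the checks is a guess, "strong historical provenance mapping" is approximated as "every published version has a matching Git tag", and a leading "v" on a tag is ignored when comparing.

```python
def normalize_tag(tag):
    """Strip a leading 'v' so that 'v1.2.0' compares equal to '1.2.0' (assumption)."""
    return tag[1:] if tag.startswith("v") else tag


def provenance_score(ecosystem, sigstore_verified, repo_tags, published_versions):
    """Map provenance evidence to the 10 / 8 / 5 / 2 levels described above."""
    # Go packages, and packages signed and built with Sigstore and GitHub Actions.
    if ecosystem == "go" or sigstore_verified:
        return 10

    # No Git tags at all: no link can be established either way.
    if not repo_tags:
        return 5

    # Assumed stand-in for "strong historical provenance mapping":
    # every published version has a corresponding tag.
    tags = {normalize_tag(tag) for tag in repo_tags}
    if published_versions and set(published_versions) <= tags:
        return 8

    # Tags exist but do not line up with the published versions.
    return 2


matched = provenance_score("pypi", False, ["v1.0.0", "v1.1.0"], ["1.0.0", "1.1.0"])
mismatched = provenance_score("pypi", False, ["nightly"], ["1.0.0"])
```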

Package Alternatives

Trusty is not only able to assess the relative activity associated with a package you might consider using, but also recommends alternative packages from the community that offer similar capabilities. It relies on generative AI to recommend alternatives that we then rank (based on their Trusty scores) and present. It only presents alternatives that have positive community ranks. Our hope is that we help individuals avoid problematic packages (that are, for example, no longer actively being maintained) and instead focus on packages that have healthy and vibrant communities supporting them.

A Word on Transparency, Open Metrics and Playing Fair

Stacklok is committed to openness, transparency and a community centric model of operations. We built Trusty as a mechanism to showcase the importance of software proof-of-origin, and to create value for developers when such information is present. While the service is experimental, we believe it is important to 'show our hand' and explain our approach to our prospective community:

1. We believe in ‘finding truth in data’. Wherever practical we train impartial models on training sets of known good versus known compromised packages. We work hard to ensure that there are clear mathematical principles behind our scoring efforts. We will not fudge the numbers - not for ourselves (to make our own efforts look better) and not for anyone else. The numbers are the numbers. We respect that not everyone will agree with our approach, and we may well make mistakes that need to be corrected, but in the end we hope that everyone will agree that we are playing fair.
2. We believe in moving quickly and experimenting. We are proud of what we have built, but we can see some really cool directions we want to take things in to make it more useful to the community. We started with Principal Component Analysis, but are exploring a lot of different approaches to modeling. With that in mind, Trusty is being shipped with an Experimental tag. We want you to kick the tires and let us know what you think, and most importantly we want your feedback to make it more useful to you.
3. Transparency is critical. We do not believe it is okay to tell someone a package doesn’t look good without showing them what we consider good to be. We are going to strive to be transparent in how we generate metrics, but we also caution that the math is already a little complicated and likely to get a lot more complicated as we go. Also note that our appetite for transparency is somewhat at odds with point #2 (above), our need to evolve and get better quickly. We will do our best, but recognize that we will be moving quickly too.
4. We are deliberately separating Trusty and Minder. Minder is open source tooling intended to help communities build more securely. It is intended to support communities and help them achieve better operating postures, and thence better Trusty scores. This will never be ‘pay for play’. You don’t have to use Minder to get better Trusty scores. It is just a tool to help.

Notes

1: Repo and author activity scores have a linear ranking system. A score of ‘5’ is the median, and ‘9’ would be the 90th percentile (i.e. better than 90% of known packages). A score of 0 to 1 represents the lowest 10% of packages by activity level.
