Monday, February 15, 2016

Getting distinct values from MongoDB

MongoDB has a function called distinct that allows you get a list of all the distinct values in the given collection for a single key. Let's say that you wanted to see all of the distinct values for the key OffenseTeam in a collection called pbp_2015. You would use the following statement:

db.pbp_2015.distinct("OffenseTeam")

The result of this statement is an array with each distinct value for the given key. Your results would look something like this:

[
  "",
  "WAS",
  "BUF",
  "CLE",
  "NYJ",
  ⋮
  "JAC",
  "MIA",
  "SEA",
  "GB",
  "IND",
  "STL"
]

Note: results truncated for brevity.

You can also provide a query to this function to view the distinct values on a subset of the data.
db.pbp_2015.distinct("OffenseTeam", {"PlayType": "FIELD GOAL", "Quarter": 4, "Down": {"$ne": 4}})

The above shows us which teams attempted to kick a field goal in the fourth quarter of a game, even if it wasn't their final down. The results show that there were a total of 9 teams that did this in the 2015 football year.

[
  "JAC",
  "BAL",
  "NO",
  "TB",
  "CHI",
  "KC",
  "OAK",
  "NYJ",
  "MIN"
]

While this is great for getting distinct values for a single field, what do you do if you need to get a list of the distinct combination of two or more fields?

One option is to use the aggregation framework to get this information.

The following is an example:

db.pbp_2015.aggregate(
    [
        {
            "$match": {
                "PlayType": "FIELD GOAL",
                "Down": {"$ne": 4},
                "Quarter": 4
            }
        },
        {
            "$project": {
                "OffenseTeam": 1,
                "DefenseTeam": 1,
                "_id": 0
            }
        },
        {
            "$group": {
                "_id": {
                    "OffenseTeam": "$OffenseTeam",
                    "DefenseTeam": "$DefenseTeam"
                }
            }
        },
        {
            "$sort": {
                "_id": 1
            }
        }
    ]
)

The above aggregation is similar to the prior distinct command, you can see that we have the same match criteria ({"PlayType": "FIELD GOAL", "Down": {"$ne": 4}, "Quarter": 4}). This time however we want to see not only the offensive team, but the defensive team as well. Since those are the only fields we care about, we will $project them out. Next we will group on OffsenseTeam and DefenseTeam by combining them into compound _id key to get our list. Finally we'll sort documents on _id to make it easier to read the data.

The results look like the following:

{
  "waitedMS": NumberLong("0"),
  "result": [
    {"_id": {"OffenseTeam": "BAL", "DefenseTeam": "PIT"}},
    {"_id": {"OffenseTeam": "BAL", "DefenseTeam": "SD"}},
    {"_id": {"OffenseTeam": "BAL", "DefenseTeam": "STL"}},
    {"_id": {"OffenseTeam": "CHI", "DefenseTeam": "DET"}},
    {"_id": {"OffenseTeam": "CHI", "DefenseTeam": "OAK"}},
    {"_id": {"OffenseTeam": "JAC", "DefenseTeam": "BAL"}},
    {"_id": {"OffenseTeam": "JAC", "DefenseTeam": "IND"}},
    {"_id": {"OffenseTeam": "KC", "DefenseTeam": "CHI"}},
    {"_id": {"OffenseTeam": "MIN", "DefenseTeam": "CHI"}},
    {"_id": {"OffenseTeam": "NO", "DefenseTeam": "ATL"}},
    {"_id": {"OffenseTeam": "NO", "DefenseTeam": "DAL"}},
    {"_id": {"OffenseTeam": "NO", "DefenseTeam": "NYG"}},
    {"_id": {"OffenseTeam": "NYJ", "DefenseTeam": "NE"}},
    {"_id": {"OffenseTeam": "OAK", "DefenseTeam": "DEN"}},
    {"_id": {"OffenseTeam": "TB", "DefenseTeam": "HOU"}}
  ],
  "ok": 1
}

Here you can see that while there were 9 teams that kicked a field goal in the fourth quarter even though they weren't facing a fourth down at the time, several teams did this in multiple games (BAL, CHI, JAC, and NO).

As you can see, you can use the aggregation framework to get the distinct values over multiple keys. This information can be used for a variety of purposes such as the case we used it above. Another use would be to use this to determine the selectivity of keys in an index.

Data for the samples can be found here. The data was cleaned up using code from this MongoDB blog post.