Showing posts with label aggregation framework. Show all posts
Showing posts with label aggregation framework. Show all posts

Monday, February 15, 2016

Getting distinct values from MongoDB

MongoDB has a function called distinct that allows you get a list of all the distinct values in the given collection for a single key. Let's say that you wanted to see all of the distinct values for the key OffenseTeam in a collection called pbp_2015. You would use the following statement:

db.pbp_2015.distinct("OffenseTeam")

The result of this statement is an array with each distinct value for the given key. Your results would look something like this:

[
  "",
  "WAS",
  "BUF",
  "CLE",
  "NYJ",
  ⋮
  "JAC",
  "MIA",
  "SEA",
  "GB",
  "IND",
  "STL"
]

Note: results truncated for brevity.

You can also provide a query to this function to view the distinct values on a subset of the data.
db.pbp_2015.distinct("OffenseTeam", {"PlayType": "FIELD GOAL", "Quarter": 4, "Down": {"$ne": 4}})

The above shows us which teams attempted to kick a field goal in the fourth quarter of a game, even if it wasn't their final down. The results show that there were a total of 9 teams that did this in the 2015 football year.

[
  "JAC",
  "BAL",
  "NO",
  "TB",
  "CHI",
  "KC",
  "OAK",
  "NYJ",
  "MIN"
]

While this is great for getting distinct values for a single field, what do you do if you need to get a list of the distinct combination of two or more fields?

One option is to use the aggregation framework to get this information.

The following is an example:

db.pbp_2015.aggregate(
    [
        {
            "$match": {
                "PlayType": "FIELD GOAL",
                "Down": {"$ne": 4},
                "Quarter": 4
            }
        },
        {
            "$project": {
                "OffenseTeam": 1,
                "DefenseTeam": 1,
                "_id": 0
            }
        },
        {
            "$group": {
                "_id": {
                    "OffenseTeam": "$OffenseTeam",
                    "DefenseTeam": "$DefenseTeam"
                }
            }
        },
        {
            "$sort": {
                "_id": 1
            }
        }
    ]
)

The above aggregation is similar to the prior distinct command, you can see that we have the same match criteria ({"PlayType": "FIELD GOAL", "Down": {"$ne": 4}, "Quarter": 4}). This time however we want to see not only the offensive team, but the defensive team as well. Since those are the only fields we care about, we will $project them out. Next we will group on OffsenseTeam and DefenseTeam by combining them into compound _id key to get our list. Finally we'll sort documents on _id to make it easier to read the data.

The results look like the following:

{
  "waitedMS": NumberLong("0"),
  "result": [
    {"_id": {"OffenseTeam": "BAL", "DefenseTeam": "PIT"}},
    {"_id": {"OffenseTeam": "BAL", "DefenseTeam": "SD"}},
    {"_id": {"OffenseTeam": "BAL", "DefenseTeam": "STL"}},
    {"_id": {"OffenseTeam": "CHI", "DefenseTeam": "DET"}},
    {"_id": {"OffenseTeam": "CHI", "DefenseTeam": "OAK"}},
    {"_id": {"OffenseTeam": "JAC", "DefenseTeam": "BAL"}},
    {"_id": {"OffenseTeam": "JAC", "DefenseTeam": "IND"}},
    {"_id": {"OffenseTeam": "KC", "DefenseTeam": "CHI"}},
    {"_id": {"OffenseTeam": "MIN", "DefenseTeam": "CHI"}},
    {"_id": {"OffenseTeam": "NO", "DefenseTeam": "ATL"}},
    {"_id": {"OffenseTeam": "NO", "DefenseTeam": "DAL"}},
    {"_id": {"OffenseTeam": "NO", "DefenseTeam": "NYG"}},
    {"_id": {"OffenseTeam": "NYJ", "DefenseTeam": "NE"}},
    {"_id": {"OffenseTeam": "OAK", "DefenseTeam": "DEN"}},
    {"_id": {"OffenseTeam": "TB", "DefenseTeam": "HOU"}}
  ],
  "ok": 1
}

Here you can see that while there were 9 teams that kicked a field goal in the fourth quarter even though they weren't facing a fourth down at the time, several teams did this in multiple games (BAL, CHI, JAC, and NO).

As you can see, you can use the aggregation framework to get the distinct values over multiple keys. This information can be used for a variety of purposes such as the case we used it above. Another use would be to use this to determine the selectivity of keys in an index.

Data for the samples can be found here. The data was cleaned up using code from this MongoDB blog post.

Saturday, April 05, 2014

MongoDB Aggregation for Justin John Mathews - Part 2

Recently on LinkedIn (and as a follow up to a previous post, Justin John Mathews asked a question about being able to return data in a certain format. The input and output records are below. Input
db.users.insert({ courseId: 1, stDt: new Date(2014, 01, 01), endDt: new Date(2014, 01, 20), active: false });
db.users.insert({ courseId: 1, stDt: new Date(2014, 01, 25), endDt: new Date(2014, 02, 10), active: false });
db.users.insert({ courseId: 1, stDt: new Date(2014, 02, 25), endDt: new Date(2014, 03, 10), active: true });
db.users.insert({ courseId: 1, stDt: new Date(2014, 02, 28), endDt: new Date(2014, 06, 10), active: true });
db.users.insert({ courseId: 1, stDt: new Date(2014, 02, 25), endDt: new Date(2014, 02, 30), active: false });
db.users.insert({ courseId: 1, stDt: new Date(2014, 05, 25), endDt: new Date(2014, 10, 30), active: false });
db.users.insert({ courseId: 1, stDt: new Date(2013, 10, 01), endDt: new Date(2014, 08, 10), active: true });
db.users.insert({ courseId: 1, stDt: new Date(2014, 09, 01), endDt: new Date(2014, 11, 30), active: false });
Output (this is my interpretation as the format he gave was not a valid document)
{
    "result" : {
        "past" : [
            { "endDt" : ISODate("2014-02-20T07:00:00Z"), "active" : false },
            { "endDt" : ISODate("2014-03-10T06:00:00Z"), "active" : false },
            { "endDt" : ISODate("2014-03-30T06:00:00Z"), "active" : false }
        ],
        "current" : [
            { "endDt" : ISODate("2014-04-10T06:00:00Z"), "active" : true },
            { "endDt" : ISODate("2014-07-10T06:00:00Z"), "active" : true },
            { "endDt" : ISODate("2014-09-10T06:00:00Z"), "active" : true }
        ],
        "upcoming": [
            { "endDt" : ISODate("2014-11-30T07:00:00Z"), "active" : false },
            { "endDt" : ISODate("2014-12-30T07:00:00Z"), "active" : false }
        ]
    }
}

As you can see the dates are broken up into three different categories of past, current and upcoming. After talking with Justin, he would like to see the stDt value as well and he doesn't need the active field any longer. Each category is sorted by the endDt. A course is past if the endDt is less than the current date. It is active if the stDt is less than the current date and the endDt is greater than the current date. It is upcoming if the stDt is greater than the current date.

This is the code that I came up with. It does most of what he's looking for, but no matter what I did, I couldn't get the final grouping to display in the sorted order he was looking for in all categories. I'm sure I'm missing something simple, but when printing out the sorted list before the final grouping everything looks fine, but for some reason it's not getting added to the array in the order that I would imagine it should have been.

db.users.aggregate([
    {
        "$project": {
            "range": {
                "stDt": "$stDt",
                "endDt": "$endDt"
            },
            "_id": 0
        }
    },
    {
        "$group": {
            "_id": null,
            "past": {
                "$addToSet": {
                    "$cond": [
                        {"$lt": ["$range.endDt", new Date()]},
                        "$range",
                        null
                    ]
                }
            },
            "current": {
                "$addToSet": {
                    "$cond": [
                        {
                            "$and": [
                                {"$gt": ["$range.endDt", new Date()]},
                                {"$lt": ["$range.stDt", new Date()]},
                            ]
                        },
                        "$range",
                        null
                    ]
                }
            },
            "upcoming": {
                "$addToSet": {
                    "$cond": [
                        {"$gt": ["$range.stDt", new Date()]},
                        "$range",
                        null
                    ]
                }
            }
        }
    },
    {"$unwind": "$past"},
    {"$unwind": "$current"},
    {"$unwind": "$upcoming"},
    {
        "$match": {
            "past": {"$ne": null},
            "current": {"$ne": null},
            "upcoming": {"$ne": null}
        }
    },
    {
        "$sort": {
            "past.endDt": -1,
            "current.endDt": -1,
            "upcoming.endDt": -1
        }
    },
    {
        "$group": {
            "_id": null,
            "past": {"$addToSet": "$past"},
            "current": {"$addToSet": "$current"},
            "upcoming": {"$addToSet": "$upcoming"}
        }
    },
    {
        "$project": {
            "past": 1,
            "current": 1,
            "upcoming": 1,
            "_id": 0
        }
    },
])
And here is the output you get in MongoDB 2.2.x or 2.4.x.
{
    "result" : [
        {
            "past" : [
                {
                    "stDt" : ISODate("2014-02-25T07:00:00Z"),
                    "endDt" : ISODate("2014-03-10T06:00:00Z")
                },
                {
                    "stDt" : ISODate("2014-02-01T07:00:00Z"),
                    "endDt" : ISODate("2014-02-20T07:00:00Z")
                },
                {
                    "stDt" : ISODate("2014-03-25T06:00:00Z"),
                    "endDt" : ISODate("2014-03-30T06:00:00Z")
                }
            ],
            "current" : [
                {
                    "stDt" : ISODate("2014-03-25T06:00:00Z"),
                    "endDt" : ISODate("2014-04-10T06:00:00Z")
                },
                {
                    "stDt" : ISODate("2014-03-28T06:00:00Z"),
                    "endDt" : ISODate("2014-07-10T06:00:00Z")
                },
                {
                    "stDt" : ISODate("2013-11-01T06:00:00Z"),
                    "endDt" : ISODate("2014-09-10T06:00:00Z")
                }
            ],
            "upcoming" : [
                {
                    "stDt" : ISODate("2014-06-25T06:00:00Z"),
                    "endDt" : ISODate("2014-11-30T07:00:00Z")
                },
                {
                    "stDt" : ISODate("2014-10-01T06:00:00Z"),
                    "endDt" : ISODate("2014-12-30T07:00:00Z")
                }
            ]
        }
    ],
    "ok" : 1
}

This works well for the small test set you gave, but it might not be that great for larger datasets. You'll need to test that before putting this code into production. If nothing else this gives you an idea of how you can do things with the aggregation framework.

Tuesday, April 01, 2014

MongoDB Aggregation Query for Justin John Mathews

Recently on LinkedIn, Justin John Mathews asked a question about being able to return data in a certain format. The input and output records are below. Input
db.users.insert({ courseId: 1, stDt: new Date(2014, 01, 01), endDt: new Date(2014, 01, 20), active: false });
db.users.insert({ courseId: 1, stDt: new Date(2014, 01, 25), endDt: new Date(2014, 02, 10), active: false });
db.users.insert({ courseId: 1, stDt: new Date(2014, 02, 25), endDt: new Date(2014, 03, 10), active: true });
db.users.insert({ courseId: 1, stDt: new Date(2014, 02, 28), endDt: new Date(2014, 06, 10), active: true });
db.users.insert({ courseId: 1, stDt: new Date(2014, 02, 25), endDt: new Date(2014, 02, 30), active: false });
db.users.insert({ courseId: 1, stDt: new Date(2014, 05, 25), endDt: new Date(2014, 10, 30), active: false });
db.users.insert({ courseId: 1, stDt: new Date(2013, 10, 01), endDt: new Date(2014, 08, 10), active: true });
db.users.insert({ courseId: 1, stDt: new Date(2014, 09, 01), endDt: new Date(2014, 11, 30), active: false });
Output
{
    "result" : {
        true : [
            { endDt: new Date(2014, 03, 10), active: true },
            { endDt: new Date(2014, 06, 10), active: true },
            { endDt: new Date(2014, 08, 10), active: true },
        ],
        false: [
            { endDt: new Date(2014, 01, 20), active: false },
            { endDt: new Date(2014, 02, 10), active: false },
            { endDt: new Date(2014, 02, 30), active: false },
            { endDt: new Date(2014, 10, 30), active: false },
            { endDt: new Date(2014, 11, 30), active: false }
        ]
    }
}
The code below is one possible solution. I'm not necessarily happy with everything yet, but it does produce the same results as requested, with two small exceptions. First, the code below displays the ISODate instead of the JavaScript date method which I'm assuming is what should have been in the request as you can't return what's shown above. The other small difference is that since I'm using the aggregation framework with MongoDB 2.4.9, the result is a key of result which contains an array of subdocuments as it's value. There were quite a bit of hoops I jumped through to get this solution, and it could probably be cleaned up quite a bit, but I thought I would throw this post together to let him see that it can be done with a single command if you're patient enough to keep tweaking your pipeline. ;)
db.users.aggregate([

    // Project only what we need.
    {
        "$project": {
            "endDt": 1,
            "active": 1,
            "_id": 0
        }
    },

    // Group to true and false buckets. This will give us arrays with null.
    // We'll remove them in a bit.
    {
        "$group": {
            "_id": "$active",
            "true": {
                "$addToSet": {
                    "$cond": [
                        {"$eq": ["$active", true]},
                        "$endDt",
                        null
                    ]
                }
            },
            "false": {
                "$addToSet": {
                    "$cond": [
                        {"$eq": ["$active", false]},
                        "$endDt",
                        null
                    ]
                }
            }
        }
    },

    // We need to unwind our arrays so we can build them back up without the
    // "null" array.
    {"$unwind": "$true"},
    {"$unwind": "$false"},

    // Project out the values. This will give both a true and false key for
    // each item. This builds our arrays up with the proper endDt and
    // active values.
    {
        "$project": {
            "true": {"endDt": "$true", "active": "$_id"}, 
            "false": {"endDt": "$false", "active": "$_id"}
        }
    },

    // Project out a single value to clean up the "issue" a couple steps above.
    {
        "$project": {
            "value": {
                "$cond": [
                    {"$eq": ["$_id", true]},
                    "$true",
                    "$false"
                ]
            }
        }
    },

    // Group things up again to rebuild our arrays.
    // This adds a single "null" entry that will need to be cleaned up.
    {
        "$group": {
            "_id": null,
            "true": {
                "$addToSet": {
                     "$cond": [
                         {"$eq": ["$_id", true]},
                         "$value",
                         null
                     ]
                }
            },
            "false": {
                "$addToSet": {
                    "$cond": [
                        {"$eq": ["$_id", false]},
                        "$value",
                        null
                    ]
                }
            }
        }
    },

    // Unwind our arrays again so we can clean up one more time.
    {"$unwind": "$true"},
    {"$unwind": "$false"},

    // Match only documents where true and false are not null.
    {
        "$match": {
            "true": {"$ne": null},
            "false": {"$ne": null}
        }
    },

    // Sort our items so we can add the to the array in the correct order.
    // I'm not sure why it has to be descending order, but it works.
    {
        "$sort": {
            "true.endDt": -1,
            "false.endDt": -1
        }
    },

    // Group again to build our array.
    {
        "$group": {
            "_id": null,
            "true": {"$addToSet": "$true"},
            "false": {"$addToSet": "$false"}
        }
    },

    // Once again project out just the fields we need
    {
        "$project": {
            "true": 1,
            "false": 1,
            "_id": 0
        }
    }
])
Here are the returned results in MongoDB 2.2.x - 2.4.x:
{
    "result" : [
        {
            "true" : [
                {
                    "endDt" : ISODate("2014-04-10T06:00:00Z"),
                    "active" : true
                },
                {
                    "endDt" : ISODate("2014-07-10T06:00:00Z"),
                    "active" : true
                },
                {
                    "endDt" : ISODate("2014-09-10T06:00:00Z"),
                    "active" : true
                }
            ],
            "false" : [
                {
                    "endDt" : ISODate("2014-02-20T07:00:00Z"),
                    "active" : false
                },
                {
                    "endDt" : ISODate("2014-03-10T06:00:00Z"),
                    "active" : false
                },
                {
                    "endDt" : ISODate("2014-03-30T06:00:00Z"),
                    "active" : false
                },
                {
                    "endDt" : ISODate("2014-11-30T07:00:00Z"),
                    "active" : false
                },
                {
                    "endDt" : ISODate("2014-12-30T07:00:00Z"),
                    "active" : false
                }
            ]
        }
    ],
    "ok" : 1
}
And in the soon to be released version 2.6.0, the results look like the following:
{
    "true" : [
        {
            "endDt" : ISODate("2014-04-10T06:00:00Z"),
            "active" : true
        },
        {
            "endDt" : ISODate("2014-07-10T06:00:00Z"),
            "active" : true
        },
        {
            "endDt" : ISODate("2014-09-10T06:00:00Z"),
            "active" : true
        }
    ],
    "false" : [
        {
            "endDt" : ISODate("2014-02-20T07:00:00Z"),
            "active" : false
        },
        {
            "endDt" : ISODate("2014-03-10T06:00:00Z"),
            "active" : false
        },
        {
            "endDt" : ISODate("2014-03-30T06:00:00Z"),
            "active" : false
        },
        {
            "endDt" : ISODate("2014-11-30T07:00:00Z"),
            "active" : false
        },
        {
            "endDt" : ISODate("2014-12-30T07:00:00Z"),
            "active" : false
        }
    ]
}