class: title middle left

# The edX Modulestore

.subtitle[Storing edX Courses in MongoDB]

.thirdtitle[Boston MongoDB User Group: October 19th, 2015]

.fourthtitle[edX Lunch 'n Learn: November 9th, 2017]

---

# Who am I?

**Julia Eskew**

Principal Software Engineer - edX Core Platform Team

---

# Talk Summary

* Introduction to edX
* edX Modulestore
* Problems Encountered

---

# What is edX?

This is edX...

(Thanks, [nedbat!](https://github.com/nedbat))

---

# MongoDB University

* Powered by Open edX
* [https://university.mongodb.com/](https://university.mongodb.com/)

---

# The edX codebase

This is the edX codebase.
---

.float-left.scale75[]

---

# edX and MongoDB

Open edX platform usage of MongoDB

--

* Forums

--

  Ruby application integrated into the platform

--

  (I won't cover it in this talk.)

--

* Course assets

--

  Uses GridFS to store binary assets.

--

* Courses

--

  Via the edX modulestore.

---

class: center

# What is a "modulestore"?

An interface to create and query course content, backed by storage.

The storage used by edX:

.float-left[]

---

class: center, middle

**Why does edX store courses in MongoDB?**

---

# edX History

--

- First course: MIT 6.002x

--

- Content was coded directly into XML

--

- Here's an XML course [example](assets/simple_course.xml).

--

- Subsequent courses were XML-based courses

--

- All XML courses loaded into memory at startup

--

- Works fine for tens of courses - but not 100s...

---

# XML course problems

- Up until late 2015, edX still served all the original 41 XML courses.

--

- Each process loaded each XML course into memory upon startup.

--

- Each process consumed 745 MB of RAM.

--

  (edx.org runs a few hundred processes.)

--

- Each process took around 10 minutes from startup until servicing users.

--

  Why? Because loading all the XML courses took that long.

--

  Without loading the XML courses, startup takes only 30 seconds.

---

# Why MongoDB?

The thinking:

--

We need to load/unload course parts on-demand!

--

So we'll use a database.

--

Hmmm - we've got course information with a varied schema.

--

It'd be nice to store the information without defining a schema.

--

And to be able to query the course information regardless of the schema.

--

Maybe MongoDB?

---

# Aside: How does MongoDB store info?

--

- Documents

--

  Free-form JSON objects

--

- Collections

--

  Named collection of documents

--

  Roughly equal to an RDBMS table...

--

```javascript
db.inventory.insertOne(
   { item: "canvas", qty: 100, tags: ["cotton"], size: { h: 28, w: 35.5, uom: "cm" } }
)
```

---

### edX Course

.float-left.scale50[]

Basically a directed acyclic graph (DAG)...
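
To make the DAG idea concrete, here is a minimal sketch of a course as a map of blocks with child pointers, plus a traversal that collects every block of one type. The block IDs, the `type` / `children` field names, and the `findBlocksOfType` helper are all made up for illustration; they are not the edX schema or API.

```javascript
// Hypothetical course graph: IDs and field names are illustrative only.
const course = {
  course:       { type: "course",   children: ["chapter1", "chapter2"] },
  chapter1:     { type: "chapter",  children: ["vert1"] },
  chapter2:     { type: "chapter",  children: ["vert2"] },
  vert1:        { type: "vertical", children: ["video_intro", "html1"] },
  // video_intro is shared by two parents, which makes this a DAG, not a tree.
  vert2:        { type: "vertical", children: ["video_intro", "problem1", "video_wrapup"] },
  video_intro:  { type: "video",    children: [] },
  html1:        { type: "html",     children: [] },
  problem1:     { type: "problem",  children: [] },
  video_wrapup: { type: "video",    children: [] },
};

// Depth-first walk from rootId, returning IDs of blocks whose type matches.
function findBlocksOfType(blocks, rootId, wantedType) {
  const found = [];
  const seen = new Set();      // guards against visiting a shared block twice
  const stack = [rootId];
  while (stack.length > 0) {
    const id = stack.pop();
    if (seen.has(id)) continue;
    seen.add(id);
    const block = blocks[id];
    if (block.type === wantedType) found.push(id);
    // Push children in reverse so they are visited in their listed order.
    for (let i = block.children.length - 1; i >= 0; i--) {
      stack.push(block.children[i]);
    }
  }
  return found;
}

const videos = findBlocksOfType(course, "course", "video");
console.log(videos);
```

The `seen` set is what makes the walk DAG-safe: `video_intro` is reachable from two parents but is reported only once.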
---

# Draft Modulestore

--

a.k.a. "Old Mongo Modulestore".

--

* Stores course modules (XModules).

--

* On-demand loading of course modules.

--

* Single collection to hold all courses.

--

* Each document is a course module/block.

---

# Draft Modulestore

Let's look at some documents:

* [Video Module](assets/old_mongo_video_doc.txt)
* [HTML Module](assets/old_mongo_html_doc.txt)
* [Problem Module](assets/old_mongo_problem_doc.txt)

---

# Draft Modulestore Issues

--

* No (1st class) re-run support

--

  Shipped without the concept of the same course re-running in multiple semesters/terms.

--

  Significant work to change it and migrate existing courses.

--

  So the work was never done.

--

  Course re-runs supported by using a different "course" name.

---

# Draft Modulestore Issues

--

* No versioning

--

  A single draft branch and a single published branch

--

  Suppose you want to publish several course blocks.

--

  Those blocks are copied directly from the draft branch to the published branch.

--

  Non-atomic operation to live course material.

--

  Last published course version is forever lost.

--

  (Hindsight is 20/20)...

---

# Draft Modulestore Issues

--

Optimal for some query patterns

--

Non-optimal for other query patterns

---

# Draft Modulestore Issues

--

"Return a list of all video modules in a course."

--

Algorithm:

--

* Start at the root course node.

--

* Load all children of each node.

--

* If child is video node, add to list and stop.

--

* Else: repeat.

---

# Draft Modulestore Issues

"Return a list of all video modules in a course."

Nearly all course modules must be queried.

--

Many MongoDB "find()" queries, each returning only a small number of documents.

--

Sure - blocks can be cached (and they are).

--

But caching adds complexity.

---

class: center, middle

**Time to re-design...**

---

# Draft Versioning Modulestore

a.k.a. Split Mongo Modulestore

--

* Optimizes for frequent course traversal

--

* Adds versioning

--

* SPLITs out course structure from course content

---

# Split Modulestore

--

Three different collections:

--

* modulestore.active_versions
* modulestore.structures
* modulestore.definitions

---

.float-left[]

---

# modulestore.structures

--

Each document is an entire course structure.

--

_Immutable_ once created!

--

An example [structure](assets/split_structure.json) document

---

# modulestore.definitions

--

Contains _only_ the content of a course block

--

No reference to a specific course

--

No reference to content's location within a course

--

An example [definition](assets/split_definition.txt) document

---

# modulestore.active_versions

--

One document per course

--

Points to course structures

--

* One draft branch structure

--

* One published branch structure

--

An example [active_versions](assets/split_active_versions.txt) document

---

# Split Modulestore Query Example

--

"Return a list of all video modules in a course."

--

One query for the active_versions document by course id.

--

One query for the course structure document by _id.

--

Traverse the in-memory structure, finding all video modules.

---

# Split Modulestore Course Editing - Details

Suppose you edit a course - what happens?

--

1. version_structure()

--

   * Queries existing structure from MongoDB

--

   * Deep-copies the existing structure

--

   * Updates its edit_info to tie it to the previous structure.

---

.float-left[]

---

# Split Modulestore Course Editing - Details

1. version_structure()
2. Edit the new structure

--

   * Adding/deleting blocks

--

   or

   * Adding/deleting metadata

--

   or

   * Changing structure

---

.float-left[]

---

# Split Modulestore Course Editing - Details

1. version_structure()
2. Edit the new structure
3. update_structure()

--

   * Saves the new structure to MongoDB

---

.float-left[]

---

# Split Modulestore Course Editing - Details

1. version_structure()
2. Edit the new structure
3. update_structure()
4. update_course_index()

--

   * Updates the course index in the active_versions collection with the new structure ID

---

.float-left[]

---

# Problems We've Encountered

--

Structures for large courses have become "large".

--

* Over 1 MB.

--

* Due to some module data being stored there

--

  (...which doesn't actually need to be there.)

--

* Structures were designed to be "small" and queried often.

--

* Not a MongoDB-specific issue - but...

---

# Problems We've Encountered

--

An old problem first...

--

Externally-hosted MongoDB - compose.io

--

* No (easy) access to logs.

--

* Hard / impossible to tune for our usage.

--

* Large structures and many structure queries at peak loads caused performance problems, primarily NAT network overload.

--

* To fix performance issues, we pickle, compress, and cache course structures in memcache.

---

# Problems We've Encountered

* Graph of NAT traffic before and after that optimization was deployed.

.float-center[]

---

# An aside: Block Structure Transformers

--

* Added to edx-platform last year.

--

* Caches course XBlocks - in a sense.

--

* But goes beyond caching into "pre-building" XBlocks.

--

* Reduces both:

--

  MongoDB queries

--

  XBlock instantiation processing time

---

# Problems We've Encountered

Split structure version storage

--

* At launch, we had no plan for archiving old structure / definition versions.

--

* We'd deal with it "later".

--

* As stated before, structures became "large".

--

* Every course edit copies the existing structure and creates a new one.

--

* After several months, storage became an issue.

---

# Problems We've Encountered

Split structure version storage

* MongoDB Usage Numbers (as of Oct 2015)

| Collection | Documents | Size |
|---|---|---|
| modulestore.active_versions | 1,033 | 680 KB |
| modulestore.definitions | 692,202 | 3.47 GB |
| modulestore.structures | 908,948 | 437 GB |

---

# Problems We've Encountered

Split structure version storage

* MongoDB Usage Numbers (post-upgrade: MongoDB 3.0 with WiredTiger)

| Collection | Documents | Size | WiredTiger Size | Space Reduction |
|---|---|---|---|---|
| modulestore.active_versions | 1,033 | 680 KB | 274 KB | 2.5x |
| modulestore.definitions | 692,202 | 3.47 GB | 1.1 GB | 3.2x |
| modulestore.structures | 908,948 | 437 GB | 109 GB | 4x |

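
The space-reduction column is simply the original size divided by the WiredTiger size. A quick sanity check of that arithmetic, using only the numbers from the table (the variable and field names are just for illustration):

```javascript
// Sizes copied from the table above; each pair uses the same unit.
const sizes = [
  { collection: "modulestore.active_versions", before: 680,  after: 274 }, // KB
  { collection: "modulestore.definitions",     before: 3.47, after: 1.1 }, // GB
  { collection: "modulestore.structures",      before: 437,  after: 109 }, // GB
];

// Reduction factor = size before / size after, rounded to one decimal place.
const reductions = sizes.map(({ collection, before, after }) => ({
  collection,
  factor: Math.round((before / after) * 10) / 10,
}));

for (const { collection, factor } of reductions) {
  console.log(`${collection}: ${factor}x`);
}
```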
---

# Problems We've Encountered

Split structure version storage

* However, as of last month (Oct 2017), the structure storage size had increased somewhat...

--

  2 TB! (uncompressed...)

--

* Dave Ormsbee and Microsoft (via OSPR) created a script to delete unused structures.

--

* Now, structure storage is back down to:

--

  300 GB (uncompressed...)

---

# Problems We've Encountered

Split Course Concurrent Edits

--

* MongoDB does not support multi-document transactions.

--

* Atomicity only at the single document level.

--

* Scenario: Suppose multiple course authors are modifying the same course simultaneously.

---

# Problems We've Encountered

Split Course Concurrent Edits

.float-left[]

---

# Problems We've Encountered

Split Course Concurrent Edits

.float-left.scale80[]

---

# Problems We've Encountered

Split Course Concurrent Edits

The Fix:

--

* At update_course_index() time, check the current structure ID.

--

* Is it the same structure ID from which my edited structure was derived?

--

* Yes? No problem - good to go.

--

* No? Edit conflict has occurred!

--

  Ask the author whether they want to merge, abort, or overwrite.

---

# Problems We've Encountered

Split Course Concurrent Edits

The Confession:

* The Open edX platform currently does *not* implement this fix. (sadface!)

--

* It's possible to experience lost changes with multiple concurrent course authors.

--

* NOTE: The changes aren't actually *lost*, just on a branch with no subsequent generations.

--

* Perhaps YOU! would like to fix this problem.

--

  We accept Open Source Pull Requests!

---

# Problems We've Encountered

GridFS

--

* Requirement: Binary assets can be stored & served to course students.

--

* Requirement: Support asset "locking" so that only course registrants can access the assets.

--

* Thinking: We're already using MongoDB to store courses.

--

  GridFS exists and can store binary data in MongoDB.

--

  Let's use it!

--

* Reasonable thoughts at a glance...
---

# Problems We've Encountered

GridFS

* The Root Problem:

--

  The application is reading the binary data from MongoDB via GridFS.

--

  And streaming it to each asset requestor.

---

# Problems We've Encountered

GridFS

* Imagine this scenario:

  A course deadline approaches for a course that's popular around the globe, particularly in parts of the world that have only low-bandwidth Internet connections.

--

  A large number of students request a large course asset (let's say 100 MB) at around the same time.

--

  It takes around 2.5 minutes to download a 100 MB file at mobile 3G speed (~6 Mbit/s).

---

# Problems We've Encountered

GridFS

* Imagine this scenario:

  Each course asset download ties up an application worker for the entire download time.

--

  Fewer and fewer workers to serve other website traffic.

--

  RESULT: Spikes, alerts, alarming death spiral, postmortems, etc.

---

# Problems We've Encountered

GridFS

* GridFS wasn't _really_ the problem here.

--

* The edx-platform usage of GridFS was the real problem.

--

* Nonetheless, a cautionary tale...

--

* We have mitigated this problem somewhat by serving some course assets _externally_. The asset is retrieved from AWS CloudFront, which uses LMS/GridFS as its origin.

--

  ...but for "unlocked" assets only.

---

# Thanks for your time!

* This talk's slides can be accessed here: