author | Priit Laes <plaes@plaes.org> | 2010-06-09 20:38:07 +0300 |
---|---|---|
committer | Priit Laes <plaes@plaes.org> | 2010-06-09 20:38:07 +0300 |
commit | 9e43093fd353f98d791936fa3deeeec9da22cf27 | |
tree | 0842de8210e8057f8a6f0222d8b6889aa2c19c26 | |
parent | Added utility for initial portage->database sync (diff) | |
Added progress report for second week
-rw-r--r-- | docs/gsoc/02-report.txt | 121 |
-rw-r--r-- | utils/db_init.py | 1 |
2 files changed, 122 insertions, 0 deletions
diff --git a/docs/gsoc/02-report.txt b/docs/gsoc/02-report.txt
new file mode 100644
index 0000000..debf139
--- /dev/null
+++ b/docs/gsoc/02-report.txt
@@ -0,0 +1,121 @@
+This is a weekly progress report no. 2 for Project Grumpy.
+
+As reported previously, I am building a system to index portage packages
+and related metadata to make package maintainership a bit easier for
+developers.
+
+First, a few words about the document metadata storage. For this project, the
+plan is to use a document-oriented and schema-free database (MongoDB) instead
+of a regular relational database system (like SQLite or PostgreSQL).
+
+This also means that we can create a single document collection, where
+documents correspond to simply "category/package" and collection containing
+whole ebuild tree.
+
+Document itself in the collection, is just a JSON-formatted dictionary with
+following structure (beware, this is work in progress, so some things are
+still missing)::
+
+  {
+    # "package/category" (primary index, unique)
+    '_id' : string,
+
+    # Version of the schema, used internally (just in case)
+    'schema_ver' : integer,
+
+    # Package category
+    'cat' : string,
+
+    # Package name
+    'pkg' : string,
+
+    ## Data from metadata.xml
+    # List of herds maintaining this package
+    'herds' : [ string, ... ],
+    # Long description of the package
+    'ldesc' : string,
+    # List of maintainers (by email addresses)
+    'maintainers' : [ string, ... ],
+
+    ## Data from ebuilds itself (but should be general)
+    # Description
+    "desc" : string,
+    # Upstream url(s) (FIXME: Do we need list here?)
+    'homepage' : string,
+
+    # Array of all the package versions and their specific info
+    'ebuilds' : [
+      # Package version (from category/package-version)
+      'version' : string,
+
+      # Eapi version
+      "eapi" : integer,
+      # List of USE flags supported by this ebuild
+      'iuse' : [ string, ... ],
+      # Package keywords ("x86", "~amd64", ...)
+      'keywords : [ string, ... ],
+      # Licenses
+      'licence' : [ string, ... ],
+      # Package slot
+      'slot' : string,
+
+      # Need to figure out proper structure for these, so we can also
+      # map out USE flags ;)
+      'depend' : TODO!!!
+      'rdepend' : TODO!!!
+    ]
+  }
+
+So how about querying the data? That's easy. (Please note we are using MongoDB
+shell). So, what if a developer wants to know which packages he is supposedly
+maintaining::
+
+  > db.ebuilds.find({'maintainers' : '...@gentoo.org' })
+  {... document data ...} # (Too much info :) )
+  > db.ebuilds.find({'maintainers' : '...@gentoo.org' }).count()
+  7
+
+And the results come fast. I mean really fast.
+Ok, how about checking how many packages under 'dev-python' are using specific
+EAPI version::
+
+  > db.ebuilds.find({'cat' : 'dev-python', 'ebuilds.eapi' : 0}).count()
+  202
+  > db.ebuilds.find({'cat' : 'dev-python', 'ebuilds.eapi' : 1}).count()
+  3
+  > db.ebuilds.find({'cat' : 'dev-python', 'ebuilds.eapi' : 2}).count()
+  255
+  > db.ebuilds.find({'cat' : 'dev-python', 'ebuilds.eapi' : 3}).count()
+  125
+  > db.ebuilds.find({'cat' : 'dev-python', 'ebuilds.eapi' : 4}).count()
+  0
+  > db.ebuilds.find({'cat' : 'dev-python' }).count()
+  504
+  > 202+3+255+125 - 504
+  81
+
+Ahem.. looks like we have a "design issue" with our document structure. So
+back to the drawing board.
+
+Last week's progress report
+===========================
+
+Last week's progress has been a bit slow, I have mostly played with document
+structure and played a bit with pkgcore's internals. Although I now have
+portage contents inside the database the document structure itself is far from
+ideal (as you can see from the example with EAPI counts given earlier).
+
+I have committed some of the stuff I have been working on into Grumpy's repo,
+so in case you are interested check it out from [1].
+
+[1] http://git.overlays.gentoo.org/gitweb/?p=proj/grumpy.git;a=summary
+
+First a warning, the portage->mongodb syncer is slow. I mean really slow - it
+takes about 3 hours (or even more) on my laptop to fully scan the contents of
+portage and store the data in database.
+
+Plans for current week
+======================
+
+1) Speed up the portage syncer
+2) Improve document structure
diff --git a/utils/db_init.py b/utils/db_init.py
index c5d6e74..c71a5ef 100644
--- a/utils/db_init.py
+++ b/utils/db_init.py
@@ -43,6 +43,7 @@ def main(path):
     eapi = pkg.eapi,
     keywords = list(pkg.keywords) if pkg.keywords else [],
     # TODO, need to figure out a proper queryable structure for these
+# iuse ??
 # license = pkg.license,
 # depends = pkg.depends,
 # rdepends = pkg.rdepends
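Note on the EAPI counts in the report: the per-EAPI counts add up to 585 while the category holds 504 packages, which is the 81 surplus the report calls a "design issue". One likely explanation is that a MongoDB query on a dotted field such as 'ebuilds.eapi' matches a document when any element of the array matches, so a package that carries ebuilds with different EAPIs is counted once per EAPI. The following is only a minimal sketch of that effect in plain Python, not the project's code: the package names, versions, and EAPI values are made up, and it assumes each entry of 'ebuilds' is a dictionary.

```python
from collections import Counter

# Hypothetical package documents; names, versions and EAPI values are
# invented for illustration and assume 'ebuilds' is a list of dictionaries.
packages = [
    {"_id": "dev-python/foo", "cat": "dev-python",
     "ebuilds": [{"version": "1.0", "eapi": 0}, {"version": "2.0", "eapi": 2}]},
    {"_id": "dev-python/bar", "cat": "dev-python",
     "ebuilds": [{"version": "0.5", "eapi": 2}]},
    {"_id": "dev-python/baz", "cat": "dev-python",
     "ebuilds": [{"version": "3.1", "eapi": 2}, {"version": "3.2", "eapi": 3}]},
]


def count_packages_with_eapi(docs, eapi):
    """Mimic find({'ebuilds.eapi': eapi}).count(): a document matches
    when ANY of its ebuilds has the requested EAPI."""
    return sum(1 for doc in docs
               if any(ebuild["eapi"] == eapi for ebuild in doc["ebuilds"]))


def count_ebuilds_by_eapi(docs):
    """Count individual ebuilds per EAPI, so each ebuild is counted once."""
    counts = Counter()
    for doc in docs:
        for ebuild in doc["ebuilds"]:
            counts[ebuild["eapi"]] += 1
    return counts


# Per-EAPI package counts, as in the shell session from the report.
per_eapi_package_counts = {
    eapi: count_packages_with_eapi(packages, eapi) for eapi in range(4)
}

# Packages with mixed-EAPI ebuilds are counted more than once.
overcount = sum(per_eapi_package_counts.values()) - len(packages)

ebuild_counts = count_ebuilds_by_eapi(packages)

print(per_eapi_package_counts)
print("overcount:", overcount)
print(dict(ebuild_counts))
```

In this toy data two of the three packages have ebuilds with different EAPIs, so the per-EAPI package counts sum to more than the number of packages. Counting per ebuild instead of per package avoids the double counting, at the cost of answering a slightly different question.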