Added progress report for second week

author: Priit Laes <plaes@plaes.org> 2010-06-09 20:38:07 +0300
committer: Priit Laes <plaes@plaes.org> 2010-06-09 20:38:07 +0300
commit: 9e43093fd353f98d791936fa3deeeec9da22cf27 (patch)
tree: 0842de8210e8057f8a6f0222d8b6889aa2c19c26
parent: Added utility for initial portage->database sync (diff)
download: gsoc2010-grumpy-9e43093fd353f98d791936fa3deeeec9da22cf27.tar.gz
gsoc2010-grumpy-9e43093fd353f98d791936fa3deeeec9da22cf27.tar.bz2
gsoc2010-grumpy-9e43093fd353f98d791936fa3deeeec9da22cf27.zip
2 files changed, 122 insertions, 0 deletions
diff --git a/docs/gsoc/02-report.txt b/docs/gsoc/02-report.txt
new file mode 100644
index 0000000..debf139
--- /dev/null
+++ b/docs/gsoc/02-report.txt
@@ -0,0 +1,121 @@
+This is weekly progress report no. 2 for Project Grumpy.
+
+As reported previously, I am building a system to index portage packages
+and related metadata to make package maintainership a bit easier for
+developers.
+
+First, a few words about the document metadata storage. For this project, the
+plan is to use a document-oriented and schema-free database (MongoDB) instead
+of a regular relational database system (like SQLite or PostgreSQL).
+
+This also means that we can create a single document collection that holds
+the whole ebuild tree, where each document corresponds to a single
+"category/package".
+
+Each document in the collection is just a JSON-formatted dictionary with the
+following structure (beware, this is a work in progress, so some things are
+still missing)::
+
+	{
+		# "category/package" (primary index, unique)
+		'_id'			: string,
+
+		# Version of the schema, used internally (just in case)
+		'schema_ver'	: integer,
+
+		# Package category
+		'cat'			: string,
+
+		# Package name
+		'pkg'			: string,
+
+		## Data from metadata.xml
+		# List of herds maintaining this package
+		'herds'			: [ string, ... ],
+		# Long description of the package
+		'ldesc' 		: string,
+		# List of maintainers (by email addresses)
+		'maintainers' 	: [ string, ... ],
+
+		## Data from the ebuilds themselves (but should be general)
+		# Description
+		'desc'			: string,
+		# Upstream URL(s) (FIXME: Do we need a list here?)
+		'homepage'		: string,
+
+		# Array of all the package versions and their specific info
+		'ebuilds' 	: [ {
+			# Package version (from category/package-version)
+			'version'	: string,
+
+			# EAPI version
+			'eapi' 		: integer,
+			# List of USE flags supported by this ebuild
+			'iuse'		: [ string, ... ],
+			# Package keywords ("x86", "~amd64", ...)
+			'keywords'	: [ string, ... ],
+			# Licenses
+			'licence'	: [ string, ... ],
+			# Package slot
+			'slot'		: string,
+
+			# Need to figure out proper structure for these, so we can also
+			# map out USE flags ;)
+			'depend'	: TODO!!!
+			'rdepend'	: TODO!!!
+		}, ... ]
+	}
+
+So how about querying the data? That's easy (note that we are using the MongoDB
+shell here). Say a developer wants to know which packages they are supposedly
+maintaining::
+
+	> db.ebuilds.find({'maintainers' : '...@gentoo.org' })
+	{... document data ...} # (Too much info :) )
+	> db.ebuilds.find({'maintainers' : '...@gentoo.org' }).count()
+	7
+
+And the results come fast. I mean really fast.
+OK, how about checking how many packages under 'dev-python' use a specific
+EAPI version::
+
+	> db.ebuilds.find({'cat' : 'dev-python', 'ebuilds.eapi' : 0}).count()
+	202
+	> db.ebuilds.find({'cat' : 'dev-python', 'ebuilds.eapi' : 1}).count()
+	3
+	> db.ebuilds.find({'cat' : 'dev-python', 'ebuilds.eapi' : 2}).count()
+	255
+	> db.ebuilds.find({'cat' : 'dev-python', 'ebuilds.eapi' : 3}).count()
+	125
+	> db.ebuilds.find({'cat' : 'dev-python', 'ebuilds.eapi' : 4}).count()
+	0
+	> db.ebuilds.find({'cat' : 'dev-python' }).count()
+	504
+	> 202+3+255+125 - 504
+	81
+
+Ahem... the per-EAPI counts add up to 81 more than the total, so it looks like
+we have a "design issue" with our document structure. Back to the drawing board.
+
+Last week's progress report
+===========================
+
+Last week's progress has been a bit slow; I have mostly experimented with the
+document structure and poked at pkgcore's internals. Although I now have the
+portage contents inside the database, the document structure itself is far from
+ideal (as you can see from the example with EAPI counts given earlier).
+
+I have committed some of the stuff I have been working on into Grumpy's repo,
+so in case you are interested check it out from [1].
+
+[1] http://git.overlays.gentoo.org/gitweb/?p=proj/grumpy.git;a=summary
+
+A warning first: the portage->mongodb syncer is slow. I mean really slow: it
+takes about 3 hours (or even more) on my laptop to fully scan the contents of
+portage and store the data in the database.
+
+Plans for current week 
+======================
+
+1) Speed up the portage syncer
+2) Improve document structure
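
A note on the EAPI counts in the report above: in MongoDB, a dotted query such
as {'ebuilds.eapi' : 2} matches a document when any element of its 'ebuilds'
array has that value, so a package that ships ebuilds at several EAPI levels is
counted once per level. That is the likely source of the surplus of 81. The
sketch below imitates this matching rule in plain Python (no MongoDB needed).
The sample packages and the helper names count_packages and count_ebuilds are
made up for illustration and are not part of Grumpy.

```python
from collections import Counter

# Hypothetical sample documents following the report's structure.
SAMPLE_DOCS = [
    {"_id": "dev-python/foo", "cat": "dev-python",
     "ebuilds": [{"version": "1.0", "eapi": 0}, {"version": "2.0", "eapi": 2}]},
    {"_id": "dev-python/bar", "cat": "dev-python",
     "ebuilds": [{"version": "0.3", "eapi": 2}]},
    {"_id": "dev-python/baz", "cat": "dev-python",
     "ebuilds": [{"version": "1.1", "eapi": 2}, {"version": "1.2", "eapi": 3}]},
]


def count_packages(docs, cat, eapi):
    """Imitate find({'cat': cat, 'ebuilds.eapi': eapi}).count():
    a document matches if ANY of its ebuilds has the given EAPI."""
    return sum(
        1 for doc in docs
        if doc["cat"] == cat and any(e["eapi"] == eapi for e in doc["ebuilds"])
    )


def count_ebuilds(docs, cat):
    """Count individual ebuilds per EAPI, so each ebuild is counted once."""
    counts = Counter()
    for doc in docs:
        if doc["cat"] == cat:
            counts.update(e["eapi"] for e in doc["ebuilds"])
    return counts


per_eapi = {n: count_packages(SAMPLE_DOCS, "dev-python", n) for n in range(4)}
total = sum(1 for doc in SAMPLE_DOCS if doc["cat"] == "dev-python")
surplus = sum(per_eapi.values()) - total  # packages counted more than once
print(per_eapi, total, surplus)
print(count_ebuilds(SAMPLE_DOCS, "dev-python"))
```

Counting ebuilds rather than packages (or, as one possible redesign, storing
one document per ebuild) makes the per-EAPI numbers add up to the total.
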
diff --git a/utils/db_init.py b/utils/db_init.py
index c5d6e74..c71a5ef 100644
--- a/utils/db_init.py
+++ b/utils/db_init.py
@@ -43,6 +43,7 @@ def main(path):
            eapi = pkg.eapi,
            keywords = list(pkg.keywords) if pkg.keywords else [],
             # TODO, need to figure out a proper queryable structure for these
+#           iuse ??
 #           license = pkg.license,
 #           depends = pkg.depends,
 #           rdepends = pkg.rdepends
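
One possible answer to the TODO above, as a sketch only: store iuse and license
as flat lists of strings, the same way keywords is already stored, so that
MongoDB can match them element by element. FakePkg is a hypothetical stand-in
for a pkgcore package object; real pkgcore attributes (license and the
dependency sets in particular) are structured objects and may need more careful
flattening than str(). Unlike the original list(...) call, this version also
sorts the values.

```python
from collections import namedtuple

# Hypothetical stand-in for a pkgcore package; only the attributes used below.
FakePkg = namedtuple("FakePkg", "fullver eapi keywords iuse license")


def as_string_list(values):
    """Return a sorted list of strings, or [] when the value is empty or None."""
    return sorted(str(v) for v in values) if values else []


def ebuild_entry(pkg):
    """Build one entry of the 'ebuilds' array from a package-like object."""
    return dict(
        version=pkg.fullver,
        eapi=pkg.eapi,
        keywords=as_string_list(pkg.keywords),
        iuse=as_string_list(pkg.iuse),
        license=as_string_list(pkg.license),
    )


pkg = FakePkg("1.2-r1", 2, ("~amd64", "x86"), frozenset(["doc", "test"]), None)
entry = ebuild_entry(pkg)
print(entry)
```

Dependencies are left out on purpose: they carry USE-conditional structure that
a flat list would lose.
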