fedora-meeting
LOGS

20:00:29 <mmcgrath> #startmeeting
20:00:32 <dgilmore> gday mmcgrath
20:00:40 * ricky 
20:00:45 <mmcgrath> #topic Infrastructure -- Who's here?
20:00:51 * johe|home takes a seat
20:00:52 <mmcgrath> dgilmore: how's it going?
20:00:59 * SmootherFrOgZ is
20:01:08 * sijis sijis is here.
20:01:09 * ke4qqq is
20:01:12 <dgilmore> mmcgrath: 2 builders to go
20:01:36 <SmootherFrOgZ> dgilmore: for stg ?
20:01:37 <mmcgrath> dgilmore: excellent, happy to hear it.
20:01:38 <smooge> hello
20:01:42 <mmcgrath> Well lets get started
20:01:48 <mmcgrath> #topic Infrastructure -- Tickets
20:01:55 <mmcgrath> .tiny https://fedorahosted.org/fedora-infrastructure/query?status=new&status=assigned&status=reopened&group=milestone&keywords=~Meeting&order=priority
20:01:56 <zodbot> mmcgrath: http://tinyurl.com/47e37y
20:02:03 <mmcgrath> .ticket 1503
20:02:04 <mmcgrath> abadger1999: take it
20:02:07 <zodbot> mmcgrath: #1503 (Licensing Guidelines for apps we write) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/1503
20:02:17 <dgilmore> SmootherFrOgZ: nope
20:02:26 <abadger1999> So we've had a new license pop up in apps we've written recently
20:02:29 <abadger1999> AGPLv3+
20:02:47 <abadger1999> That's incompatible with GPLv2 which is what the majority of our apps use presently.
20:03:12 <abadger1999> After looking over the situation with spot, it seems like it would be good to move everything to AGPLv3+.
20:03:26 <dgilmore> im ok with the move
20:03:31 <abadger1999> (With libraries going to LGPLv2+)
20:03:35 <smooge> abadger1999, when you say we use.. do you mean we right or other stuff
20:03:42 <abadger1999> We write.
20:03:46 <smooge> s/right/write/
20:03:49 <smooge> thanks
20:04:01 <abadger1999> smooge: This would not affect code that we don't write.
20:04:12 <abadger1999> And it's a recommendation rather than a hard and fast rule.
20:04:31 <mmcgrath> abadger1999: have you run into anyone saying "ehh, I don't think we should do this." ?
20:04:35 <abadger1999> ie: mdomsch wants mirrormanager to be MIT; mediawiki plugins should follow mediawiki's license
20:04:47 <abadger1999> mmcgrath: So far everyone's been positive.
20:04:58 <mmcgrath> abadger1999: ok, so how do we actually _do_ it?
20:05:03 <mmcgrath> sed?
20:05:37 <abadger1999> yeah, we have to replace COPYING files with AGPL/LGPL and then change the headers in source files.
20:05:40 <smooge> well you need to look at each app and see if it's something we wrote or pulled in from somewhere else
20:06:04 <sijis> do you need to get written proof from author before changing?
20:06:23 <ricky> How urgent is this time-wise?
20:06:24 <smooge> if it's pulled in we need to deal with it.. if it's something we wrote 100% we should be able to replace COPYING/headers
20:06:33 <abadger1999> sijis: for the majority of things no, but I am going to notify authors of pkgdb and python-fedora before I make changes.
20:06:44 <mmcgrath> ricky: I'd say not real urgent, but the longer we wait... the longer we're going to wait I suspect.
20:06:47 <ricky> For example, with FAS, I'd like to eventually rewrite the OpenID provider part instead of dealing with licensing pain because of samadhi or anything.
20:06:53 <abadger1999> sijis: The CLA gives us the ability to do a relicense if the contribution was made without an explicit license.
20:07:21 <mmcgrath> abadger1999: some seemed timid about that on f-a-b.  I'm less timid.
20:07:22 <abadger1999> <nod> ricky the other option is to find out what jcollie thinks about AGPLv3+
20:07:28 <mmcgrath> but we should ask
20:07:43 <mmcgrath> abadger1999: lets take an app like fas first.
20:07:44 <mmcgrath> just see how it goes.
20:08:14 <abadger1999> yeah, it's common courtesy and also gives people a chance to holler "Oh wait, I actually didn't own the copyright to that code.. sorry."
20:08:33 <mmcgrath> abadger1999: are you going to lead the effort on this?
20:08:37 <abadger1999> I'd like to do python-fedora soon. It's moving to LGPLv2+ which is more permissive
20:08:42 <mmcgrath> should we open a ticket for each app?
20:08:43 <sijis> how many apps are we talking about for this? +/-15?
20:08:46 <abadger1999> mmcgrath: I can.  Yes, each app.
20:08:48 <mmcgrath> sijis: less than 15
20:08:52 <abadger1999> sijis: Less than 15
20:09:25 <mmcgrath> abadger1999: sounds good, so anything else?
20:09:39 <abadger1999> A ticket for each app will let us come back next week and say -- half of our app authors like a licensing policy but don't want to change *their* app.
20:09:45 <abadger1999> Which would mean we need to rethink.
20:10:09 <abadger1999> I think that's all unless someone wants to shout that it's a bad idea now :-)
20:10:25 <mmcgrath> anyone have anything to say?  If not now, take it to the list.
20:10:27 <mmcgrath> and do it sooner, not later.
20:10:45 <mmcgrath> Ok, so next topic
20:10:54 <mmcgrath> #topic Infrastructure -- The merge, outages and issues.
20:11:00 <mmcgrath> So we had a merge last week.
20:11:07 <mmcgrath> and since the merge we've had some issues
20:11:13 <mmcgrath> and it's not something obvious.
20:11:16 <smooge> define merge for me?
20:11:21 <mmcgrath> and, in fact, could be completely unrelated.
20:11:29 <mmcgrath> smooge: merge from staging to master branches in puppet.
20:11:37 <ricky> smooge: We made a ton of changes in the staging branch and merged them to production :-)
20:11:46 <mmcgrath> Which basically involved refactoring a bunch of puppet code, cleaning things up, creating some new modules, etc, etc.
20:11:54 <mmcgrath> I've not seen a wiki outage since yesterday.
20:12:00 <mmcgrath> I need to go through the logs and look.
20:12:21 <mmcgrath> while doing some digging we, just in general, found strange issues in our environment.
20:13:04 <smooge> mmcgrath, ricky thanks..
20:13:20 <smooge> what have been the strange ones
20:13:27 <mmcgrath> for example - http://mmcgrath.fedorapeople.org/proxy-errors.html
20:13:51 <mmcgrath> 200,000+ 502's per day.
20:13:55 <mmcgrath> just seems massive to me.
20:14:00 <ricky> In terms of the big outages, they've all seemed to happen during mysql database backups (which lock tables) or smolt render stats jobs.
20:14:22 <ricky> The proxy errors and 500s seem to be something else though.
20:14:28 <mmcgrath> <nod>
20:14:40 <mmcgrath> and our current lead on the 500's errors for fas is a new mod_wsgi
20:14:44 <ricky> Have the 500 errors stayed normal?
20:14:44 <mmcgrath> jbowes is working on that.
20:15:03 <ricky> (As in, have they gone up after the merge or not?)
20:15:31 <mmcgrath> ricky: hard to say
20:15:46 <mmcgrath> http://mmcgrath.fedorapeople.org/JuneErrors.html
20:15:54 <mmcgrath> I'll re-check today now that it's been a few more days.
20:15:57 <mmcgrath> clearly we had a major spike
20:16:10 <sijis> mmcgrath: the first graph shows it being mostly proxy2
20:16:14 <mmcgrath> but it seems to have gone back down.
20:16:21 <ricky> Strange.
20:16:22 <mmcgrath> sijis: yeah, and proxy2 is an odd beast.
20:16:31 <mmcgrath> proxy2 is load balanced with proxy1 behind the PHX balancer.
20:16:35 <mmcgrath> _however_
20:16:45 <mmcgrath> anything in phx uses proxy2 directly to get to the account system.
20:16:51 <mmcgrath> which not only includes shell accounts.
20:17:03 <mmcgrath> but also includes our web applications contacting fas for session, auth, etc.
20:17:09 <mmcgrath> which is a significant amount of traffic.
20:17:18 <smooge> interesting.. is there a reason for just proxy2?
20:17:21 <ricky> Funny that proxy1 seems fine.
20:17:30 <mmcgrath> ricky: well it does get a lot less traffic.
20:17:34 <ricky> Like it didn't jump significantly at all.
20:17:39 <ricky> I guess.
20:17:41 <mmcgrath> smooge: the network team won't let us contact the balancer IP directly.
20:18:03 <sijis> so you are forced to pick a proxy?
20:18:05 <smooge> ah ok could we setup another proxy?
20:18:22 <mmcgrath> smooge: we have two of them there.
20:18:28 <mmcgrath> but no good way to balance between the two of them.
20:18:47 <mmcgrath> we could put a load balancer in there, but it'd be just another box, and would need to be rebooted as often as proxy2 is anyway
20:18:58 <ricky> Is the problem really coming from our PHX admin.fp.o setup though?
20:19:06 <smooge> mmcgrath, no what I meant was one that was just for that so we could cut down on what might be causing the errors?
20:19:12 <ricky> The 502s really jumped everywhere, so that's what I want to know the root cause of.
20:20:04 <smooge> so if it's a bruteforce attack on stuff we could get an idea of what app is being targeted or something
20:20:26 <mmcgrath> I think the errors are on our end, I need to do more log checking to know for sure though
20:20:29 <ricky> But the brute force shouldn't be causing 502, it should be working :-)
20:20:41 <mmcgrath> but yeah we can add and remove more proxy servers in PHX if we want to
20:20:58 <ricky> mmcgrath: Can we separate that graph into apache 502s and haproxy 502s?
20:21:10 <ricky> Right now they're lumped together in the source where you're getting it from, right?
20:21:37 <mmcgrath> ricky: I don't think so, because if haproxy or the app server returned a 502, apache would log a 502.
20:21:46 <mmcgrath> so proxyX will always have our largest number of 502's
20:21:55 <mmcgrath> then haproxy (if we're logging that, not even sure)
20:21:58 <mmcgrath> then the app server
20:22:14 <mmcgrath> although the app servers probably don't throw 502
20:22:17 <ricky> mmcgrath: But some 502s are coming from apache, as in proxy1 couldn't contact localhost:10009
20:22:32 <ricky> Those are the strangest ones to me.
20:22:42 <mmcgrath> I'll have to look closer then.
20:22:47 <sijis> firewall?
20:23:08 <ricky> sijis: I don't think so - it definitely works a large percent of the time
20:23:20 <mmcgrath> sijis: I'd actually think that's the app server not responding to haproxy, and thus not responding to the proxy server.
20:23:41 <ricky> But that should strictly cause haproxy 502s not apache 502s, correct?
20:23:41 <mmcgrath> and I'm not seeing us hitting our haproxy limit.
20:23:46 <ricky> and we've seen both :-(
20:23:55 <mmcgrath> ricky: when looking at the logs, how can you tell the difference?
20:24:15 <mmcgrath> oh from it saying it couldn't contact localhost:10009
20:24:21 <ricky> I'm not sure.  I'd expect the apache 502s to show up in the apache error log and both types of 502s to show up in the access log.
20:24:31 <ricky> I'll have to verify that tohugh.
20:24:32 <ricky> **though
20:24:38 <mmcgrath> hm
20:24:39 <mmcgrath> hm
20:24:40 <mmcgrath> hmmmm
20:24:59 <ricky> Was your source for these graphs the error log or the access log?
20:25:09 <mmcgrath> access I believe
20:25:11 <sijis> is haproxy on a different server or on proxy2?
20:25:12 * mmcgrath looks
20:25:26 <mmcgrath> sijis: each proxy server has its own haproxy service on the same host
20:25:51 <mmcgrath> ricky: access.log
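A rough sketch of how per-day 502 counts like the ones in the graphs above could be pulled out of an access log. It assumes the standard Apache common/combined log layout; the function name and the sample lines are hypothetical and are not taken from the actual graphing script.

```python
import re
from collections import Counter

# Matches the date and status fields of an Apache common/combined log line, e.g.
# 10.0.0.1 - - [10/Jun/2009:20:13:51 +0000] "GET /wiki HTTP/1.1" 502 512
LOG_RE = re.compile(
    r'\[(?P<day>\d{2}/\w{3}/\d{4}):[^\]]*\] "[^"]*" (?P<status>\d{3}) '
)


def count_status_per_day(lines, status="502"):
    """Return a Counter mapping 'DD/Mon/YYYY' to the number of hits with `status`."""
    counts = Counter()
    for line in lines:
        match = LOG_RE.search(line)
        # Lines that do not look like access-log entries are skipped silently.
        if match and match.group("status") == status:
            counts[match.group("day")] += 1
    return counts
```

Running this once per proxy host would give the per-proxy breakdown sijis pointed out. It cannot tell apache-generated 502s from haproxy-generated ones, since both end up with the same status in access.log.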
20:26:09 <mmcgrath> perhaps we should continue discussing this after the meeting.
20:26:12 <ricky> Ah, OK.
20:26:15 <mmcgrath> any objections?
20:26:20 <ricky> Sure thing
20:26:42 <sijis> nope.
20:27:12 <mmcgrath> # topic Infrastructure -- Eye in know db.  - INNODB
20:27:22 <mmcgrath> #topic Infrastructure -- Eye in know db.  - INNODB
20:27:33 <mmcgrath> ricky: this one's you.  Talk about your plans, what's going on, what's going wrong, etc.
20:27:35 <smooge> is that a rock band?
20:27:40 <ricky> Any MySQL experts around, by the way?  :-)
20:27:50 <mmcgrath> ricky: abadger1999 is a mysql expert
20:27:52 <Jeff_S> ricky: for some definition of expert
20:27:53 <mmcgrath> :-P
20:28:17 <ricky> Part of the big outages we've seen since the merge seems to be due to mysql backups (and smolt's stats refresh script, which might be a separate problem)
20:28:36 <ricky> We've seen this behavior with the zabbix database, where the backup would lock entire tables
20:28:43 <abadger1999> ricky: Yep, of the yum erase '*ysql' ; yum install 'postgres*' variety
20:28:46 <ricky> abadger1999: Hehe
20:28:47 * mmcgrath notes we've always had a small problem with backups and outages.  But they've been tiny blips.  Lately they've been throwing nagios alerts.
20:29:33 <smooge> how many mysql databases do we have?
20:29:35 <ricky> We'd like to move to using the --single-transaction option to mysqldump, which combined with InnoDB, should make backups not lock the entire table
20:30:02 <Jeff_S> ricky: yes!
20:30:03 <ricky> The main mysql usage we have is mediawiki, smolt, and zabbix
20:30:18 <ricky> Although we have a few others for stuff like cacti, prelude/prewikka, etc.
20:30:20 <Jeff_S> ricky: FWIW, we've also had good luck with http://www.zmanda.com/backup-mysql.html (community edition)
20:30:31 <smooge> ricky, are they separate servers or one single one
20:30:49 <ricky> Jeff_S: Thanks, I'll take a look at that later
20:30:54 <ricky> smooge: They're all on db1
20:30:57 <mmcgrath> smooge: all mysql db's are on db1
20:31:06 <ricky> So far, the biggest pain we've had is the host_links table in smolt
20:31:16 <mmcgrath> ricky: and how big is it?
20:31:19 <mmcgrath> O:-)
20:31:25 <ricky> It has above 70M rows, and I haven't gotten a single successful conversion to InnoDB yet.
20:31:45 <ricky> And the thing with --single-transaction is that the tables need to be InnoDB to be sure that everything gets dumped in a consistent state
20:31:59 <Jeff_S> but single-transaction will probably solve your main problem of locking the table(s)
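A minimal sketch of the backup invocation being discussed, assuming mysqldump is on the PATH and credentials come from the usual client config. The helper names and the output path are hypothetical. `--single-transaction` and `--quick` are real mysqldump options; the consistent-snapshot behaviour only applies to transactional (InnoDB) tables.

```python
import subprocess


def build_dump_command(database, single_transaction=True):
    """Build the mysqldump argument list for one database."""
    cmd = ["mysqldump"]
    if single_transaction:
        # --single-transaction: dump from one consistent snapshot instead of
        #   locking tables (only consistent for InnoDB tables).
        # --quick: stream rows rather than buffering whole tables in memory.
        cmd += ["--single-transaction", "--quick"]
    cmd.append(database)
    return cmd


def dump_to_file(database, path):
    """Run the dump and write it to `path`; raises if mysqldump fails."""
    with open(path, "wb") as out:
        subprocess.run(build_dump_command(database), stdout=out, check=True)
```

For example, `dump_to_file("smolt", "/tmp/smolt.sql")` would run `mysqldump --single-transaction --quick smolt`. Any MyISAM tables left in the database are still dumped, just without the consistency guarantee.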
20:32:00 <abadger1999> ricky: We're able to dump that table?  Are we able to reload it except as innodb?
20:32:02 <smooge> wow thats quite a bit
20:32:07 <mmcgrath> ricky: and what are the downsides to innodb?  (space, etc, etc)
20:32:16 <abadger1999> slower
20:32:17 <Jeff_S> mmcgrath: slower at certain operations
20:32:22 <ricky> So the approaches that we've tried so far are: converting using alter table, and sedding a dump to change the table type, and loading it.
20:32:26 <mmcgrath> how much slower?
20:32:38 <ricky> The first didn't finish after some large number of hours, and the second is going now.
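The second approach (editing the dump so the tables are created as InnoDB on reload) might look roughly like this. It is a sketch, not the command that was actually run: it assumes mysqldump-style output where each CREATE TABLE ends with a line such as `) ENGINE=MyISAM DEFAULT CHARSET=latin1;`, and the `keep_myisam` parameter is a hypothetical way to leave selected tables untouched.

```python
import re

# Only the closing line of a CREATE TABLE statement is rewritten, so row data
# that happens to contain the text "ENGINE=MyISAM" is left alone.
ENGINE_RE = re.compile(r"^(\).*\bENGINE=)MyISAM\b")
CREATE_RE = re.compile(r"CREATE TABLE `([^`]+)`")


def convert_engine(lines, keep_myisam=()):
    """Yield dump lines with MyISAM table definitions switched to InnoDB."""
    current_table = None
    for line in lines:
        match = CREATE_RE.match(line)
        if match:
            current_table = match.group(1)
        if current_table not in keep_myisam:
            line = ENGINE_RE.sub(r"\g<1>InnoDB", line)
        yield line
```

Because it works line by line as a generator, the 2.5G smolt dump never has to be held in memory. Very old dumps that use `TYPE=MyISAM` instead of `ENGINE=MyISAM` would need a different pattern.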
20:32:59 <ricky> mmcgrath: I'm actually not that sure about the downsides yet.  Apparently loading huge tables is a huge pain.
20:33:02 <mmcgrath> ricky: I'm going to want render-stats metrics too
20:33:05 <Jeff_S> mmcgrath: depends on the dataset & queries.  the locking though more than makes up for it IMO
20:33:22 <ricky> Also, some tables needed MyISAM for full text search - the only table affected by this is mediawiki's searchindex tables
20:33:32 <abadger1999> :-(
20:33:34 <ricky> (Which is just a copy of another InnoDB table, I believe)
20:33:45 <mmcgrath> ricky: and, in theory, we'll be able to get rid of that when we have a fedora search engine.
20:33:52 <ricky> Hopefully.
20:34:23 <ricky> Anyway, we'll probably have a mysql outage some time in the future once we get a successful test in staging.
20:34:39 <Jeff_S> mmcgrath: one of our past employees wrote this, I think it explains the reasons for using InnoDB pretty well http://tag1consulting.com/MySQL_Engines_MyISAM_vs_InnoDB
20:34:39 <mmcgrath> ricky: yeah, how have the other conversions gone?
20:34:44 <ricky> what might be the case now is that maybe our configs aren't tuned for large innodb tables.
20:34:50 <smooge> ok what books/sites should I read to catch up how to help this. (DB's are not my specialty :/)
20:35:08 <ricky> mmcgrath: All of the other tables in the smolt db other than host_links have finished in <20 minutes
20:35:35 <ricky> Apart from the smolt db, most of the mediawiki db is already innodb
20:36:03 <ricky> The other databases that need conversions are: cacti, prelude._format, prewikka, and transifex (which isn't used anymore anyway)
20:36:04 <mmcgrath> ricky: I believe I went through and did some innodb conversions back in the day on some of those.
20:36:48 <ricky> prelude and prewikka are pretty much dispensable since that stuff is still being tested (lmacken even purged and recreated some of those dbs recently)
20:37:05 <mmcgrath> ricky: how big were those dumps?
20:37:32 <ricky> So smolt is basically the big hurdle - although I have some questions about the smolt upgrade and the db changes there
20:37:45 <ricky> The dump of the smolt database is 2.5G
20:37:52 * lmacken looks at the time, and rolls in late
20:38:00 <mmcgrath> ricky:
20:38:00 <mmcgrath> alter table host modify column cpu_model varchar(80);
20:38:01 <mmcgrath> alter table host add column cpu_stepping int(11) DEFAULT NULL;
20:38:01 <mmcgrath> alter table host add column cpu_family int(11) DEFAULT NULL;
20:38:01 <mmcgrath> alter table host add column cpu_model_num int(11) DEFAULT NULL;
20:38:06 <mmcgrath> that's the smolt upgrade.
20:38:16 <ricky> mmcgrath: Oh, OK - that's no problem at all then.
20:38:38 <ricky> The host table took <20 minutes, so we can do that before or after, and it's fine
20:38:39 * mmcgrath doesn't really even know what "int(11)" means
20:38:44 <lmacken> have you guys been using SQLAlchemy-migrate for that stuff? or doing it by hand?
20:38:46 <mmcgrath> I need to look that up :)
20:39:05 <mmcgrath> lmacken: honestly I can't stand alchemy-migrate so I've been doing it by hand.
20:39:20 <lmacken> mmcgrath: heh.  I've never used it before
20:39:41 <mmcgrath> :)
20:39:47 <mmcgrath> ricky: ok, so anything else on the db front?
20:40:17 <ricky> Nope, but if anybody knows a lot about MySQL, let us know about your experiences with stuff like this
20:40:20 <ricky> Jeff_S: Thanks again for the links!
20:40:46 <mmcgrath> k
20:40:50 <Jeff_S> ricky: np.  I'm glad to have our current DBA lend a hand if needed
20:40:53 <mmcgrath> #topic Infrastructure -- Posse
20:41:05 <mmcgrath> So I haven't been as transparent with this as I should be
20:41:09 <mmcgrath> It's basically this
20:41:12 <mmcgrath> #link http://teachingopensource.org/index.php/POSSE_2009
20:41:21 <mmcgrath> we're providing some guests for a week for them to use.
20:41:42 <mmcgrath> +1 to open source :)
20:41:43 <ricky> Is it going to be on fasClient?  :-)
20:41:55 <mmcgrath> ricky: nope, they're completely disconnected atm.
20:42:06 <mmcgrath> this is their first time through this.
20:42:10 <ricky> Ah, OK
20:42:11 <mmcgrath> maybe next year.
20:42:16 <mmcgrath> but all of these guests are on cnode1
20:42:20 <mmcgrath> part of the cloud stuff.
20:42:22 <smooge> what servers are their guest on
20:42:27 <ricky> Hehe
20:42:29 <smooge> ah
20:42:31 <mmcgrath> I ended up not using osuosl1
20:42:46 <mmcgrath> since it's RHEL5 and for some reason xen+fedora 11 seems to be my white whale.
20:42:54 <mmcgrath> but cnode1 was F10, and using KVM worked just fine
20:43:05 <mmcgrath> Anyone have any other questions on that?
20:43:48 <mmcgrath> Ok
20:43:52 <mmcgrath> #topic Infrastructure -- Open Floor
20:43:58 <mmcgrath> anyone have anything they'd like to discuss?
20:44:16 <lmacken> I'm going to be deploying a new version of bodhi tonight/tomorrow to support EPEL :)
20:44:37 <lmacken> hopefully we'll be able to start queueing updates up tonight
20:44:38 <smooge> yeah
20:44:43 <lmacken> and ideally mashing repos tomorrow
20:45:26 <mmcgrath> lmacken: sounds good
20:45:33 <mmcgrath> and on a related note, I need to rebuild relepel1
20:45:43 * mmcgrath fail built it
20:45:57 <mmcgrath> anyone have anything else?
20:45:57 <mmcgrath> smooge: ?
20:46:20 <smooge> sorry
20:46:27 <smooge> keyboard problems
20:46:47 <smooge> I am checking to see what boxes need updates and I am working on seeing what ones I can do
20:46:56 <smooge> I should have that done by tonight/tomorrow.
20:47:16 <smooge> After that I am checking to see that func and puppet are working on the boxes
20:47:32 <smooge> and then finding out all the secret handshakes and such
20:47:41 <mmcgrath> heheh
20:47:43 <mmcgrath> fun times
20:47:56 <smooge> I should have the func done by friday and then it will be time to work on zabbix
20:48:03 <mmcgrath> smooge: excellent.
20:48:16 <mmcgrath> Ok, and with that if no one has anything else we'll close in 30
20:48:16 <smooge> zabbix will be next weeks project
20:48:16 <smooge> done
20:49:09 <mmcgrath> ok everyone, thanks for coming!
20:49:12 <mmcgrath> #endmeeting