On the Soapbox

Smarter Spammers

Tuesday, July 18, 2006
Keywords: kBlog, Technology

Sigh.

Blog spam used to be only a minor nuisance. From the very beginning, there were attempts at comment spam, as indicated by my server logs. Fortunately, incompetent spamming software coupled with a bit of security by obscurity (since I'm like the only person using the kBlog blogging platform) shielded me. Of course, that didn't hold for long, since not all spamming bots are so incompetently written...

But even then, it was easy to deal with, since all I needed to do was filter by technical heuristics, such as the use of HTTP/1.0 (commonly used by bots/scripts, but not by real browsers), whether redirects are properly followed, and whether auxiliary files like CSS and images are accessed (as a real browser would do, but not most bots). Well, at least, these filtering heuristics used to work.
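To make the three heuristics concrete, here is a minimal Python sketch. It is not kBlog's actual filter (kBlog is written in Perl); the `Session` structure, the `looks_like_bot` function, and the list of asset suffixes are all assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Requests seen from one client before it posts a comment (hypothetical structure)."""
    http_version: str                          # e.g. "HTTP/1.0" or "HTTP/1.1"
    paths: list = field(default_factory=list)  # every path the client requested
    followed_redirect: bool = False            # did it follow the redirect it was sent?


# Assumed set of "auxiliary file" extensions; a real site would tune this.
ASSET_SUFFIXES = (".css", ".png", ".gif", ".jpg")


def looks_like_bot(session: Session) -> bool:
    """Return True if any of the three heuristics flags the client."""
    if session.http_version == "HTTP/1.0":
        return True                            # real browsers speak HTTP/1.1
    if not session.followed_redirect:
        return True                            # real browsers follow redirects
    fetched_assets = any(p.lower().endswith(ASSET_SUFFIXES) for p in session.paths)
    return not fetched_assets                  # real browsers fetch CSS and images
```

A client that requests only the comment form over HTTP/1.0 is flagged; one that uses HTTP/1.1, follows the redirect, and fetches the stylesheet passes. That second case is exactly the kind of client these checks cannot tell apart from a person.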

These bots are now smart enough to emulate real browsers in every way, from the use of HTTP/1.1 to the downloading of images and CSS files. Also, in the past few days, I've been hammered by comment spammers (they used to come by only occasionally). The spam would come in bursts, and during these bursts, the rate of attempts could be as high as one per second. This leaves me in the undesirable position of being forced to address comment spamming through content filtering. And we all know what a hornet's nest that is...

The Joys of Blogging

Friday, February 17, 2006
Keywords: Blogging, Me, kBlog, Potpourri

I think that I have been somewhat sheltered from the world these past few years. I used to hungrily consume news and keep up with the world in high school, but I stopped doing that at HMC, partly because of time and also partly because the place oozed apathy. Also during these years, I had completely missed out on the rise of the blog, despite my usual desire* to keep myself atop the crest of technology. It wasn't that I didn't have a blog; since 1998, my website has always featured a blog-like section where I would post my latest ramblings on a somewhat regular basis (though it was somewhat different structurally from the canonical blog of today). No, it was because I simply did not believe in it. Over half a decade ago, people were rushing to get Blogger accounts, "blog" was starting to turn into the latest new buzzword/hype, and places like LiveJournal were brimming with people posting daily details of their personal life (it took me a while to finally disassociate blogging from LJ-esque diaries). All this left me with a sour bias, which is why I never even referred to my now-defunct second-generation blog (which was an awkward mix of a heavily sanitized diary plus some dull commentary) on my old website as a "blog" and why I never paid much heed to the growth of the "blogosphere", both as a word and as the thing itself. Ultimately, I was oblivious to the blog...

...until now. And boy, is the blogosphere addictive or what? There are many precious nuggets that I read every day, such as this little excerpt (source) that I read today about what it means to be a moderate:

Perhaps the best definition of a moderate is someone who does not derive all of their political opinions from one or two first principles and stick to them no matter where that may lead them. Those first principles may be relatively crude ("the moral environment that prevailed in the 1950s should be held onto") or fairly sophisticated ("we must maximize the power of the weak over the strong"), but regardless of their origin, they tend to make people into extremely rigid voters. People who see themselves as trading off a whole bunch of values, will have political opinions that are in general less extreme. They will also be more tolerant of other peoples' viewpoints, because they tend to assume that other people are simply weighting different values differently--rather than concluding that the difference of opinion must be caused by some terrible moral failing on the part of others.

I now have about 20 feeds (and quickly growing) in my RSS aggregator. But perhaps most importantly for me, blogging has allowed me to reemerge from the sheltered bubble of HMC's apathy and reconnect with my old self. I enjoy reading a variety of perspectives and insights on the affairs of the world, and my blog has become a delightful outlet for a lot of my own thoughts. Instead of letting the thoughts that occupy my mind during mundane tasks like showering, brushing, and eating evaporate or get lost on the countless pieces of scrap paper that litter my archives (yes, I do think about free markets in the shower; call me a freak), I can now preserve and express them here. I know I don't have much of an audience here, but that doesn't matter because this is mostly for me, and if I get an audience, that would just be a nice bonus. :)

I thoroughly love this, and I only wish that I had started this blog years ago and that I had paid more attention to the rich blogosphere. Of course, that is not to say that the blogosphere is entirely good; most of the blogs are not that interesting or well-written (that probably includes mine), and most of them are not very thoughtful or carry too much bias of dogma (see the above excerpt on moderation or my own rant about lack of moderation), but there are enough gems out there that all this is much more satisfying than watching yet another movie or the other mundane things that I could do to fill my free time.

Anyway, that's enough of me gabbing on about this topic. I've been blogging and reading blogs for nearly a month now, and I've learned a lot. For one, I have learned that my features wish-list for the blogging system that I use has grown a lot. Remember what I said** about kBlog 0.1.0 being nearly feature-complete? Never let a non-blogger decide what features would be nice. I'm currently in the middle of a major overhaul (most of it is stuff that visitors won't notice) of kBlog (version 0.3.0), so that will occupy my free time for a few days, and I'll probably end up doing a couple more versions to add new features after that before finally going for 1.0. I've considered moving to one of the mainstream blogging software packages, but the hassle of installing, configuring, and migrating to something like WordPress is simply not worth it (and I hate reading manuals), especially since the hassle of hacking up Perl to add some pet features to kBlog would probably be about the same, and, most importantly, I'm too comfortable with the control and flexibility of using an in-house system, and for a control freak like me, that means something. ;)

* I started browsing the WWW back in the days of Netscape 1. I used VoIP for the first time in the 90's when AIM added voice to one of its betas. I got a Hotmail account back when it was the Gmail of its era, before it was bought by Microsoft. I first encountered Google back when it was a little-known beta with a very, very crappy-looking logo. I've installed Mozilla-based browsers since Mozilla 0.6, and used Firefox long before it was named Firefox. So I would at least like to think that I keep up with technology. ;)

** That reminds me, I never did get around to posting the kBlog source code. Oh well. I'll post the source when 0.3.0 is finished.

Time/Date Format

Monday, January 23, 2006
Keywords: kBlog, Technology

I'm genuinely curious about this: what the heck were they thinking in RFCs 822 and 2822 when they set the format for date-time? Why does the format look like "Sun, 20 Oct 2002 23:47:15 GMT" instead of "2002/10/20 23:47:15 GMT"? Okay, the day-of-week is optional, thank goodness, but was it really necessary to force the use of named months instead of a numerical month? Why expend that extra effort (albeit not that much, but I guess it could add up if you're working with a lot of data and this was 1982) converting from a numerical value to text and back again? Not to mention, the burden added to the programmer. Imagine doing that with C. Not that it's difficult or time-consuming, but just one more annoying thing to have to take care of.
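To show the extra bookkeeping that named months impose, here is a minimal Python sketch of a hand-rolled parser for the first format. It is only an illustration: the function name `parse_rfc822_gmt` is made up, it handles only the GMT zone, and it skips the other zone forms that the RFCs allow.

```python
import calendar

# The lookup table that a numeric month would make unnecessary.
MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}


def parse_rfc822_gmt(text: str) -> int:
    """Parse e.g. 'Sun, 20 Oct 2002 23:47:15 GMT' into Unix seconds (GMT only)."""
    if "," in text:
        text = text.split(",", 1)[1]           # drop the optional day-of-week
    day, month_name, year, clock, zone = text.split()
    if zone != "GMT":
        raise ValueError("this sketch only handles GMT")
    hour, minute, second = (int(part) for part in clock.split(":"))
    # calendar.timegm interprets the tuple as UTC, unlike time.mktime.
    return calendar.timegm(
        (int(year), MONTHS[month_name], int(day), hour, minute, second, 0, 0, 0))
```

With a purely numeric format such as "2002/10/20 23:47:15 GMT", the `MONTHS` table and the day-of-week handling would both disappear, and every field could be converted with a plain integer parse.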

I just added support for "conditional GETs" to kBlog after noticing that Apache was logging script errors for the 304 response code, which meant that kBlog had to parse the If-Modified-Since line in the request header. Great. So it's either load an extra module (Date::Parse) or manually parse the string and feed it to POSIX::mktime. Neither of which is difficult thanks to the power of Perl (this blog entry probably took just as long to do), but it's the principle of the matter: what good comes out of this inefficiency in the specification?
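For readers who haven't met conditional GETs, the decision itself is small. The sketch below shows it in Python rather than Perl, taking the "load a module" route via the standard library's `email.utils.parsedate_to_datetime`. The function name `conditional_status` and its signature are assumptions for illustration, not kBlog's code, and `last_modified` is assumed to be a timezone-aware datetime.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def conditional_status(if_modified_since, last_modified):
    """Return 304 if the client's cached copy is still current, else 200.

    if_modified_since: raw header value, or None if the header is absent.
    last_modified: timezone-aware datetime of the page's last change.
    """
    if not if_modified_since:
        return 200
    try:
        client_time = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200                              # unparseable header: send the full page
    if client_time.tzinfo is None:
        client_time = client_time.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds before comparing.
    return 304 if last_modified.replace(microsecond=0) <= client_time else 200
```

Treating a malformed header as "no header" is a deliberate choice in this sketch: sending the full page is always safe, while a wrong 304 leaves the visitor with stale content.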

This entry was edited on 2006/01/23 at 02:10:29 GMT -0500.

kBlog, 0.1.0-Beta

Thursday, January 19, 2006
Keywords: kBlog

kBlog 0.1.0-Beta has been released today, 19 Jan 2006. I'll post the source code when I get around to it.

It's mostly feature-complete, but not quite. The backend needs to be re-written to use something other than XML, but it's not as bad as it sounds. The current backend was written in just two days, as the backend's behavior is very well-defined and it operates in a sort of controlled environment. The interface is the complicated part, needing to take into account user interaction, bad input, security, etc. while trying to provide a way for the user (meaning both visitors as well as the blog owner) to do things in an easy and intuitive fashion. And debugging. The code also needs to be audited to make sure that it's kosher to run it with mod_perl (in the event that I need to get a performance boost). There may be a few interface-related things that get changed, depending on what I happen to dream up in the interim. When all this is done, I'll call it 1.0 and get rid of the beta label, but since 0.1.0-Beta is fully functional, 1.0 is a low-priority task for me, meaning that it'll be done whenever I feel like it. :p

This entry was edited on 2006/02/12 at 17:06:34 GMT -0500.

Using XML for kBlog

Thursday, January 19, 2006
Keywords: kBlog, Technology

I remember a buzz a year or so ago about the use of XML as a data storage format, and how XML files would be ideal to take the place of a database in certain situations. And so, I decided to explore that possibility with the kBlog beta.

Since my needs were pretty simple, I didn't use a full-blown XML parser. The XML::Simple module sufficed. XML works, but it's not very efficient. There's the character escaping, there's the problem that in order to parse XML, you need to look for matching start and end tags, etc. I guess the XML::Simple module may not have been the best choice either, but in the end, it didn't seem to be that suitable. There are many other encodings that are more efficient to parse, and I think I'll switch to an encoding of my own for data storage when I do version 1.0, mostly so that kBlog would then depend only on the fairly common POSIX module (but a minor performance enhancement wouldn't hurt, either) (the XML RSS is simple enough that it is already being generated without XML::Simple; generally speaking, XML is easy to create, but annoying to parse without a parser).

I guess the big upside to XML is human readability. It's easy to read and to edit (well, except for all the character escaping that needs to go on; that cancels out a lot of the benefit, actually), but aside from that, XML parsing isn't as efficient. The other upshot is that XML is codified and is a standard. I just wish someone would codify a data transport/storage standard that's geared towards machine readability and parsing efficiency instead of human readability.
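As a rough illustration of what "geared towards machine readability" could mean, here is a Python sketch of a length-prefixed encoding, set against the escaping that XML needs for the same text. The format and the names `encode_fields` and `decode_fields` are invented for this example; this is not the encoding planned for kBlog, which the post does not describe.

```python
from xml.sax.saxutils import escape


def encode_fields(fields):
    """Encode a list of strings as '<byte length>:<utf-8 bytes>' chunks (hypothetical format)."""
    out = bytearray()
    for text in fields:
        raw = text.encode("utf-8")
        out += str(len(raw)).encode("ascii") + b":" + raw
    return bytes(out)


def decode_fields(data):
    """Inverse of encode_fields: read a length, then jump straight over the payload."""
    fields, pos = [], 0
    while pos < len(data):
        colon = data.index(b":", pos)
        length = int(data[pos:colon])
        start = colon + 1
        fields.append(data[start:start + length].decode("utf-8"))
        pos = start + length                    # no scanning for end tags or escapes
    return fields


entry = ["Fun with CSS", "<b>Tables</b> & floats"]

# XML has to rewrite every '<', '>' and '&' in the payload...
xml_text = escape(entry[1])     # "&lt;b&gt;Tables&lt;/b&gt; &amp; floats"

# ...whereas the length-prefixed form stores the payload untouched.
assert decode_fields(encode_fields(entry)) == entry
```

The trade-off is the one the post describes: the decoder never inspects the payload, but the file can no longer be safely edited by hand, since changing a field without updating its length corrupts everything after it.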

Personally, I think XML is a bit overhyped. It's just a transport/storage format, and there are people who talk it up as if it's somehow The Next Big Thing™.

Fun with CSS (or not)

Thursday, January 19, 2006
Keywords: kBlog, Technology

One of the things that I was looking forward to trying out was the use of CSS for layout. Prior to this, I've used CSS for formatting and tables for layout. It seemed to be the hip new thing to do, and everyone was doing it, so this should be fun, right?

The Good

The upside to doing this is the divorce of layout from the actual HTML. It was really refreshing to see how clean the HTML code looks compared to the code that I had for my previous website. What I used to do with lots of nested tables I can do now with a few DIVs. It's easier to tweak and adjust my layouts, too. The nicest thing is that for a script-generated site like this, having lighter HTML also means that the script that is generating the page is also much cleaner and easier to look at. And as a bonus, the site is also somewhat usable when CSS is disabled (in Firefox, go to View > Page Style to see what it looks like with all the CSS off), so when I'm browsing using a text browser like Lynx (which I sometimes do, actually), things work.

The Bad

And then there's Microsoft. Years ago, when the browser wars were raging between Netscape 4 and Internet Explorer, I was rooting for IE. This was before Gecko, of course. It was such a relief to have a browser that would render what you tell it to render instead of seemingly randomly deciding to make something either 10 pixels wider or narrower (which was what Netscape 4 was doing). I hated the web design process back then because I spent significantly more time tweaking the code to work around the various bugs in Netscape 4 than I spent working on the design itself, so I was not sorry to see Microsoft win the browser wars. But the times have changed, and ever since mid-2003, I've been using Firefox (or Firebird as it was called back then) as my default browser. So come time to do the layout, it was only natural for me to do all the preliminary work on my default browser, following the W3C specifications. That was a dumb thing to do, because as soon as I started to test the layout in IE, things got nasty, either because of spotty CSS implementation or because of the countless bugs in IE. I eventually ended up spending more time banging my head against the wall, cursing Microsoft, and working around the IE-related problems than I spent on the actual layout itself. Déjà vu with an ironic role reversal.

The Ugly

Whoever at the W3C decided that it would be a good idea to officially deprecate the use of tables in favor of CSS for layout should be shot. There are certainly many delightful benefits (as mentioned above), but there are severe shortcomings of using CSS for layout. The whole notion of floating the CSS boxes to do columns is really a hack. Just as tables were meant for tabular data and not for layout skeleton, CSS boxes were meant to format parts of a page and not to serve as a layout skeleton, and the liberal use of float to try to shoehorn it into the role of serving as a layout skeleton is just as unwieldy as the use of tables. It also doesn't help that IE is a bit bugged in respect to all this. The lack of stretchable dimensions (or at least the ability to define it with mathematical expressions without resorting to JavaScript) is annoying as well (no, the percentages don't count because you run into rounding problems with them and it doesn't help when sibling elements are fixed in size). Positioning is also a royal pain in the ass. You can either go with the flow, float the position, or take an absolute position in respect to the canvas. There's no way to take an absolute position in respect to something that is relatively positioned (like a parent element). This makes absolute positioning pretty much worthless unless you use it for the entire page (which is not only tedious and difficult to maintain in the long run, but also misses the point). In the end, I found that tables are MUCH easier to work with for layout. They may be ugly and bulky, but at least they are suitable. I do love the benefits of CSS layouts and would love to see the day when CSS layouts completely displace table layouts, but that will require enhancements to CSS to allow them to perform that role. It seems ridiculous that they would deprecate something without first ensuring that the replacement is suitable.

Oh, and another rant against the CSS box model: there are benefits to setting the width and height to the inner content width and height. This is especially true if the sizes of child elements are known and thus it's easy to set the parent's width and height. But in layouts, dimensions are often constrained not by child elements, but by sibling and parent elements, which is why Microsoft's way of doing dimensions in IE5.5 and earlier made a lot of sense: the dimensions of an object represent the content dimensions plus the padding and border. For example, if you have a parent div that is 400 pixels wide, and you want to fit two boxes with a margin space of 20 pixels in between, setting the width was easy: (400-20)/2. But with the official CSS box model, you have to toss the border and padding into that equation, making it slightly inconvenient (especially if you are tweaking the border and padding to see what looks best). This is especially troublesome when working with percentages. A child that has any sort of border or padding can never use a dimension of 100% (making it impossible to get a child with a border and/or padding to match the size of the parent if the parent's size is unfixed). A sensible solution would be to introduce an alternate measure of dimension like outer-width and outer-height, which can be set in lieu of width and height. Another solution would be to accept mathematical expressions in CSS, such as width: 100% - 8px.
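The arithmetic in the 400-pixel example can be spelled out in a few lines of Python. This is just a worked calculation; the function name `content_width` and the sample padding and border values (8px and 1px) are made up for illustration.

```python
def content_width(parent_width, gap, boxes, padding=0, border=0):
    """CSS 'width' for each of `boxes` equal columns under the W3C box model.

    `outer` is the width you would set under the IE5.5-style model, where
    width already includes padding and border.
    """
    outer = (parent_width - gap * (boxes - 1)) / boxes
    # Under the W3C model, padding and border on both sides come out of it.
    return outer - 2 * (padding + border)


# IE5.5-style: (400 - 20) / 2, regardless of padding and border.
assert content_width(400, 20, 2) == 190.0

# W3C model with 8px padding and a 1px border: 190 - 2 * (8 + 1).
assert content_width(400, 20, 2, padding=8, border=1) == 172.0
```

Every tweak to the padding or border changes the second number, which is the inconvenience being described: the declared width has to be recomputed each time.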

Welcome to kBlog!

Thursday, January 19, 2006
Keywords: kBlog

As you've probably already figured out by the name, kBlog is a blogging system that I wrote. And if you're familiar with blogging, then you probably know that there are already a lot of blogging systems out there, from services like Blogger and LiveJournal to pre-packaged install-on-your-own-server systems like WordPress and many others. So, given that, why the heck did I bother to create kBlog? Because I can (i.e., for fun) and because it suits my needs. :)

For fun?

I needed something to do in my spare time. Watching movies was getting boring, so a fun personal programming project seemed like a good thing to do at the time. Although I've done a few large programming projects over the past couple of years, I haven't done a major web-based programming project since mid-2001 (I've done a few smallish web-based things, but nothing that required a major backend). That's a long time in the Internet world. So this was a chance for me to roll up my sleeves and get my hands dirty with relatively new things, like CSS for layout (not just for formatting), the idea of using a blog as the main component of a site (rather than a half-forgotten tacked-on novelty, which was what my circa-2001 blog was like), etc.

Necessity?

Getting my hopelessly-outdated behemoth of a website updated and modernized was always something that was in the back of my mind. I finally decided on just throwing the damn thing out and starting over from scratch, and using a blog as the core. Given the low volume of visitors, using a database seemed overkill. I wanted something lightweight and portable--something that I can easily throw onto just about any web server running Apache without changing much. And so I started to look at the many pre-packaged blog systems, but the few that used a non-database backend were unsuitable for me--they were either too ugly or didn't offer the kinds of customization and flexibility that I wanted. Ultimately, this contributed significantly to my decision to write my own blogging system.

Never say "How hard could it possibly be?"

Writing kBlog was certainly an interesting experience. I started out in late November, right before Thanksgiving. After finishing the backend and a mock-up of the layout in less than a week, I thought that this would be something that I could finish pretty quickly. Well, seeing as how it's now late January, it's not hard to see that I was a wee bit off on that estimate. Long story short, I hit a few snags, got sidetracked, and let the project collect dust on the shelf for over a month. I'll write about the development process in more depth in another post.

About kBlog

kBlog is written entirely in Perl because Perl is such a great and powerful language.* The Perl requirements are very lightweight: only the POSIX and XML::Simple modules (the latter will be removed for version 1.0) are used. Formatting is done entirely using CSS for layout (okay, that was a bitch to get right, no thanks to Microsoft). The backend data is stored in XML files, but this will change with version 1.0 (XML is used for the beta because I wanted to explore the viability of XML for storage, but it is too inefficient for my tastes). To speed things up in the unlikely event of a /.-like effect (I've actually been /.ed once, but that was also on a dedicated server and not this dinky thing sitting in my room), all served pages are cached on the server side so that data-access and page-generation processing is done only the first time a page is accessed by anyone. Whenever a change is made to the data, the cache is cleared. Since most /. effects involve only page serving and not much interaction (like comment-posting), I feel that this should be able to hold up fairly well.
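The caching scheme described above (render on first access, clear everything on any data change) can be sketched in a few lines. This is a Python illustration of the idea only, not kBlog's Perl implementation. The class name `PageCache`, its methods, and the in-memory dictionary are assumptions, and the post does not say where kBlog actually keeps its cached pages.

```python
class PageCache:
    """Render each page once, then serve the stored copy until the data changes."""

    def __init__(self, render):
        self._render = render      # function: path -> HTML string (the expensive part)
        self._pages = {}           # path -> rendered HTML

    def get(self, path):
        """Return the page for `path`, rendering it only on the first request."""
        if path not in self._pages:
            self._pages[path] = self._render(path)
        return self._pages[path]

    def invalidate(self):
        """Clear the whole cache; call this after any change to the data."""
        self._pages.clear()
```

Clearing everything on every change is the bluntest possible invalidation policy, but it is easy to get right, and on a site where reads vastly outnumber writes the cost of re-rendering a few pages after each post or comment is small.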

* One of the things that changed since 2001 is what I call the "downfall" of Perl. It's being slowly supplanted by Python (and recently, by Ruby) as the scripting language of choice, so the reader may wonder why I stuck with a dinosaur like Perl. The first major downside to Perl is that it's difficult for people to read/parse and thus not suitable for team projects, enterprise settings, or other situations where code maintenance is important. Fortunately, I am fluent enough in Perl that I can even think in it (that's what happens when you've used a language for nearly a decade), so that downside didn't affect me enough to choose another language. The other downside is that the OO component of Perl is somewhat tacked-on and a bit awkward, but there's no need for OO in a project such as this.

This entry was edited on 2006/02/16 at 02:09:31 GMT -0500.