November 7, 2007

Don't forget about Web Services

In a recent web app, I needed to determine whether a domain name provided by the user is available or has already been registered. I first decided to research Web Services! They're out there, many of them are free, and they are easy to use.

Implemented as a Remote Procedure Call (RPC), you connect with these services through HTTP using a protocol such as SOAP.

Call a web service method by using your scripting languages HTTP libraries or methods. When making calls, method arguments can be provided in the query string. Here is an example in pseudo code:

result = context.httpCall(server='www.webservicex.net',page='/whois.asmx/GetWhoIS?hostname=' + domain)

The example above makes a call to a whois server, using a popular and free web service network, webservicex.

The result will be formatted as XML, generally using the SOAP protocol. Now, every developer in the world will frown on me for saying this, but it must be said! Why bother learning SOAP or XML! Just parse out the value that is useful to you from the response.

For example, the return XML above is just a long string in your scripting languages eyes. Do some simple string parsing as follows:

if result.find('No match') > 0:
    return True
else:
    return False
August 15, 2007

Preventing spam on your blog

I'm fairly new to the blogging world, as you can see by looking at my oldest post. I've already run into spam issues that cause hundreds of spam comments per day to be posted to my articles.

My first course of action was to install a CAPTCHA plugin. This particular plugin works for my blogging system, which is bBlog. It can be downloaded at http://www.handgestrickt.biz/bblog/files/File/Downloads/bblog_captcha_0_2.zip.

So I installed the CAPTCHA plugin, only to find that my comment SPAM has only been reduced by about 10%. After analyzing the comments, I discovered that 99% of them were small messages such as 'nice post' or 'great blog' or contained nothing but a link to another website.

With these findings, I decided simply to filter out all comments less than 50 characters in length as well as comments containing links. The PHP equivalent of this filter is:

if ((strpos($comment, '<a') === false) && strlen($comment) > 50)

If you are using bBlog, you can insert this conditional in the bBlog.class.php script as follows:

if ((strpos($comment, '<a') === false) && strlen($comment) > 50)
{
  $q = "insert into ".T_COMMENTS."
    set $parentidq
    postid='$postid',
    title='$title',
    posttime='$now',
    postername='$postername',
    posteremail='$posteremail',
    posterwebsite='$posterwebsite',
    posternotify='$notify',
    pubemail='$pubemail',
    pubwebsite='$pubwebsite',
    ip='$remaddr',
    commenttext='$comment',
    onhold='$onhold',
    type='comment'";
  $this->query($q);
  $insid = $this->insert_id;
}
if($insid < 1) {
  $this->standalone_message("Error", "Error inserting comment: Comment may be too short or contain links. This is a spam prevention mechanism.");
August 1, 2007

Redirection with PHP header

Just today I created a web form with a 'Submit' and a 'Cancel' button.

<input type="submit" value="OK" name="submit" />
<input type="submit" value="Cancel" name="submit" />

The form submits to a script with the following code at the top:

if ($_REQUEST['submit'] == "Cancel") {
    header ("Location: ../index.php"); // redirect back to start page
}

// process form
...

So, what's the problem? As a coder, I would expect the code to stop executing at the point where the header call is made. Not so, at least in my PHP implementation.

It was as simple as placing an exit statement at the end of the conditional:


if ($_REQUEST['submit'] == "Cancel") {
    header ("Location: ../index.php"); // redirect back to start page
    exit();
}

// process form
...
July 21, 2007

Recursive SQL Queries

Earlier in the week I talked about creating object databases and the benefits of using one as the back-end for a content-based website.

For instance, lets say we have the following URL:

http://www.mysite.com/articles/how-to/home/plumbing

The following items in the URL are objects in the database:

  • plumbing (parent is home)
  • home (parent is how-to)
  • how-to (no parent)

So your object table might look like:

id parent_id object_type local_id title
1 NULL category how-to How To
2 1 category home Home
3 2 article plumbing Plumbing

Not all DBMSs offer recursive queries, so this method only applies to those that don't. The following SQL query will return a list of those objects in hierarchical form:

SELECT t1.local_id as local_id1, t1.title as title1, t2.local_id as local_id2, t2.title as title2, t3.local_id, t3.id, t3.object_type, t3.user_id, t3.title, t3.ts_modified FROM object AS t1 LEFT JOIN object AS t2 ON t2.parent_id = t1.id LEFT JOIN object AS t3 ON t3.parent_id = t2.id WHERE t1.local_id = 'how-to' AND t2.local_id = 'home' AND t3.local_id = 'plubming'

Now, the question is, how can we dynamically generate such a query? In PHP, you can use the following code as a guide.

$object_path = array_slice(explode('/', $_SERVER['PATH_INFO']),1);

$depth = sizeof($object_path);

$select = 'SELECT ';
for ($i = 1; $i < $depth; $i++)
  $select .= "t{$i}.local_id as local_id{$i}, t{$i}.title as title{$i}, ";

$select .= 't' . $depth . '.local_id, t' . $depth . '.id, t' . $depth . '.object_type, t' . $depth . '.user_id, t' . $depth . '.title, t' . $depth . '.ts_modified ';

$from = 'FROM object AS t1 ';
$join = '';
for ($i = 2; $i <= $depth; $i++) {
  $prev = $i - 1;
  $join .= ' LEFT JOIN object AS t' . $i . ' ON t' . $i . '.parent_id = t' . $prev . '.id ';
}

$where = "WHERE t1.local_id = '{$object_path[0]}'";
for ($i = 2; $i <= $depth; $i++)
  $where .= " AND t{$i}.local_id = '{$object_path[$i - 1]}'";

$sql = $select . $from . $join . $where;
July 16, 2007

SEO friendly URLs and Object Databases

Most of us are aware that search engines have a much easier time digesting URLs that do not include query string parameters. Search engines, mainly Google, especially frown upon pages that use the ‘id=’ parameter to locate a specific record in a database for retrieval.

The big question is, how do we pull content from a database without identifying it’s primary key with the ‘id=’ query string parameter. The answer starts with your web server and ends with an object database.

I found it particularly important to provide search engine friendly URLs for one site I was building and decided to do a little research. I wrote a small tutorial on configuring the Apache web server . This configuration allowed me to define a script, article.php, where any path information in the URL following article would be provided to the script. For example, if the following URL was called:

mysite.com/article/how-tos/home-repair/replace-electrical-outlets

The ’/how-tos/home-repair/replace-electrical-outlets’ would be provided to my script, in the PATH_INFO server variable. It then becomes a simple matter of parsing the path into an array, or any other structure useful to you.

This is where the next problem came into play. I had many different tables in my database, each referring to a specific type of object. Let’s say I had a table for categories, articles and reviews. If I were to use the path above, how will my code know which table / object ‘replace-electrical-outlets’ refers to. Is it a category, article, or a review?

I looked to the object database for a solution to my problem. After becoming handy with the Zope object database, I decided to create my own. Not knowing anything about the concepts and design behind them, I came up with something that worked for my purposes.

Basically, I have one “master” table in my database called ‘object’. This table contains:

  • id
  • parent_id
  • user_id
  • object_type
  • local_id
  • title

Also, a few other fields not relevant to this article. Id, being the primary, basically gave each object a globally unique identifier, so even two objects of different types would never have the same id. Parent_id allows our hierarchical relationship between objects, where an ‘article’ can have a ‘category’ parent, or even a ‘category’ can have another ‘category’ parent.

The User_id specifies which user created this object. Object_type simply indicates which type of object this is, in my case, a ‘category’, ‘article’, or ‘review’. Local_id is the name you see in the URL. In the case of the example article, local_id would be ‘replace-electrical-outlets’.

And finally, title specifies the human readable title, in this case, ‘Replace Electrical Outlets’.

In some cases, certain objects had other attributes. Reviews, for example, needed to have a ranking from 1 to 5, so an integer field was necessary. Since not all objects needed this extra field, I was reluctant to add it to the ‘object’ table. Instead, I created a completely new table called ‘review’. The review table contained the following fields:

  • object_id
  • ranking

The object_id is simply a one-to-one relationship with the ‘object’ table’s id. Now, when a ‘review’ record is identified in the object table, we use the ‘object_type’ field to specify that this object is a ‘review’ record and a second query needs to be called on the ‘review’ table to retrieve the extra information.

So far this approach has been working for me. I’m sure there are more efficient ways, but for my small site, this seemed to make the most sense.