Creating some sort of search functionality is something we all have to do at one point or another. For the good cases, we have fulltext search or an external program available to us. But sometimes, getting down and dirty with plain SQL is the only option. When going the SQL route, I often see most of the ranking logic going on in the PHP still, like so:
$sql = 'SELECT * FROM Posts WHERE Body LIKE "%term%"';
$rs = $db->query($sql);
while($row = mysql_fetch_array($rs, MYSQL_ASSOC)) {
$count = substr_count($row['Body'], "term");
// push count into an array
}// array has all records. order by count then display
Over the weekend I was playing around with doing all of this in SQL. This immediately became a pain because there is no easy way to count substrings, but what we can do is:
count = (strlen(Body) - strlen(str_replace(Body, term)))/strlen(term)
Let’s plug in some numbers.
- Body: Hello, he said. What’s her name?
- Term: he
- Count: (32 - 26)/2 = 6/2 = 3
And then we can translate that to SQL:
SELECT
SUM(((LENGTH(Body) - LENGTH(REPLACE(Body, 'term', '')))/4))
AS Occurrences
Now that we know how to find the amount of occurrences of one string within another, we can build up basic rankings by including groupings, multiple terms, and joins. Below is an example of a query that searches one row across multiple columns and returns them from best match to worst.
SELECT
SUM(((LENGTH(Body) - LENGTH(REPLACE(Body, 'term', '')))/4) +
((LENGTH(Body) - LENGTH(REPLACE(Body, 'search', '')))/6))
AS Occurrences
FROM Posts
GROUP BY PostId
ORDER BY Occurrences DESC
Have you played with Zend_Search_Lucene? I haven’t had an excuse to use it yet, but it looks very cool.
In terms of your first example - using PHP to count sub-strings - can you use preg_match_all?
$matches = array(); $count = preg_match_all (“/term/i”, $row[‘Body’], $matches);
Either way, doing it in the SQL is way cooler…
That’s great… I was just trying to figure out a good way to do this when I didn’t have fulltext indexing at my disposal in a database… thanks so much.
Nathan, I’ve actually just begun implementing Lucene in a project. So far, it seems like a great solution. I’m new to it though, so still working out the indexing.
Dave, your way should work also.
Funny timing, I literally implemented Zend’s Lucene search yesterday. In some preliminary testing, it’s been VERY fast at returning results, but there are definitely some limitations as it’s not fully developed. For example, wildcard searches for example aren’t implemented yet, which is a huge drawback. It also converts numbers in your search string to spaces, so searching for “term1” will rank “term1”, “term2”, “term3” the same. I do think it will be a great solution given some time though…like I said, it’s crazy fast.
Ryan’s solution is interesting, but I can’t help but wonder about performance. While using SQL for the logic is a little more ninja, I can’t help but feeling it will be slower, especially on large tables. Have you done any performance testing on this Ryan?
Awesome! Keep up these small SQL gems, they’re very useful. *bookmarked
I had the same thought, Adam. If I have a chance to run some tests, I’ll post the results. Right now, it works fine for my dev environment with 1000 records and a one to many relationship join in the query. My guess would be that this doesn’t scale too well though.
Wont it be significantly slower in SQL over say 100,000 rows, versus letting PHP handle the count?
This isn’t relevant to your post, but what are you using for this comment system? Whatever it is, I think it looks awesome.
I think this is the best way to rank results by occurences for small to mid size websites as it outputs something relevant when it comes to the score result. Very nice example.
I think the best way to do this is to use the Full-Text Search Functions in MySql ie. MATCH (col1,col2,…) AGAINST (expr [IN BOOLEAN MODE | WITH QUERY EXPANSION]). According to the MySQL manual :
The MATCH() function performs a natural language search for a string against a text collection. A collection is a set of one or more columns included in a FULLTEXT index. The search string is given as the argument to AGAINST(). The search is performed in case-insensitive fashion. For every row in the table, MATCH() returns a relevance value, that is, a similarity measure between the search string and the text in that row in the columns named in the MATCH() list.
When MATCH() is used in a WHERE clause, as in the preceding example, the rows returned are automatically sorted with the highest relevance first. Relevance values are non-negative floating-point numbers. Zero relevance means no similarity. Relevance is computed based on the number of words in the row, the number of unique words in that row, the total number of words in the collection, and the number of documents (rows) that contain a particular word.
For natural-language full-text searches, it is a requirement that the columns named in the MATCH() function be the same columns included in some FULLTEXT index in your table. For the preceding query, note that the columns named in the MATCH() function (title and body) are the same as those named in the definition of the article table’s FULLTEXT index. If you wanted to search the title or body separately, you would need to create FULLTEXT indexes for each column
Ready to like