CF11 issue

I’m trying to find an efficient way of pulling in a website’s metadata keywords (from the <meta> tag). The server is running CF11.

So far I’ve tried using the CFHTTP tag to pull in the data, but based on what I’m reading online people don’t seem to recommend using regular expressions for this task. The alternative seems to involve finding or building some sort of HTML parser, but I haven’t found any that work well, and I don’t have control over the server so I’m not able to install anything on it. I looked into using ColdFusion’s XMLPARSE, but that doesn’t seem to be what I’m after either.

The websites I’m going to pull this data from are not standardized, so I can’t rely on the <meta name="keywords" {…} /> tag being in the same format every time. It could be missing, the name attribute could come first or last, and the tag could end in either /> or just >.

Any tips on how to do this without using too much processing power? I am looking for a solution that is efficient. The result should just be a string of keywords found on the website I point it at.

You want to look at jsoup.

Add the jar to your CF server and you can very easily use it for parsing HTML.

It uses a selector syntax very similar to jQuery which makes it really easy and powerful.

Try parsing the HTML as XML and look it up with xpath expressions.

I tried storing the entire page as a string and parsing it using XMLParse(), but the function doesn’t seem to be designed to make it easy to traverse the HTML DOM structure and pull out the information you want. For this I was hoping to find something similar to a jQuery selector that finds the object you want and lets you easily pull out whatever information you’re looking for. I need a server-side solution though, so I can’t use client-side stuff.

Do you mean a different approach than the one I took though? I am not familiar with XPath expressions and am not sure how to approach the problem from that angle, but I will read up on them tomorrow, thanks!

There are a LOT of other libraries out there in other languages that can traverse through elements like jQuery can, using element, ID, and class selectors.

I’ve done it in Ruby, PHP, and C#. I’m not aware of any for CFML.

XPath is not HTML-specific; it’s a way to select and traverse XML nodes by element name, attributes, etc. It should be pretty easy if you’re just looking to get meta tags.
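
To make the XPath idea concrete, here is a minimal sketch in plain Java (ColdFusion runs on the JVM), using only the XML classes bundled with the JDK. The class name MetaKeywords and the method keywords are made up for illustration. It assumes the page is well-formed XML: a <meta> tag that ends in just > rather than /> will make the parser reject the document, so on real-world pages this only works for clean XHTML. The translate() call lowercases the name attribute, so "Keywords" and "KEYWORDS" match too, and attribute order does not matter.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of ColdFusion or any library.
final class MetaKeywords {
    // Selects the content attribute of the first <meta> whose name attribute,
    // lowercased with translate(), equals "keywords". string() of an empty
    // node-set is "", so a missing tag yields an empty string.
    private static final String EXPR =
        "string(//meta[translate(@name, 'KEYWORDS', 'keywords') = 'keywords']/@content)";

    static String keywords(String xhtml) {
        try {
            // Parse the page text into a DOM tree; this fails on non-XML HTML.
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            return XPathFactory.newInstance().newXPath().evaluate(EXPR, doc).trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse or query the input", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Demo</title>"
            + "<meta content=\"coldfusion, parsing\" name=\"Keywords\"/>"
            + "</head><body/></html>";
        System.out.println(keywords(page)); // prints: coldfusion, parsing
    }
}
```

In CFML the same expression should be usable with XmlSearch() on the result of XmlParse(), with the same well-formedness caveat. For pages that are not valid XML, a tolerant HTML parser such as jsoup is the more robust route.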
