In early January, I released a major update of the DIY-IT Project Guide, our compendium of all the maker-style projects I've done here on ZDNet. It consisted of a curated list of what was then 347 articles (there are more now), organized by category. The Project Guide was, itself, a project that involved coding, design, productivity tricks, and organization techniques.
Here's how I put it all together.
Collecting the raw article links
My DIY-IT column, like all the other blogs here on ZDNet, is essentially a category in our content management system. Unfortunately, there was no easy way to simply export the names and URLs of all the articles. And I just didn't want to cut and paste hundreds of links and titles.
Instead, I wrote a small scraping program in PHP to traverse the 24 pages of article listings and capture the relevant information. Pulling out just the data I wanted from the pages crowded with cross-links, navigation elements, advertising components, and text blocks required some creative thought.
Each page served to a browser is presented in HTML, which is a form of structured source code. Because ZDNet pages are relatively well designed, the key was to dig through the source code and locate two "selectors," items that uniquely identified what I needed.
The first selector was the element identifying the article itself. Fortunately, most of the article titles were represented with H3 tags, so it was a mere matter of programming to pull out the A HREF code following the H3 tags.
I needed two selectors because I also needed to be able to find the next page link so I could traverse the entire set of pages containing my article listings.
To programmers, the phrase "mere matter of programming" is shorthand for "how the heck am I going to do this?" Scraping data from a Web page can be quite a challenge. Fortunately, modern Web pages are designed using CSS, and CSS uses selectors to function. The idea is that you use those selectors to specify how a part of the page gets styled. On a typical ZDNet index page, the article titles are styled in a specific way, so they can be selected.
I found a great little PHP library to help with this called Simple HTML DOM. DOM stands for Document Object Model and is used to describe the logical structure of the document - in this case an HTML page.
With Simple HTML DOM available to help me parse the retrieved Web pages, all I needed was code to retrieve the pages themselves. For that, I used another function of Simple HTML DOM, the file_get_html function. You give file_get_html a URL and it returns the contents in a variable.
With those two tools, I was ready to rock. The code begins by asking for a starting page. It then runs a loop that retrieves the contents of a page, goes through the page to pull out all the article references, and then checks to see if there are more pages. If there are more pages, it does it all again.
The code for this whole thing is attached to the end of this article. Yes, there is code.
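As a rough illustration of that loop, separate from the PHP listing, here is a minimal sketch in Python using only the standard library. It is not the program described above: the class and function names are made up for the example, and the rule for spotting the next-page link (an A tag with class "next") is an assumption, since the real pager selector isn't given here. The fetch argument is any function that takes a URL and returns the page's HTML as a string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ListingParser(HTMLParser):
    """Collects (href, title) pairs from <h3><a href> plus the next-page link."""

    def __init__(self):
        super().__init__()
        self.articles = []      # (href, title) pairs found on this page
        self.next_href = None   # href of the next-page link, if one was seen
        self._in_h3 = False
        self._href = None       # href of the <a> currently open inside an <h3>
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h3":
            self._in_h3 = True
        elif tag == "a":
            if self._in_h3 and "href" in attrs:
                self._href, self._text = attrs["href"], []
            # Assumption: the pager marks its link with class="next".
            elif "next" in (attrs.get("class") or "").split():
                self.next_href = attrs.get("href")

    def handle_data(self, data):
        # Only keep text that sits inside an article link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.articles.append((self._href, "".join(self._text).strip()))
            self._href = None
        elif tag == "h3":
            self._in_h3 = False


def scrape(start_url, fetch, max_pages=100):
    """Follow next-page links from start_url and return all (url, title) pairs."""
    results, url, seen = [], start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        parser = ListingParser()
        parser.feed(fetch(url))
        # Resolve relative links against the page they came from.
        results += [(urljoin(url, href), title) for href, title in parser.articles]
        url = urljoin(url, parser.next_href) if parser.next_href else None
    return results
```

Against a live site, fetch could be something like `lambda u: urllib.request.urlopen(u).read().decode("utf-8")`. Passing the fetcher in keeps the loop easy to try against saved pages first, and the seen set and max_pages cap guard against a pager that links back to itself.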
Once the program finished running, I was presented with a big list of 347 lines containing links and article titles. That brought me to the next step.
Organizing the articles into categories
I didn't just want to present readers with a big, unsorted list of articles. I wanted readers interested in 3D printing to be able to find those articles, people interested in the broadband studio project to find that, and so forth. The challenge is that I don't write on these topics in order. For example, I've been writing about the broadband studio on and off for more than five years.
To aggregate the articles into topics, I used two very helpful tools: Evernote and BBEdit, a Mac-based text editor.
In Evernote, I created a new DIY-IT Project Guide notebook, with a page (a Note in Evernote terms) for each project. As I found articles for each project, I deleted those A HREF tags from the main block produced by my program and pasted them into the corresponding note in Evernote.
BBEdit has a feature that proved to be a huge time saver, which is why I chose that tool. BBEdit is known for its very powerful text processing capabilities, and the Process Lines Containing feature proved it. You choose Process Lines Containing, located under the Text menu. A dialog is presented, with two key options: Delete Matched Lines and Copy To Clipboard.
How this worked for me was simple. I used it to collect articles that might match the topic I was looking for. For example, I processed lines containing "studio," which found many of my broadband studio articles, deleted them out of the master list, and dropped them into the clipboard. I then pasted that into the appropriate Evernote note. I did the same with "gmail", "office 365", "google voice", and so on.
Rinse. Wash. Repeat.
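For readers without BBEdit, the same pull-out-the-matches idea can be sketched in a few lines of Python. This is only an approximation of what Process Lines Containing does with both options checked: the function names are hypothetical, and the case-insensitive matching is an assumption made so that a search for "gmail" also catches "Gmail" in a title.

```python
def process_lines_containing(lines, needle):
    """Split lines into (matched, remaining) by case-insensitive substring."""
    needle = needle.lower()
    matched = [line for line in lines if needle in line.lower()]
    remaining = [line for line in lines if needle not in line.lower()]
    return matched, remaining


def sort_into_topics(lines, keywords):
    """Apply each keyword in turn, pulling its matches out of the master list."""
    topics = {}
    for keyword in keywords:
        # Matches go to the topic; whatever is left feeds the next keyword.
        topics[keyword], lines = process_lines_containing(lines, keyword)
    return topics, lines
```

Because the match runs against the whole line, a keyword that appears in the URL but not the title still counts as a hit. Whatever sort_into_topics returns as its second value is the leftover list that has to be filed by hand.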
By the time I was done, Process Lines Containing had helped me bulk find and catalog probably 90 percent of the articles. I then went into what was left of the master list and moved the rest of the articles into their appropriate categories. I also deleted about ten articles that were just not project related in any way. Those didn't make it into the Project Guide.
I also used the Evernote note structure to write short introductions to each category and properly fine-tune the names of each category, so it would be easier for you to find the articles you're looking for.
Designing within CMS constraints
Next came the design challenge. I wasn't just designing a Web page here, where I could use whatever style or look I wanted. I needed to make sure the Project Guide worked as an article within our main CMS. That meant I had to work within considerable constraints.
I originally created each article category as an element of a list. You'd come into the Project Guide and see a long list of 22 category names. But once I put that together and looked at it, it became apparent almost immediately that everything blurred together. I needed a graphic style that would make it easy for readers to quickly glance through the list to find what they're looking for.
I decided a grid of graphics would work as the table of contents. These became a series of 22 images, each 320x40. Each topic itself would be headed by a graphic, which would be 640x80. That way, I could use the same graphic for both index entry and topic header - just sized differently.
I created a Photoshop template with layers for title, subtitle, and the image. Then, using GraphicStock (a great, low-cost image service) and PixaBay (a free image service), combined with some of my own images from various articles, I gathered together representative images. To make sure the text was clearly visible, I dropped a black rectangle over each image that I set to 50 percent opacity. That produced a slightly subdued image with vivid text.
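The arithmetic behind that overlay is simple. Assuming an ordinary (normal blend mode) layer, each channel of the result is the original value times one minus the overlay's opacity, because the black layer itself contributes zero. The function below is a hypothetical illustration of that formula, not anything Photoshop exposes.

```python
def overlay_black(pixel, opacity=0.5):
    """Blend a black layer at the given opacity over one RGB pixel."""
    # Black is (0, 0, 0), so only the underlying image contributes.
    return tuple(round(channel * (1 - opacity)) for channel in pixel)


print(overlay_black((200, 100, 50)))  # (100, 50, 25)
```

Every image pixel drops to half brightness, while text placed above the rectangle keeps its full value, which is where the contrast comes from.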
Putting it all together
If you look at the code at the end of this article, you'll notice that I had the program automatically wrap each link in an LI tag, which provides bullets for each article.
Constructing the full Project Guide consisted of building up the intro text, adding the series of index images, and then, for each topic, a banner image, the intro, and the set of LI and A HREF tags.
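The assembly step can be sketched as follows, again in Python rather than the PHP used for the real thing. The exact markup is an assumption: the function names, the enclosing UL, the P tag around the intro, and the image attributes are illustrative guesses, not the HTML the CMS ended up accepting.

```python
from html import escape


def list_item(url, title):
    """Wrap one article link in an LI tag."""
    return f'<li><a href="{escape(url)}">{escape(title)}</a></li>'


def topic_section(banner_src, intro, articles):
    """Build one topic: banner image, intro text, then the bulleted links."""
    items = "\n".join(list_item(url, title) for url, title in articles)
    return (
        f'<img src="{escape(banner_src)}" width="640" height="80">\n'
        f"<p>{escape(intro)}</p>\n"
        f"<ul>\n{items}\n</ul>"
    )
```

Escaping the titles matters more than it might seem: a headline containing an ampersand or an angle bracket would otherwise produce broken markup, and broken markup gives the CMS one more excuse to rearrange things.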
Here was where the CMS fought back. Content management systems are designed to normalize editorial content to fit a standard for the overall site. As a result, most good content management systems will clean up text entered by writers and editors, making a wide range of substitutions to bring the final text into compliance.
Sometimes, this also means converting pedestrian HTML into custom blocks of HTML that represent the internal architecture of the CMS. I ran afoul of this capability of our CMS when it moved paragraphs around, eliminated some of my DIVs, and fought back mightily against my grid of images.
Eventually, I got it all to mostly function. I had to give up on some styling because I don't have access to the site's CSS. Can you imagine if all of CBS News, CNET, and ZDNet's authors just randomly decided to style their own stuff? Yeah, that's why there's a CMS. In any case, I managed to coerce the system into getting close to what I wanted, and I was done.
The end result
I'm reasonably happy with the results, with some exceptions. Because the CMS has its own way of doing things, when I add new articles, I have to go into the HTML code and edit there. It's somewhat fragile, so I always have to pay extra attention in case something breaks. That adds time to keeping the guide updated, but it's worth it.
My biggest disappointment has to do with the behavior of anchor tags (you know, the # you insert in a URL to take you to a specific place on a page). I made sure to insert anchors for each project in the Project Guide's code. The idea was that I could then provide links to specific projects, and readers could simply click the link and be brought to that section of the guide.