How to Scrape Pages With ColdFusion

by Patrick Altoft / 17 responses

This is a guest post by Guy from nullamatix.com

With the exponential growth of the Internet, data harvesting has become increasingly popular in the last few years. Several web sites sell large databases of information relevant to lawyers, doctors, businesses, schools, just about anything imaginable.

After seeing all this content, I asked myself, “How is all this information compiled?” Surely some poor sap isn’t being paid to manually insert each record. With a little research, I was able to come up with a pretty simple solution using ColdFusion.

To keep things simple, we’re going to harvest data from articles-hub.com. First, open your favorite text editor and drop in the following code:

<cfhttp url="http://www.articles-hub.com/Article/700.html" method="GET">
<cfset sDoc = trim(cfhttp.fileContent)>

This tells ColdFusion to literally get the contents of the specified page, then store that content in a variable named sDoc.
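
A request can fail or stall, so it may be worth checking the response before trusting sDoc. Here is a minimal sketch, assuming a 30-second timeout is acceptable; the result name httpResult is a hypothetical choice and not part of the original script:

```cfml
<!--- Sketch: fetch the page with a timeout and store the response in a named struct --->
<cfhttp url="http://www.articles-hub.com/Article/700.html" method="GET" timeout="30" result="httpResult">

<!--- statusCode holds a string such as "200 OK"; only keep the body on success --->
<cfif left(httpResult.statusCode, 3) EQ "200">
  <cfset sDoc = trim(httpResult.fileContent)>
<cfelse>
  <cfset sDoc = "">
  <cfoutput>Request failed: #htmlEditFormat(httpResult.statusCode)#</cfoutput>
</cfif>
```

With an empty sDoc the regular expression simply finds nothing, so the rest of the script can run unchanged.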

The following bit of code is where the magic happens. If you’re unfamiliar with regular expressions, now is a great time to learn. Insert the following bit of code after the variable declaration mentioned above:

<cfset regExp = '<span class="article_display_title" >
        ([\s\S]*?)</span>[\s\S]*?<div align=[\s\S]*?
</div>
    ([\s\S]*?)
          </div>
            </div>'>

Without going into too much detail, this variable tells ColdFusion what to look for, and where. View the source code of the page defined above and go to line 1016. You’ll notice the span tag defined in regExp is on that line. When our application is executed, ColdFusion will begin searching sDoc for that tag. Once located, the data sitting in place of the first expression ([\s\S]*?) is captured as the first subexpression ($1), which is the article’s title. ColdFusion continues searching, and skips over everything matched by:

</span>[\s\S]*?<div align=[\s\S]*?</div>

until the next capturing expression, which holds the actual article content, is reached. Finally, the pattern ends when the two consecutive </div> tags are reached.

This should simplify the process of writing regular expressions. For any piece of information you want to store for later, use ([\s\S]*?). For anything you want to skip over, use [\s\S]*?.
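
The capture-versus-skip rule can be tried on a short string before pointing it at a real page. A small sketch; the sample markup and the variable names sample, pattern and m are invented for illustration:

```cfml
<!--- Sketch: two captured groups with a skipped stretch in between --->
<cfset sample = '<h1>Hello</h1><p>skip me</p><em>World</em>'>
<cfset pattern = '<h1>([\s\S]*?)</h1>[\s\S]*?<em>([\s\S]*?)</em>'>

<!--- The fourth argument asks for the positions and lengths of the subexpressions --->
<cfset m = REFindNoCase(pattern, sample, 1, true)>

<!--- pos[1]/len[1] describe the whole match; pos[2] and pos[3] are the two captures --->
<cfif m.pos[1] GT 0>
  <cfoutput>#mid(sample, m.pos[2], m.len[2])# / #mid(sample, m.pos[3], m.len[3])#</cfoutput>
</cfif>
<!--- Output: Hello / World --->
```

The paragraph between the two tags is matched but never stored, because its [\s\S]*? is not wrapped in parentheses.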

With our data sets defined, we can output the results into a nice, organized product. Drop in this code:

<cfset q_srch = queryNew("title, article")>
<cfset start = 1>
<cfloop condition="#start#">
  <cfset stResult = REfindNoCase(regExp,sDoc,start,"Yes")>
  <cfif stResult.pos[1]>
     <cfset queryAddRow(q_srch)>
     <cfset querySetCell(q_srch,"article",mid(sDoc,stResult.pos[3],stResult.len[3]))>
     <cfset querySetCell(q_srch,"title",mid(sDoc,stResult.pos[2],stResult.len[2]))>
  </cfif>
  <cfset start = stResult.pos[1] + stResult.len[1]>
</cfloop>

The code above tells ColdFusion to create a virtual query with two columns: title and article. Next, a starting position for the search is defined. The loop then begins searching sDoc with the regular expression defined above. Each match is parsed and stored in a new row under the respective columns, and the search resumes just past the end of that match; when no further match is found, start becomes 0 and the loop ends. We’re now ready to test our primitive data mining application.
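
The rows collected in a query like q_srch can be read back with a query loop. A small self-contained sketch; q_demo and its sample values are made up for illustration:

```cfml
<!--- Sketch: build a one-row query by hand, then read it back row by row --->
<cfset q_demo = queryNew("title,article")>
<cfset queryAddRow(q_demo)>
<cfset querySetCell(q_demo, "title", "Sample title")>
<cfset querySetCell(q_demo, "article", "Sample article text")>

<!--- currentRow is the row number ColdFusion tracks for each row of the loop --->
<cfoutput query="q_demo">
  Row #q_demo.currentRow#: #htmlEditFormat(q_demo.title)# (#len(q_demo.article)# characters)<br>
</cfoutput>
```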

Here’s how our application should look as of now:

<cfhttp url="http://www.articles-hub.com/Article/700.html" method="GET">
<cfset sDoc = trim(cfhttp.fileContent)>
<cfset regExp = '<span class="article_display_title" > 

        ([\s\S]*?)</span>[\s\S]*?<div align=[\s\S]*?
</div>
    ([\s\S]*?)
          </div>
            </div>'>
<cfset q_srch = queryNew("title, article")>
<cfset start = 1>
<cfloop condition="#start#">
  <cfset stResult = REfindNoCase(regExp,sDoc,start,"Yes")>
  <cfif stResult.pos[1]>
     <cfset queryAddRow(q_srch)>
     <cfset querySetCell(q_srch,"article",mid(sDoc,stResult.pos[3],stResult.len[3]))>
     <cfset querySetCell(q_srch,"title",mid(sDoc,stResult.pos[2],stResult.len[2]))>
  </cfif>
  <cfset start = stResult.pos[1] + stResult.len[1]>
</cfloop>
<cfdump var="#q_srch#">

Go ahead and save the file as miner.cfm, or whatever you’d like, and browse to that file in your web browser. For example, http://192.168.230.239:80/miner.cfm. The article’s title and content are displayed in an organized table.

Here’s a screenshot of data harvested from a site containing US college information:

[Image: US School Data]

Ok, that’s nice, but this information is totally useless unless we can dump it into a database, so here’s what we need to do.

After the </cfloop> tag, drop in a modified version of this code:

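<cfloop query="q_srch">
<cfquery name="insert_data" datasource="localdev">
INSERT INTO article_dump(title,content) VALUES(<cfqueryparam value="#q_srch.title#" cfsqltype="cf_sql_varchar">,<cfqueryparam value="#q_srch.article#" cfsqltype="cf_sql_longvarchar">)
</cfquery>
</cfloop>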

The value of datasource is specific to each system; localdev just so happens to be the name of my datasource. After defining the appropriate datasource, you can either create a table called article_dump with 3 columns (id, title, content), or use an already existing table. Just make sure to change the code where necessary. If you refresh miner.cfm in your browser, the data is not only displayed, but inserted into our database, too.
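
If you go with a new table, it could be created once with a query like the one below. This is only a sketch, assuming a MySQL database behind the localdev datasource; the column types are guesses, not taken from the original setup:

```cfml
<!--- Sketch: create the article_dump table once (MySQL syntax assumed) --->
<cfquery datasource="localdev">
CREATE TABLE IF NOT EXISTS article_dump (
  id INT AUTO_INCREMENT PRIMARY KEY,
  title VARCHAR(255),
  content MEDIUMTEXT
)
</cfquery>
```

Because id fills itself in, the INSERT only has to supply title and content.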

Let’s take this a step further, and automate the entire process. Go back to the top of miner.cfm and add the following code as the first line:

<cfloop from="500" to="5000" index="LoopCount">

Now replace 700.html on the second line with:

#LoopCount#.html

Scroll to the bottom and add the closing cfloop tag as the last line:

</cfloop>

We just told ColdFusion to visit 500.html, 501.html, 502.html, 503.html, etc., until 5000.html is reached, and to insert each set of results into the database before moving on to the next. With this short piece of code, I’ve created databases with over 20,000 records in less than an hour, and now you can, too.

Here’s the entire final product:

<cfloop from="500" to="5000" index="LoopCount">
<cfhttp url="http://www.articles-hub.com/Article/#loopcount#.html" method="GET">
<cfset sDoc = trim(cfhttp.fileContent)>
<cfset regExp = '<span class="article_display_title" > 

        ([\s\S]*?)</span>[\s\S]*?<div align=[\s\S]*?
</div>
    ([\s\S]*?)
          </div>
            </div>'>
<cfset q_srch = queryNew("title, article")>
<cfset start = 1>
<cfloop condition="#start#">
  <cfset stResult = REfindNoCase(regExp,sDoc,start,"Yes")>
  <cfif stResult.pos[1]>
     <cfset queryAddRow(q_srch)>
     <cfset querySetCell(q_srch,"article",mid(sDoc,stResult.pos[3],stResult.len[3]))>
     <cfset querySetCell(q_srch,"title",mid(sDoc,stResult.pos[2],stResult.len[2]))>
  </cfif>
  <cfset start = stResult.pos[1] + stResult.len[1]>
</cfloop>
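<cfloop query="q_srch">
<cfquery name="insert_data" datasource="localdev">
INSERT INTO article_dump(title,content) VALUES(<cfqueryparam value="#q_srch.title#" cfsqltype="cf_sql_varchar">,<cfqueryparam value="#q_srch.article#" cfsqltype="cf_sql_longvarchar">)
</cfquery>
</cfloop>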
</cfloop>
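
Requesting thousands of pages back to back can lead to timeouts or failed responses. Here is one possible variation of the outer loop, sketched under assumptions: the timeout values, the one-second pause and the httpResult name are illustrative, and sleep() requires ColdFusion 8 or later:

```cfml
<!--- Sketch: allow the whole request more time than the default (value is an assumption) --->
<cfsetting requestTimeout="10800">

<cfloop from="500" to="5000" index="LoopCount">
  <cfhttp url="http://www.articles-hub.com/Article/#LoopCount#.html" method="GET" timeout="30" result="httpResult">

  <!--- Only parse pages that actually came back with a 200 status --->
  <cfif left(httpResult.statusCode, 3) EQ "200">
    <cfset sDoc = trim(httpResult.fileContent)>
    <!--- run the regExp loop and the database insert from the listing above here --->
  </cfif>

  <!--- pause one second between requests; sleep() takes milliseconds --->
  <cfset sleep(1000)>
</cfloop>
```

The pause makes the run slower, but it is gentler on the remote server and makes intermittent failures less likely.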

Patrick Altoft is Director of Search at Branded3, a Leeds SEO & Digital Agency specialising in SEO, Web Design, Development & Social Media.

Comments

Read the 12 comments below, or add your own!

January 26, 2008 at 5:32am

what the heck!?

Great info

January 26, 2008 at 6:18am

I like this stuff because although it’s useless to anybody who doesn’t want to scrape with ColdFusion it is gold dust to the people who do.

January 26, 2008 at 9:46am

Well said. This little bit of code has near endless potential. If you’re unfamiliar with Adobe’s ColdFusion, I highly recommend the open-source, free alternative called “The Smith Project.” Just Google that phrase and check it out.

Set up IIS, Smith Project, and MySQL locally, and let the data harvesting begin :)

January 26, 2008 at 8:56am

Seems cool. Too bad I have 0 experience with coding.

January 28, 2008 at 4:49am

That’s the point of this tutorial.. is there something in particular you’re having difficulty with?

January 27, 2008 at 3:09am

A truly brilliant script! I’ve never done anything with ColdFusion before, but the language looks very powerful.

Jason
June 2, 2008 at 5:43am

Thanks for the great tutorial. However, I’m running through 2500 links and scraping data from each one. I have a database that holds all the URL’s and I loop through the query of those URL’s to get the data I need. However.. it eventually gets to a page where it says element at pos[2] cannot be found. So I check my RegExp’s on that page and everything runs smooth.. I run the program again and it stops on a different page. I’m thinking the page is timing out when coldfusion tries to request it.. perhaps because I’m requesting so many so fast. Any ideas?

June 2, 2008 at 12:36pm

Jason,

You were on track with examining the page’s source; the error tells me CF is unable to find the title or anchor text of the link in your case? To better understand the issue and hopefully resolve it, feel free to shoot me an email: my lastname @ nullamatix.com – (reformat accordingly, obviously).

Blogstorm readers fear not; I or Jason will follow-up with the solution (minus non-relevant details) once we’ve figured out a solution. I just wasn’t sure if Jason was comfortable having an open discussion here (or even via email) regarding his scraping project :P

-Guy

April 2, 2009 at 10:59am

thats what I’m talking about. thanks man !

antonio
January 22, 2010 at 11:41am

Great tutorial ! Thank you very much! ;)

Misty
July 28, 2010 at 6:12am

Hi can u please if we need to fetch out the tables, then how we can do!

Marc Williams Jr
September 2, 2010 at 9:16pm

This seems like a great script and we are reviewing it now. The question i have is that i need to scrape through more then one page. here is how it works
-I enter search criteria
-a page is returned with 25 of a possible 1000 results. I need them all
-I need to not only go through all of the 10 pages(100 per page) i need to click through each link to get more information
-The results page has the name and Id of what i need
-The link to another page has the email address
-i need all three elements to complete my data mining.

i noticed this was from 2009 but i am optimistic. thank you.
