How to Scrape Pages With ColdFusion

by Patrick Altoft on January 26, 2008

This is a guest post by Guy from nullamatix.com

With the exponential growth of the Internet, data harvesting has become increasingly popular in the last few years. Several web sites sell large databases of information relevant to lawyers, doctors, businesses, schools, just about anything imaginable.

After seeing all this content, I asked myself, “How is all this information compiled?” Surely some poor sap isn’t being paid to manually insert each record. With a little research, I was able to come up with a pretty simple solution using ColdFusion.

To keep things simple, we’re going to harvest data from articles-hub.com. First, open your favorite text editor and drop in the following code:

<cfhttp url="http://www.articles-hub.com/Article/700.html" method="GET">
<cfset sDoc = trim(cfhttp.fileContent)>

This tells ColdFusion to fetch the contents of the specified page with a GET request, then store that content, trimmed of leading and trailing whitespace, in a variable named sDoc.
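
A request can fail or come back as an error page, so it may be worth checking the response before parsing it. Here's a minimal sketch of that idea. The 30-second timeout and the pageOk variable name are my own assumptions, not part of the script above:

```cfml
<!--- Sketch only: the 30-second timeout and the pageOk name are assumptions --->
<cfhttp url="http://www.articles-hub.com/Article/700.html" method="GET" timeout="30" throwOnError="no">
<!--- cfhttp.statusCode holds a string such as "200 OK" --->
<cfset pageOk = left(cfhttp.statusCode, 3) EQ "200">
<cfif pageOk>
  <!--- Same assignment as in the tutorial, now only run on a successful response --->
  <cfset sDoc = trim(cfhttp.fileContent)>
<cfelse>
  <cfoutput>Request failed: #htmlEditFormat(cfhttp.statusCode)#</cfoutput>
</cfif>
```

With throwOnError set to "no", a failed request doesn't halt the page; the status string is simply reported and parsing is skipped.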

The next bit of code is where the magic happens. If you’re unfamiliar with regular expressions, now is a great time to learn. Insert the following after the variable declaration mentioned above:

<cfset regExp = '<span class="article_display_title" >
        ([\s\S]*?)</span>[\s\S]*?<div align=[\s\S]*?
</div>
    ([\s\S]*?)
          </div>
            </div>'>

Without going into too much detail, this variable tells ColdFusion what to look for, and where. View the source code of the page defined above and go to line 1016. You’ll notice the span tag defined in regExp is on that line. When our application is executed, ColdFusion will begin searching sDoc for that tag. Once located, the data sitting in place of the first expression ([\s\S]*?) is captured as $1, which is the article’s title. (In the loop further down it is read from position 2 of the result arrays, because position 1 holds the entire match.) ColdFusion continues searching, and skips over everything matched by:

</span>[\s\S]*?<div align=[\s\S]*?</div>

until the second capturing expression, which holds the actual article content, is reached. Finally, the match ends at the two consecutive </div> tags.

This should simplify writing your own regular expressions. For any piece of information you want to store for later, use ([\s\S]*?). For anything you want to skip over, use [\s\S]*?.
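
To see the capture-versus-skip idea in isolation, here's a small sketch that runs the same kind of pattern against a made-up string. The sample markup and the variable names (sample, pattern, m) are invented for illustration and have nothing to do with the real page:

```cfml
<!--- Sketch only: sample, pattern and m are made-up names for illustration --->
<cfset sample = '<h1>Hello</h1><p>skip me</p><div>World</div>'>
<!--- Two capturing groups, with a non-capturing skip between them --->
<cfset pattern = '<h1>([\s\S]*?)</h1>[\s\S]*?<div>([\s\S]*?)</div>'>
<!--- The fourth argument asks for the pos/len arrays instead of a single position --->
<cfset m = REFindNoCase(pattern, sample, 1, true)>
<cfif m.pos[1] GT 0>
  <cfoutput>
    <!--- Index 1 is the whole match, so the first capture lives at index 2 --->
    First capture: #mid(sample, m.pos[2], m.len[2])#<br>
    Second capture: #mid(sample, m.pos[3], m.len[3])#
  </cfoutput>
</cfif>
```

The paragraph between the heading and the div is matched by the skip expression, so it never shows up in either capture.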

With our data sets defined, we can output the results into a nice, organized product. Drop in this code:

<cfset q_srch = queryNew("title, article")>
<cfset start = 1>
<cfloop condition="#start#">
  <cfset stResult = REfindNoCase(regExp,sDoc,start,"Yes")>
  <cfif stResult.pos[1]>
     <cfset queryAddRow(q_srch)>
     <cfset querySetCell(q_srch,"article",mid(sDoc,stResult.pos[3],stResult.len[3]))>
     <cfset querySetCell(q_srch,"title",mid(sDoc,stResult.pos[2],stResult.len[2]))>
  </cfif>
  <cfset start = stResult.pos[1] + stResult.len[1]>
</cfloop>

The code above tells ColdFusion to create a virtual query with two columns: title and article. Next, a starting position for the search is defined. The loop then searches sDoc from that position using the regular expression defined above. Each match adds a new row, with the captured title and article stored in their respective columns, and the search resumes just past the end of that match. When no further match is found, start becomes 0 and the loop ends. We’re now ready to test our primitive data mining application.

Here’s how our application should look as of now:

<cfhttp url="http://www.articles-hub.com/Article/700.html" method="GET">
<cfset sDoc = trim(cfhttp.fileContent)>
<cfset regExp = '<span class="article_display_title" > 

        ([\s\S]*?)</span>[\s\S]*?<div align=[\s\S]*?
</div>
    ([\s\S]*?)
          </div>
            </div>'>
<cfset q_srch = queryNew("title, article")>
<cfset start = 1>
<cfloop condition="#start#">
  <cfset stResult = REfindNoCase(regExp,sDoc,start,"Yes")>
  <cfif stResult.pos[1]>
     <cfset queryAddRow(q_srch)>
     <cfset querySetCell(q_srch,"article",mid(sDoc,stResult.pos[3],stResult.len[3]))>
     <cfset querySetCell(q_srch,"title",mid(sDoc,stResult.pos[2],stResult.len[2]))>
  </cfif>
  <cfset start = stResult.pos[1] + stResult.len[1]>
</cfloop>
<cfdump var="#q_srch#">

Go ahead and save the file as miner.cfm, or whatever you’d like, and browse to that file in your web browser. For example, http://192.168.230.239:80/miner.cfm. The article’s title and content are displayed in an organized table.
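
cfdump is handy for debugging, but the same kind of query can also be written out as ordinary markup. A minimal sketch, using a hand-built query named q_demo (a hypothetical stand-in for q_srch) so it runs on its own; the h2 and div wrappers are just illustrative choices:

```cfml
<!--- Sketch only: q_demo is a hypothetical stand-in for q_srch --->
<cfset q_demo = queryNew("title,article")>
<cfset queryAddRow(q_demo)>
<cfset querySetCell(q_demo, "title", "Sample title")>
<cfset querySetCell(q_demo, "article", "<p>Sample body</p>")>
<!--- cfoutput with a query attribute repeats its body once per row --->
<cfoutput query="q_demo">
  <!--- htmlEditFormat escapes any markup left in the captured title --->
  <h2>#htmlEditFormat(q_demo.title)#</h2>
  <!--- The article body is scraped HTML, so it is written out unescaped --->
  <div>#q_demo.article#</div>
</cfoutput>
```

Swapping q_demo for q_srch at the bottom of miner.cfm should give the same result for the scraped data.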

Here’s a screen shot of data harvested from a site containing US College information:

[Screenshot: US School Data]

Ok, that’s nice, but this information is totally useless unless we can dump it into a database, so here’s what we need to do.

After the </cfloop> tag, drop in a modified version of this code:

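<cfif q_srch.recordCount>
<cfquery name="insert_data" datasource="localdev">
INSERT INTO article_dump(title,content)
VALUES(<cfqueryparam value="#q_srch.title#" cfsqltype="cf_sql_varchar">, <cfqueryparam value="#q_srch.article#" cfsqltype="cf_sql_longvarchar">)
</cfquery>
</cfif>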

The value of datasource is specific to each system; localdev just happens to be the name of my datasource. After defining the appropriate datasource, you can either create a table called article_dump with 3 columns (id, title, content), or use an existing table. Just make sure to change the code where necessary. If you refresh miner.cfm in your browser, the data is not only displayed, but inserted into our database, too.
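
If you don’t have a table yet, something like the following could create one. This is only a sketch: it assumes a MySQL datasource, and the column types and the create_table query name are my guesses rather than anything the script requires:

```cfml
<!--- Sketch only: assumes a MySQL datasource; column types are guesses --->
<cfquery name="create_table" datasource="localdev">
CREATE TABLE article_dump (
  id INT AUTO_INCREMENT PRIMARY KEY,
  title VARCHAR(255),
  content TEXT
)
</cfquery>
```

Run it once from a separate .cfm file, not from miner.cfm, since creating the same table a second time raises an error.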

Let’s take this a step further, and automate the entire process. Go back to the top of miner.cfm and add the following code as the first line:

<cfloop from="500" to="5000" index="LoopCount">

Now replace 700.html on the second line with:

#LoopCount#.html

Scroll to the bottom and add the closing cfloop tag as the last line:

</cfloop>

We just told ColdFusion to visit 500.html, 501.html, 502.html, 503.html, and so on until 5000.html is reached, inserting each set of results into the database before moving on to the next page. With this short piece of code, I’ve created databases with over 20,000 records in less than an hour, and now you can, too.
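
Firing off thousands of requests back to back can run into timeouts, so it may help to pace the loop. Here's a sketch of one way to do that. The one-second pause, the 30-second request timeout and the two-hour page timeout are all assumptions, and sleep() needs ColdFusion 8 or later:

```cfml
<!--- Sketch only: every timing value below is an assumption --->
<!--- Let this page run for up to two hours instead of the server default --->
<cfsetting requestTimeout="7200">
<cfloop from="500" to="5000" index="LoopCount">
  <cfhttp url="http://www.articles-hub.com/Article/#LoopCount#.html" method="GET" timeout="30" throwOnError="no">
  <cfif left(cfhttp.statusCode, 3) EQ "200">
    <!--- Parse cfhttp.fileContent and insert the row here --->
  </cfif>
  <!--- sleep() takes milliseconds and requires ColdFusion 8 or later --->
  <cfset sleep(1000)>
</cfloop>
```

At one second per page the full range takes well over an hour, so shorten the pause or the range to suit the site you’re harvesting.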

Here’s the entire final product:

<cfloop from="500" to="5000" index="LoopCount">
<cfhttp url="http://www.articles-hub.com/Article/#LoopCount#.html" method="GET">
<cfset sDoc = trim(cfhttp.fileContent)>
<cfset regExp = '<span class="article_display_title" > 

        ([\s\S]*?)</span>[\s\S]*?<div align=[\s\S]*?
</div>
    ([\s\S]*?)
          </div>
            </div>'>
<cfset q_srch = queryNew("title, article")>
<cfset start = 1>
<cfloop condition="#start#">
  <cfset stResult = REfindNoCase(regExp,sDoc,start,"Yes")>
  <cfif stResult.pos[1]>
     <cfset queryAddRow(q_srch)>
     <cfset querySetCell(q_srch,"article",mid(sDoc,stResult.pos[3],stResult.len[3]))>
     <cfset querySetCell(q_srch,"title",mid(sDoc,stResult.pos[2],stResult.len[2]))>
  </cfif>
  <cfset start = stResult.pos[1] + stResult.len[1]>
</cfloop>
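<cfif q_srch.recordCount>
<cfquery name="insert_data" datasource="localdev">
INSERT INTO article_dump(title,content)
VALUES(<cfqueryparam value="#q_srch.title#" cfsqltype="cf_sql_varchar">, <cfqueryparam value="#q_srch.article#" cfsqltype="cf_sql_longvarchar">)
</cfquery>
</cfif>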
</cfloop>

{ 9 comments… read them below or add one }

Desmond 26 Jan 2008 at 5:32 am

what the heck!?

Great info

Patrick Altoft 26 Jan 2008 at 6:18 am

I like this stuff because although it’s useless to anybody who doesn’t want to scrape with ColdFusion it is gold dust to the people who do.

Sammy Ashouri 26 Jan 2008 at 8:56 am

Seems cool. Too bad I have 0 experience with coding.

Guy Patterson 26 Jan 2008 at 9:46 am

Well said. This little bit of code has near endless potential. If you’re unfamiliar with Adobe’s ColdFusion, I highly recommend the open-source, free alternative called “The Smith Project.” Just Google that phrase and check it out.

Set up IIS, Smith Project, and MySQL locally, and let the data harvesting begin :)

Howard Young 27 Jan 2008 at 3:09 am

A truly brilliant script! I’ve never done anything with ColdFusion before, but the language looks very powerful.

Guy Patterson 28 Jan 2008 at 4:49 am

That’s the point of this tutorial.. is there something in particular you’re having difficulty with?

Jason 02 Jun 2008 at 5:43 am

Thanks for the great tutorial. However, I’m running through 2500 links and scraping data from each one. I have a database that holds all the URL’s and I loop through the query of those URL’s to get the data I need. However.. it eventually gets to a page where it says element at pos[2] cannot be found. So I check my RegExp’s on that page and everything runs smooth.. I run the program again and it stops on a different page. I’m thinking the page is timing out when coldfusion tries to request it.. perhaps because I’m requesting so many so fast. Any ideas?

Guy Patterson 02 Jun 2008 at 12:36 pm

Jason,

You were on track with examining the page’s source; the error tells me CF is unable to find the title or anchor text of the link in your case? To better understand the issue and hopefully resolve it, feel free to shoot me an email: my lastname @ nullamatix.com – (reformat accordingly, obviously).

Blogstorm readers fear not; I or Jason will follow-up with the solution (minus non-relevant details) once we’ve figured out a solution. I just wasn’t sure if Jason was comfortable having an open discussion here (or even via email) regarding his scraping project :P

-Guy

b sizzle 02 Apr 2009 at 10:59 am

thats what I’m talking about. thanks man !
