One of the most common data integration tasks I run into is moving data from a file into a database table. Generally the users are familiar with their data, the structure of the file, and the database table, but are unfamiliar with data integration tools and therefore view the task as difficult. What these users really need is a point and click approach that minimizes the learning curve for the data integration tool. This is what CSVexpress (www.CSVexpress.com) is all about! It is based on expressor Studio, a data integration tool I’ve been reviewing over the last several months.
With CSVexpress, moving data between data sources can be as simple as providing the database connection details, describing the structure of the incoming and outgoing data and then connecting two pre-programmed operators. There’s no need to learn the intricacies of the data integration tool or to write code. Let’s look at an example.
Suppose I have a comma separated value data file with data similar to the following, which is a listing of terminated employees that includes their hiring and termination date, department, job description, and final salary.
EMP_ID,STRT_DATE,END_DATE,JOB_ID,DEPT_ID,SALARY
102,13-JAN-93,24-JUL-98 17:00,Programmer,60,"$85,000"
101,21-SEP-89,27-OCT-93 17:00,Account Representative,110,"$65,000"
103,28-OCT-93,15-MAR-97 17:00,Account Manager,110,"$75,000"
304,17-FEB-96,19-DEC-99 17:00,Marketing,20,"$45,000"
333,24-MAR-98,31-DEC-99 17:00,Data Entry Clerk,50,"$35,000"
100,17-SEP-87,17-JUN-93 17:00,Administrative Assistant,90,"$40,000"
334,24-MAR-98,31-DEC-98 17:00,Sales Representative,80,"$40,000"
400,01-JAN-99,31-DEC-99 17:00,Sales Manager,80,"$55,000"
Notice the concise format used for the date values, the fact that the termination date includes both date and time information, and that the salary is clearly identified as money by the dollar sign and digit grouping. In moving this data to a database table I want to express the dates using a format that includes the century since it’s obvious that this listing could include employees who left the company in both the 20th and 21st centuries, and I want the salary to be stored as a decimal value without the currency symbol and grouping character. Most data integration tools would require coding within a transformation operation to effect these changes, but not expressor Studio. Directives for these modifications are included in the description of the incoming data.
Besides starting the expressor Studio tool and opening a project, the first step is to create connection artifacts, which describe to expressor where data is stored. For this example, two connection artifacts are required: a file connection, which encapsulates the file system location of my file; and a database connection, which encapsulates the database connection information. With expressor Studio, I use wizards to create these artifacts.
First, I click New Connection > File Connection in the Home tab of expressor Studio’s ribbon bar, which starts the File Connection wizard. In the first window, I enter the path to the directory that contains the input file. Note that the file connection artifact only specifies the file system location, not the name of the file.
Then I click Next and enter a meaningful name for this connection artifact; clicking Finish closes the wizard and saves the artifact.
To create the Database Connection artifact, I must know the location, or instance name, of the target database and have the credentials of an account with sufficient privileges to write to the target table. To use expressor Studio’s features to the fullest, this account should also have the authority to create a table.
I click New Connection > Database Connection in the Home tab of expressor Studio’s ribbon bar, which starts the Database Connection wizard. expressor Studio includes high-performance drivers for many relational database management systems, so I can simply make a selection from the “Supplied database drivers” drop down control. If my desired RDBMS isn’t listed, I can instead use an existing ODBC DSN by selecting the “Existing DSN” radio button.
In the following window, I enter the connection details. With Microsoft SQL Server, I may choose to use Windows Authentication rather than account credentials. After clicking Next, I enter a meaningful name for this connection artifact; clicking Finish closes the wizard and saves the artifact.
Now I create a schema artifact, which describes the structure of the file data. When expressor reads a file, all data fields are typed as strings. In some use cases this may be exactly what is needed and there is no need to edit the schema artifact. But in this example, I will edit the schema artifact to specify how the data should be transformed; that is, reformat the dates to include century designations, change the employee and department IDs to integers, and convert the salary to a decimal value.
Again, a wizard is used to create the schema artifact. I click New Schema > Delimited Schema in the Home tab of expressor Studio’s ribbon bar, which starts the schema wizard. In the first window, I click Get Data from File, which then displays a listing of the file connections in the project. When I click on the file connection I previously created, a browse window opens to this file system location; I then select the file and click Open, which imports 10 lines from the file into the wizard.
I now view the file’s content and confirm that the appropriate delimiter characters are selected in the “Field Delimiter” and “Record Delimiter” drop down controls; then I click Next.
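For readers who want to see why the delimiter choice matters, here is a minimal sketch in Python (not expressor Studio, and not how the tool works internally) that splits a few lines of the sample file using a comma as the field delimiter and a newline as the record delimiter. The function name read_records is my own invention for illustration. The point is that the quoted salary, which itself contains a comma, must come through as a single field.

```python
import csv
import io

# Three lines copied from the sample file shown earlier.
SAMPLE = (
    "EMP_ID,STRT_DATE,END_DATE,JOB_ID,DEPT_ID,SALARY\n"
    '102,13-JAN-93,24-JUL-98 17:00,Programmer,60,"$85,000"\n'
    '101,21-SEP-89,27-OCT-93 17:00,Account Representative,110,"$65,000"\n'
)


def read_records(text, field_delimiter=","):
    """Split text into records and fields, honoring double-quoted fields."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=field_delimiter)
    # Drop completely empty records (for example, a trailing blank line).
    return [row for row in reader if row]


rows = read_records(SAMPLE)
header, first = rows[0], rows[1]
print(header)
print(first)
```

Every field arrives as a string at this point, which mirrors the statement above that file data is initially typed as strings.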
Since the input file includes a header row, I can easily indicate that fields in the file should be identified through the corresponding header value by clicking “Set All Names from Selected Row.” Alternatively, I could enter a different identifier into the Field Details > Name text box. I click Next and enter a meaningful name for this schema artifact; clicking Finish closes the wizard and saves the artifact.
Now I open the schema artifact in the schema editor. When I first view the schema’s content, I note that the types of all attributes in the Semantic Type (the right-hand panel) are strings and that the attribute names are the same as the field names in the data file. To change an attribute’s name and type, I highlight the attribute and click Edit in the Attributes grouping on the Schema > Edit tab of the editor’s ribbon bar. This opens the Edit Attribute window; I can change the attribute name and select the desired type from the “Data type” drop down control. In this example, I change the name of each attribute to the name of the corresponding database table column (EmployeeID, StartingDate, TerminationDate, JobDescription, DepartmentID, and FinalSalary). Then for the EmployeeID and DepartmentID attributes, I select Integer as the data type, for the StartingDate and TerminationDate attributes, I select Datetime as the data type, and for the FinalSalary attribute, I select the Decimal type.
But I can do much more in the schema editor. For the datetime attributes, I can set a constraint that ensures that the data adheres to some predetermined specifications; a starting date must be later than January 1, 1980 (the date on which the company began operations) and a termination date must be earlier than 11:59 PM on December 31, 1999. I simply select the appropriate constraint and enter the value (1980-01-01 00:00 as the starting date and 1999-12-31 23:59 as the termination date).
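To make the two constraints concrete, here is a rough hand-coded equivalent in Python. It is only a sketch of the rule described above, not of how expressor Studio evaluates constraints; the constant and function names are hypothetical, and I am assuming that 11:59 PM is written as 23:59 on a 24-hour clock.

```python
from datetime import datetime

# Bounds taken from the text; the names are illustrative only.
EARLIEST_START = datetime(1980, 1, 1, 0, 0)
LATEST_TERMINATION = datetime(1999, 12, 31, 23, 59)  # 11:59 PM


def passes_constraints(starting_date, termination_date):
    """Return True when both datetime constraints hold.

    The starting date must be strictly later than the lower bound and the
    termination date strictly earlier than the upper bound.
    """
    return (
        starting_date > EARLIEST_START
        and termination_date < LATEST_TERMINATION
    )


# Employee 102 from the sample file, with the century already resolved.
print(passes_constraints(datetime(1993, 1, 13), datetime(1998, 7, 24, 17, 0)))
```

A record for which this check is false is the kind of record that the Read File operator is later told to skip.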
As a last step in setting up these datetime conversions, I edit the mapping, describing the format of each datetime type in the source file.
I highlight the mapping line for the StartingDate attribute and click Edit Mapping in the Mappings grouping on the Schema > Edit tab of the editor’s ribbon bar. This opens the Edit Mapping window in which I either enter, or select, a format that describes how the datetime values are represented in the file. Note the use of Y01 as the syntax for the year. This syntax is the indicator to expressor Studio to derive the century by setting any year later than 01 to the 20th century and any year before 01 to the 21st century. As each datetime value is read from the file, the year values are transformed into century and year values.
For the TerminationDate attribute, my format also indicates that the datetime value includes hours and minutes.
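For comparison, this is roughly what the century derivation would look like if I had to code it by hand in Python. It is a sketch under stated assumptions: the function name and the MONTHS table are mine, the pivot value of 1 stands in for the Y01 syntax, and the treatment of a year exactly equal to the pivot (placed in the 20th century here) is my assumption rather than documented expressor behavior.

```python
from datetime import datetime

# English month abbreviations as they appear in the sample file.
MONTHS = {
    "JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
    "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12,
}


def parse_file_datetime(text, pivot=1):
    """Parse 'DD-MON-YY' or 'DD-MON-YY HH:MM' into a four-digit-year datetime.

    Assumed pivot rule: two-digit years at or after the pivot fall in the
    1900s; years before the pivot fall in the 2000s.
    """
    date_part, _, time_part = text.strip().partition(" ")
    day, month, yy = date_part.split("-")
    two_digit_year = int(yy)
    century = 1900 if two_digit_year >= pivot else 2000
    if time_part:
        hour_text, minute_text = time_part.split(":")
        hour, minute = int(hour_text), int(minute_text)
    else:
        hour, minute = 0, 0
    return datetime(century + two_digit_year, MONTHS[month.upper()],
                    int(day), hour, minute)


print(parse_file_datetime("13-JAN-93"))        # starting date, no time part
print(parse_file_datetime("24-JUL-98 17:00"))  # termination date with time
```

With a pivot of 1, every date in the sample file lands in the 20th century, while a value such as 15-MAR-00 would land in the year 2000.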
And now to the FinalSalary attribute. I open its mapping and in the Edit Mapping window select the Currency tab and the “Use currency” check box. This indicates that the file data will include the dollar sign (or in Europe the pound or euro sign), which should be removed.
And on the Grouping tab, I select the “Use grouping” check box and enter 3 into the “Group size” text box, a comma into the “Grouping character” text box, and a decimal point into the “Decimal separator” text box.
These entries allow the string to be properly converted into a decimal value.
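As an illustration of what those settings accomplish, here is a small Python sketch of the same string-to-decimal conversion. The function parse_money and its parameter names are hypothetical; they simply mirror the currency symbol, group size, grouping character, and decimal separator entered in the mapping, and the validation rules are my own guess at sensible behavior.

```python
from decimal import Decimal


def parse_money(text, currency="$", grouping=",", decimal_separator=".",
                group_size=3):
    """Convert a string such as '$85,000' into Decimal('85000')."""
    cleaned = text.strip()
    if cleaned.startswith(currency):
        cleaned = cleaned[len(currency):]
    whole, _, fraction = cleaned.partition(decimal_separator)
    groups = whole.split(grouping)
    if not all(group.isdigit() for group in groups):
        raise ValueError(f"not a number: {text!r}")
    if fraction and not fraction.isdigit():
        raise ValueError(f"bad fractional part: {text!r}")
    if len(groups) > 1:
        # The leading group may be short; every later group must be full size.
        if len(groups[0]) > group_size or any(
            len(group) != group_size for group in groups[1:]
        ):
            raise ValueError(f"bad digit grouping: {text!r}")
    digits = "".join(groups)
    return Decimal(digits + ("." + fraction if fraction else ""))


print(parse_money("$85,000"))
```

Using Decimal rather than a floating point type keeps the salary exact, which is usually what is wanted for monetary values.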
By making these entries into the schema that describes my input file, I’ve specified how I want the data transformed prior to writing to the database table and completely removed the requirement for coding within the data integration application itself.
Assembling the data integration application is simple. Onto the canvas I drag the Read File and Write Table operators, connecting the output of the Read File operator to the input of the Write Table operator.
Next, I select the Read File operator and its Properties panel opens on the right-hand side of expressor Studio. For each property, I can select an appropriate entry from the corresponding drop down control. Clicking on the button to the right of the “File name” text box opens the file system location specified in the file connection artifact, allowing me to select the appropriate input file. I indicate also that the first row in the file, the header row, should be skipped, and that any record that fails one of the datetime constraints should be skipped.
I then select the Write Table operator and in its Properties panel specify the database connection, normal for the “Mode,” and the “Truncate” and “Create Missing Table” options. If my target table does not yet exist, expressor will create the table using the information encapsulated in the schema artifact assigned to the operator.
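To show what the “Truncate” and “Create Missing Table” options amount to, here is a sketch using Python’s built-in sqlite3 module as a stand-in for the target RDBMS. Everything here is an assumption made for illustration: the table name TerminatedEmployees is hypothetical, SQLite uses DELETE where other databases offer TRUNCATE, and the salary is stored as text only because SQLite has no fixed-point decimal type.

```python
import sqlite3

# Hypothetical table name; the columns follow the attribute names used above.
TABLE = "TerminatedEmployees"


def write_table(conn, rows, truncate=True, create_missing=True):
    """Optionally create and empty the target table, then insert the rows."""
    if create_missing:
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {TABLE} ("
            "EmployeeID INTEGER, StartingDate TEXT, TerminationDate TEXT, "
            "JobDescription TEXT, DepartmentID INTEGER, FinalSalary TEXT)"
        )
    if truncate:
        # SQLite has no TRUNCATE statement; DELETE without WHERE empties it.
        conn.execute(f"DELETE FROM {TABLE}")
    conn.executemany(f"INSERT INTO {TABLE} VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
rows = [
    (102, "1993-01-13 00:00", "1998-07-24 17:00", "Programmer", 60, "85000"),
    (101, "1989-09-21 00:00", "1993-10-27 17:00",
     "Account Representative", 110, "65000"),
]
write_table(conn, rows)
write_table(conn, rows)  # the second run truncates first, so no duplicates
count = conn.execute(f"SELECT COUNT(*) FROM {TABLE}").fetchone()[0]
print(count)
```

Running the load twice leaves the same two rows in the table, which is the practical benefit of truncating before each write.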
The last task needed to complete the application is to create the schema artifact used by the Write Table operator. This is extremely easy, as another wizard can use the schema artifact assigned to the Read File operator to create a schema artifact for the Write Table operator. In the Write Table Properties panel, I click the drop down control to the right of the “Schema” property and select “New Table Schema from Upstream Output…” from the drop down menu.
The wizard first displays the table description and in its second screen asks me to select the database connection artifact that specifies the RDBMS in which the target table will exist. The wizard then connects to the RDBMS and retrieves a list of database schemas from which I make a selection. The fourth screen gives me the opportunity to fine tune the table’s description. In this example, I set the width of the JobDescription column to a maximum of 40 characters and select money as the type of the FinalSalary column. I also provide the name for the table.
This completes development of the application. The entire application was created through the use of wizards and the required data transformations specified through simple constraints and specifications rather than through coding. To develop this application, I only needed a basic understanding of expressor Studio, a level of expertise that can be gained by working through a few introductory tutorials. expressor Studio is as close to a point and click data integration tool as one could want and I urge you to try this product if you have a need to move data between files or from files to database tables.
Check out CSVexpress in more detail. It offers a few basic video tutorials and a preview of expressor Studio 3.5, which will support the reading and writing of data into Salesforce.com.
Reference: Pinal Dave (http://blog.SQLAuthority.com)