Blog Archives

How to get latest file from directory

This post I will describe, how to get most resent file from directory based on display date or a date from file name.

We have below sample files in our directory and every file has date in the name of file, based on that we will decide which file is most resent rather than file created date/ modified date.

Sample xls files

Sample xls files

As you can see, we have three list of files.

  1. sales element 11 2014.xls has been modified at 28-01-2015
  2. sales element 02 2015.xls has been modified at 28-01-2015
  3. sales element 12 2014.xls has been modified at 03-03-2015

If we use file created or modified date to get most resent file then we will get ” sales element 12 2014.xls” which is a wrong file.

To get a latest file from directory we will use below steps.

Step 1: Add tFileList component and configure it get all .xls files form directory. see the image for details.

tFileList Configuration

tFileList Settings

Step 2: Add tFileProperties component and connect with tFileList using Iterator link, then provide file path and name from global variable. which looks like this ((String)globalMap.get(“tFileList_1_CURRENT_FILEPATH”)).

tFileProperties Setting

tFileProperties Setting

Step 3: Add tMap after tFileProperties and connect with main link and do the fowling setting in it.

  • Create output name as “FileList”.
  • Add all the source columns to this output.
  • Add new variable in tMap using variable creation, write this code in it.

row9.basename.substring(row9.basename.indexOf(“.”)-7).replace(“.xls”, “”)

  • Create new column in output with the name “DisplayDate” and datatype is Date.
  • Add below code in it.

TalendDate.parseDate(“MM yyyy”, Var.var1)

  • See the image for more details.
tMap Setting

tMap Setting

Step 4: Add tHashOutput component after tMap and connect with main link.

Step 5:  Add tHashInput Component below tFileList and link using “OnSubJobOk” trigger.

Step 6: Copy Schema from tHashOutput to tHashInput.

Step 7: Add tAggregateRow component and connect with tHashInput using main flow link. Do the basic setting like below.

tAggregateRow Setting

tAggregateRow Setting

Step 8: Add tLogRow to check the result. you will see the output as below.

Output Result

Output Result

Step 9: Your job design should be look like in below Image.

Final Job Design

Final Job Design

Note: You can avoid using tHash***** components just use tAggregateRow after tMap and do the setting as is, it will work.

Validate CSV headers in Talend.

File processing is a day to day task in ETL world, and there is huge need of validation regarding source file format, headers, footers, column name, data type and so on, thanks to tSchemaComplianceCheck component which can do most of the validation like.

  • Length checking.
  • Date pattern/format.
  • Data Types

But does not support number of columns and column sequence validation, that which we have to manage using java code, in this post I will describe you how to validate column names and their sequence.

This is our final job design.

Complete CSV Validation Job

Complete CSV Validation Job

Let’s start with adding first component to job designer. Add tFileList and configure to get expected files.

Add tFileInputFullRow component and configure as shown in below screen.

tFileInputFullRow Configuration

tFileInputFullRow Configuration

  • Add tMap and connect with main link from tFileinputFullRow component.
  • Add tFixedFlowInput and connect with lookup link to tMap then configure as follows.

Note: if you have your refrence header row stored in file or database you can use it instead of tFixedFlowInput.

tFixedFlowInput Configuration

tFixedFlowInput Configuration

  • Configure tMap as follows.
    • Make inner join with your reference line and main line of input.
    • Add two outputs and named it as “matching” and “reject” respectively.
    • In the reject output click on setting “catch lookup inner join reject”=true
    • Add source line to both the flows.

See image for more details.

tMap Setting

tMap Setting

  • Add tJava next to tMap and connect with “matching” flow.
  • Add another tJava next to tMap and connect with “reject” flow.
  • Add tFileInputDelimited and connect with first tjava using “iterate” flow.
  • Configure tFileInputDelimited as shown in below image.

Add tLogRow component to see the output from file.

tFileInputDelimited Configuration

tFileInputDelimited Configuration

You can see that for each file whole sub job will be executed if it is matching with header row then it will be used for reading.

You can connect reject row to make a note of rejected file based on your requirement.

Difference between tMap and tJoin

tMap is frequently used component for joins and lookup purpose, it is also use for verity of operations and transformations, whereas tJoin is used for join and lookups only.

tMap

tJoin

It accepts more than one input one is main and rests of the lookups.

It accepts only two inputs and only one is main and other one is lookup.

We can create more than one output

It has two default outputs one is “Main” and another one is ” Inner join reject”

tMap has “inner join ” and ” left outer join” joining model

tJoin offer`s only “inner join”

tMap offers three match model

  1. Unique Match
  2. First Match
  3. All Matches

tJoin defaulted with Unique match

tMap allows to store data on file option for lookup data processing

tJoin doesn`t offer this feature

In tMap you can filter data using filter expression

tJoin doesn`t offer this feature

You can write transformation using expression builder at each column level

tJoin doesn`t offer this feature