Importing Excel Data into SQL Server Via SSIS: Questions You Were Too Shy to Ask
The Questions
- “When should I use an Excel connection manager rather than an Excel Source component?”
- “I’ve tried changing the data types that the Excel Source component assigns to the incoming data but have had no luck. How do you control data types in the source component?”
- “My SSIS package retrieves data from an Excel worksheet and inserts the data into a SQL Server table. My data flow shows a warning message indicating that data could be truncated; however, the package runs with no problem and loads the data into the table. Do I need to do anything in response to the warning message?”
- “How do I limit my result set to specific columns when retrieving data from an Excel workbook?”
- “I sometimes receive a warning on my Excel Source component saying that the external columns are out of sync with the data source columns. Any idea what’s going on?”
- “One of the columns in the Excel worksheet I use as a data source contains NULL values for some rows. How do I send those rows to a text file, rather than loading them into the target database table?”
- “I built an SSIS package that retrieves data from an Excel workbook. However, when I try to debug the package I receive an error and a suggestion that I run the package in 32-bit mode. How do I do that?”
- “How do you import a range of cells from an Excel worksheet, rather than all the data on the worksheet?”
- “My SSIS package includes a Data Conversion transformation for reducing the size of the Unicode string columns that are retrieved from an Excel spreadsheet. Is there a way to redirect rows that contain values too large for the new column size?”
- “How do I join data from multiple worksheets within a single Excel file?”
- “I’m using SSIS to retrieve data from an Excel worksheet and inserting the data into a SQL Server table. The table includes several VARCHAR columns. I’m receiving an error message saying that I cannot convert Unicode data to non-Unicode data. What steps do I have to take to transform the data?”
- “In my data flow, I’m converting data I retrieve from an Excel spreadsheet to the VARCHAR data type. How do I handle errors for the data that will not convert?”
- “How do I retrieve data from multiple Excel workbook files?”
“When should I use an Excel connection manager rather than an Excel Source component?”
It’s not a question of one over the other. If your SSIS package retrieves data from an Excel workbook, you need both the connection manager and the source component. The connection manager serves as the interface between the package and the Excel file, or more precisely, the source component uses the connection manager to access the data in the Excel file.
When setting up the connection manager, you need only provide the path and file name for the Excel workbook, select the Excel version, and specify whether the first row should be treated as column names, as shown in Figure 1.
Figure 1:Configuring anExcelconnectionmanagerto pointto an Excel workbook
You then need to configure theExcelSourcecomponent to use the connection manager to access data in the Excel file. You cannot configure any other settings in the source component without first specifying the applicableExcelconnection manager. Figure 2 shows theConnectionManagerpage of theExcelSourceEditor. As you can see, the first option listed is for the connection manager. From there, you select the data access mode and any relevant information specific to the selected mode.
Figure 2:TheConnectionManagerpage of theExcelSourceEditordialog box
TheExcelSourcecomponent supports four types of data access modes:
- Table or view:Specifies that an Excel worksheet be used for the source data. You must also select the applicable worksheet.
- Variable name:References a variable that contains the name of the worksheet or a named range. You must also select the applicable variable.
- SQL command text:Specifies that an SQL statement be used to access data from the target Excel worksheet. You must also provide the actual statement. You can also parameterize the query, use a GUI to build the query, browse for a file that contains the query, and parse the query after you enter it.
- SQL command from variable:References a variable that contains an SQL statement.
All references to worksheet names in theExcelSourcecomponent include a trailing dollar sign ($). In addition, whenever referencing a worksheet in a SQL statement, you must delimit the worksheet name, usually by enclosing the name in brackets.
Another useful option on theConnectionManagerpage isPreview, which lets you view all or some of the data returned from your spreadsheet, based on the selected data access mode. This option provides a handy way to spot-check the data to help confirm that you’re retrieving the correct type of information.
TheColumnspage of theExcelSourceEditordetermines which columns should be imported from the Excel spreadsheet and what to name the output columns, if you want those names to be different from the column names returned from the spreadsheet. Figure 3 shows aColumnspage configured to use the same column names as those retrieved from the spreadsheet.
Figure 3:TheColumnspage of theExcelSourceEditordialog box
The column names retrieved from the spreadsheet are considered the external columns. If you had selected theFirstrowhascolumnnamesoption in yourExcelconnection manager, the names will come directly from the first row of the returned data; otherwise, generic worksheet names will be used, such asF6,F7,F8, and so on.
The final page in theExcelSourceEditorisErrorOutput, which lets you configure how errors or truncations are handled should any occur. Figure 4 shows the default settings for each column when you first set up anExcelSourcecomponent.
Figure 4:TheErrorOutputpage of theExcelSourceEditordialog box
By default, any errors or truncations result in component failure. However, you can instead choose to ignore errors or truncations or to output the rows that contain errors or truncations to a specific destination, such as a flat file, where you can then examine the outputted data.
As pointed out earlier, however, before you can configure any options in theExcelSourcecomponent, you must create your connection manager to the Excel workbook. Only then can you configure the access mode, column setup, and error handling.
“I’ve tried changing the data types that the Excel Source component assigns to the incoming data but have had no luck. How do you control data types in the source component?”
The driver used to retrieve Excel data provides little in the way of control, when it comes to assigning the initial data types. By default, the driver samples the first eight rows of data and determines the data type from there. The incoming data is limited to one of the following SSIS types:
- Double-precision float (DT_R8)
- Currency (DT_CY)
- Boolean (DT_BOOL)
- Date and time (DT_DATE)
- Unicode string with a length of 255 (DT_WSTR)
- Unicode text stream (DT_NTEXT)
Figure 5 shows theAdvancedEditorfor anExcelSourcecomponent. The component is retrieving data from a simple spreadsheet that contains five columns. The first column (EmpID) contains all integers, so theDataTypeproperty is automatically configured with the valuedouble-precisionfloat[DT_R8]. Although you can change the property setting, the change won’t stick. SSIS always generates the data type for you when it validates the data based on the first eight rows.
Figure 5:Verifying the data types in the source component’s advanced editor
If a column contains mixed data types, SSIS often takes a “majority rules” approach when assigning the type. For example, if the column contains five integers and three text values, SSIS will again assign theDT_R8data type to that column. SSIS will also assign the numerical type to a tie, which means a four/four split will also result inDT_R8. As a consequence of this approach, all non-numerical values are returned asNULLvalues.
It gets worse. Imagine you have a column that’s a mix of numbers and text, and you want theDT_WSTRdata type to be used, but if the first eight rows contain four or more numerical values, you’ll getDT_R8whether you like it or not.
One way to address this issue is to update the column in the worksheet so that it is configured asText. That way, even if all eight values are numerical, SSIS will assign theDT_WSTRdata type to that column. The challenge with this approach is that it works only if you have access to the file or a say in how the file is generated. If you’re working with an automated system, such an approach could be very difficult, unless you control how the files are created.
Another option is to add theIMEX=1option to the connection string used by the connection manager to access the Excel file. The option tells the driver to useImportmode, which takes into account the registry settingImportMixedTypes=Text. This forces mixed data to be converted to text. Be aware, however, that usingIMEX=1can lead to unpredictable results if you try to update the data.
You can also manipulate how SSIS assigns data types by modifying the registry settingTypeGuessRows, which by default is set to8. This is the setting that determines that only the first eight rows be sampled. You might consider settingTypeGuessRowsto1to help narrow the focus of how types are determines. However, that means you’d have to ensure that your Excel files contains the correct type of data in the first row. This approach also assumes you have access to the registry.
Even if you can access to the registry, be sure to take into account all relevant operations before changing theTypeGuessRowssetting. Other SSIS components and packages, as well as other systems, might be relying on the current setting. For example, if you change the setting to1and the column’s first value is under 255 characters, SSIS assigns theDT_WSTR(255)type to the imported column. Then, if a subsequent value in that column is over 255 characters, that value could be truncated if inserting the data into a SQL Server table. In fact, depending on the nature of your data, you might actually want to configure theTypeGuessRowssetting with a larger value rather than a smaller one.
Clearly, there is no easy solution when it comes to data types and Excel data. Your best bet is to make sure your Excel workbooks are formatted to meet your needs. The more you can work within the constraints of how data types are assigned by default in SSIS, the easier your life will be.
“My SSIS package retrieves data from an Excel worksheet and inserts the data into a SQL Server table. My data flow shows a warning message indicating that data could be truncated; however, the package runs with no problem and loads the data into the table. Do I need to do anything in response to the warning message?”
Figure6:Converting data to avoid truncation
When SSIS retrieves data from an Excel workbook, it assigns one of two character data types to each text-based column:DT_WSTRorDT_NTEXT. By default, if at least one of the column’s first eight values is over 255 characters, SSIS assigns theDT_NTEXTdata type to the column; otherwise, SSIS assignsDT_WSTR.
In many cases, SSIS assigns theDT_WSTRdata type to a column whose values are much smaller than 255. Not surprisingly, you might be inserting that data into a destination column with a smaller length. For example, suppose you’re insertingDT_WSTR(255)data into a SQL Server column configured with theNVARCHAR(50)data type. If you were to create a simple data flow that contained anExcelSourcecomponent and anOLEDBDestinationcomponent, the SSIS package would run just fine, although you would receive a warning message on the destination component indicating that values could be truncated. In other words, any value with a length greater than 50 could be truncated when inserting that value into the database.
If data is being truncated, it might point to issues with your source data that need to be resolved. You can avoid any silent truncation by adding aDataConversiontransformation to your data flow, as shown in Figure 6. In this way, the package will at least fail, rather than data being inadvertently truncated.
Essentially, the transformation component converts eachDT_WSTR(255)value to aDT_WSTR(50)value. Figure 7 shows what this conversion setup would look like for the three columns inDataConversionTransformationEditor.
Figure 7:Configuring data conversionsin theDataConversionTransformationEditor
In this case, we’re simply specifying a different length and assigning an output alias to each column. We would then use these aliases when referencing the columns in subsequent data flow components. This eliminates the truncation warnings and ensure that no values with a length greater than 50 gets inadvertently truncated. Instead, the package will fail until you fix the data. Chances are, you might also want to include the logic to handle the offending data, rather than simply failing the package, but that’s a different topic altogether.
“How do I limit my result set to specific columns when retrieving data from an Excel workbook?”
The easiest way to limit the columns you retrieve from an Excel worksheet is to specify which columns to include or not to include in theExcelSourcecomponent.
By default, when you first set up the component, all columns in the worksheet are selected, but if you go to theColumnspage of theExcelSourceEditor, you’ll find you can deselect any of the columns in theAvailableExternalColumnsbox, as shown in Figure 8
Figure 8:Selecting which columns to include in your data flow
In this case, theEmpIDcolumn has been deselected. The column is then automatically removed from the lower grid on theColumnspage.
If you remove a column from theExcelSourcecomponent after you’ve already defined other components in your data flow, you must update those components as necessary. For example, your data path might include aDataConversiontransformation that specifically references a removed column. The component will have to be updated. Same goes for any destinations that map the removed column to an external column.
“I sometimes receive a warning on my Excel Source component saying that the external columns are out of sync with the data source columns. Any idea what’s going on?”
Sometimes when you open an SSIS package, you might receive a warning on yourExcelSourcecomponent saying that the external columns are out of sync with the data source columns. This confusing message is essentially saying that the columns coming into the component are not in sync with those going out. If you run the package without making any changes, you’ll likely receive the same message in your execution results.
If you instead open theExcelSourceEditor, you’ll either be prompted to have the columns synced automatically or the syncing will occur without prompting. In either case, the problem will appear to have been resolved. If you then run the package, however, you might receive one or more warning messaging about output columns not being used in the data flow.
What all this boils down to is that your source Excel worksheet has likely changed since you first set up yourExcelSourcecomponent. For example, suppose you set up your component based on a simple worksheet that contains theEmpID,FirstName,MiddleName,LastName, andBirthDatecolumns. After creating the SSIS package, someone (or something) inadvertently adds the valuexxxto one of the unused columns in the spreadsheet, as shown in Figure 9.
Figure 9:An Excel worksheet containing erroneous data
By default, when you open an SSIS package, the external data sources are validated against your currently configured components. If everything doesn’t sync up as it should, you’ll receive a warning or error message telling you there’s a problem. If the message you receive warns you that your external columns are out of sync with the data source columns, you should take a closer look at the source component via theExcelSourceEditor, where you might discover that new columns have been added to your data flow. Figure 10 shows theColumnspage of theExcelSourceEditor, complete with theF6andF7columns.
Figure 10:Extra columns from the Excel worksheet showing up in the data flow
Notice that the source component adds a column to accommodate thexxxvalue and a column that matches the one preceding the value. Your package will still run, but you’ll receive the warning message about unused output columns. You can fix this either by updating the Excel worksheet itself or by deselecting the columns on theColumnspage of theExcelSourceEditor. Whether you take either of these steps or simply do nothing, this situation might point to something else going on with your source file. The erroneous data might be the result of a simple human error or of a problem in your automation process, in which case, you might have bigger issues to contend with than a simple warning message.
“One of the columns in the Excel worksheet I use as a data source contains NULL values for some rows. How do I send those rows to a text file, rather than loading them into the target database table?”
An easy way to redirect rows that containNULLvalues is to add aConditionalSplittransformation to your data flow that defines how to process these rows. For example, suppose we retrieve employee data from an Excel spreadsheet and send that data to a SQL Server database. The spreadsheet includes aMiddleNamecolumn that can containNULLvalues. For any rows whoseMiddleNamevalue isNULL, we want to redirect the rows to a text file so that our data flow looks something like the one shown in Figure 11.