<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

  <title>Diego Villalpando's Blog</title>
  <link href="https://dialvive.dev/atom.xml" rel="self"/>
  <link href="https://dialvive.dev/blog/"/>
  <updated>2023-05-06T21:10:28+00:00</updated>
  <id>https://dialvive.github.io/blog/</id>
  <author>
   <name>Diego Villalpando</name>
   <email>dialvive@protonmail.com</email>
  </author>
  
  <entry>
   <title>What is the Best File Management System?</title>
   <link href="https://dialvive.dev/blog/2023/05/06/The-Best-File-Management-System"/>
   <updated>2023-05-06T00:00:00+00:00</updated>
   <id>https://dialvive.github.io/blog/2023/05/06/The-Best-File-Management-System</id>
   <content type="html">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;I’ve always said that less is more. And in today’s rant, I’d like to elaborate on why having an application serve (or even work with) files through a File Management System is such a bad idea.&lt;/p&gt;

&lt;p&gt;I do believe that we are living in the future, with an enormous array of applications (dependencies, frameworks, you name it) at our disposal that help us avoid reinventing the wheel each time we want to develop a new feature or modernize a legacy system. I mean, why bother reinventing the wheel when we already have access to Michelin or Pirelli wheels? And even if we don’t want to pay for top-notch wheels with a warranty and support, in the Software Engineering world we always have Open-Source options.&lt;/p&gt;

&lt;p&gt;There are frameworks that help us use a File Management System in a secure way, but it is up to every developer (and architect) to pay special attention to security when working with them. A single wrong assumption or bad development practice can lead your application to expose the File System you are using, along with your probably confidential and sensitive data, and your dignity and reputation as well.&lt;/p&gt;

&lt;p&gt;Serving files through applications is really useful in some cases: it may allow an application to import or export data in a friendly and structured way, without an API, for less tech-savvy end users (especially in a B2C setting). And just passing files from one place to another may seem like the fastest and easiest solution for this purpose, but at what cost? It will not take much time for you to realize that the security hardening of such a system will take a lot of effort, time, and money.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://media.licdn.com/dms/image/D5612AQEG_yzVm3jpPA/article-inline_image-shrink_1500_2232/0/1683406698150?e=1688601600&amp;amp;v=beta&amp;amp;t=CxPYkgtMPk-XOgjB927WPItqT68oOgDScudKLih9XtU&quot; alt=&quot;Michelin Tires&quot; title=&quot;Michelin Tires. Photo by Michelin Motorsport Rally, licensed under CC BY-NC-ND 2.0&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;concerns-with-file-based-application-storage&quot;&gt;Concerns with File-based Application Storage&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Lack of Data Integrity and Consistency&lt;/strong&gt;: File systems do not provide built-in mechanisms for ensuring data consistency, enforcing referential integrity, or handling concurrency. Ensuring data integrity and consistency becomes the responsibility of the application or its developers. Don’t you want some ACID (Atomicity, Consistency, Isolation, Durability)?&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Limited Search and Query Capabilities&lt;/strong&gt;: File systems lack advanced querying and search functionalities compared to databases. Performing complex searches, filtering, or sorting operations on files becomes challenging without implementing complex custom solutions (which will probably abuse sketchy batch or Bash scripts).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Access Control and Security Challenges&lt;/strong&gt;: File systems often have limited access control capabilities compared to databases. It can be challenging to enforce fine-grained permissions or track user actions on files. Additionally, securing files against unauthorized access or tampering requires additional measures beyond file system-level controls.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Vulnerabilities and Security Risks&lt;/strong&gt;: Common Weakness Enumerations (CWE) such as 22 to 59, 61 to 67, 69, 72, 73, 98, 219, 378, 379, 426, 427, 428, 434, 538, 541, 552, and 775, highlight the potential dangers associated with file-based storage. By avoiding files altogether, we can mitigate these risks and enhance the overall security posture of our application by design.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Added Maintenance Overhead&lt;/strong&gt;: Most file management systems, especially those backed by a full Windows or Linux operating system, require an additional layer of maintenance. This includes regular updates, patches, security configurations, and monitoring to ensure the stability and security of the operating system. And really, if we are building a new application, are we setting up a full-featured Linux or Windows server for it these days? Serving a minimal Docker image will probably suffice.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
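&lt;p&gt;To make the traversal risk (CWE-22) concrete, here is a minimal Python sketch of the kind of defensive check a file-serving application ends up needing. The helper name and base directory are my own illustration, not a prescribed API:&lt;/p&gt;

```python
import os

def resolve_safe(base_dir, requested_name):
    """Resolve a user-requested file name, rejecting path traversal (CWE-22)."""
    base = os.path.realpath(base_dir)
    # Resolve symlinks and ".." segments before comparing paths.
    target = os.path.realpath(os.path.join(base, requested_name))
    # The resolved path must stay inside the base directory.
    if os.path.commonpath([base, target]) != base:
        raise ValueError("path traversal attempt: " + requested_name)
    return target
```

&lt;p&gt;Every file-serving endpoint needs a guard like this one; a document database sidesteps the whole class of bugs by never mapping user input to a filesystem path.&lt;/p&gt;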

&lt;p&gt;&lt;img src=&quot;https://media.licdn.com/dms/image/D5612AQEGNOfEE2LnQg/article-inline_image-shrink_1500_2232/0/1683405521920?e=1688601600&amp;amp;v=beta&amp;amp;t=Pi6CVFOBfOM-e_xqss_CFkjyjIeor8_GGT0QahlsNEQ&quot; alt=&quot;Mail processing facility in NYC 1924&quot; title=&quot;Mail processing facility in NYC 1924. Photo by The Smithsonian Institution, licensed under CC Zero&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;advantages-of-using-a-document-based-database&quot;&gt;Advantages of Using a Document-based Database&lt;/h3&gt;

&lt;p&gt;A document-based database, such as MongoDB or CouchDB, offers an alternative approach to file management. Storing data as documents provides several advantages. It allows for flexible schema design, efficient querying and indexing, seamless scalability, and built-in replication for high availability:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Structured Data Management&lt;/strong&gt;: Databases provide a structured approach to data storage, allowing for efficient querying, indexing, and searching. This is especially beneficial when dealing with large amounts of data or complex data relationships.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Improved Data Consistency&lt;/strong&gt;: Databases enforce data consistency by supporting transactional operations. This ensures that data remains accurate and reliable, even in concurrent access scenarios or during system failures.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Access Control and Security&lt;/strong&gt;: Databases offer built-in access control mechanisms, allowing administrators to define user roles and permissions. This ensures that only authorized individuals can access, modify, or delete data. Additionally, databases provide security features like encryption and auditing to protect sensitive information.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Scalability and Performance&lt;/strong&gt;: Databases, including MongoDB, are designed to handle scalability and performance requirements. They can efficiently manage large datasets, distribute data across multiple nodes, and handle high volumes of read and write operations.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Flexibility in Data Modeling&lt;/strong&gt;: MongoDB, being a NoSQL database, offers flexibility in data modeling. It allows for dynamic schema design, accommodating evolving data structures and facilitating agile development. This can be advantageous when dealing with unstructured or semi-structured data.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Implementing Data Versioning Systems&lt;/strong&gt;: Versioning files stored in a document-based database is often easier and more manageable than in a file system. With the right database configuration and structure, it becomes straightforward to track and manage different versions of data. This simplifies collaboration, auditing, and rollback scenarios, ensuring data integrity and reducing complexity.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Migrating Data&lt;/strong&gt;: One of the significant advantages of moving away from file-based storage is the ease of data migration. Whether it’s for disaster recovery purposes, system upgrades, or troubleshooting, migrating data stored in a document-based database is typically simpler than dealing with file systems. The structured nature of the database allows for seamless transfer, transformation, and consolidation of data across different environments.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
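&lt;p&gt;The versioning point can be sketched in a few lines of Python. This toy in-memory list stands in for a real document collection, and every name in it is made up for illustration; a store such as MongoDB would replace the list with a collection indexed on the document id and version:&lt;/p&gt;

```python
import datetime

# A toy stand-in for a document collection; in practice this would be
# a collection in a real document database such as MongoDB.
documents = []

def save_version(doc_id, content):
    """Append a new immutable version of a document instead of overwriting a file."""
    version = sum(1 for d in documents if d["doc_id"] == doc_id) + 1
    documents.append({
        "doc_id": doc_id,
        "version": version,
        "content": content,
        "saved_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })
    return version

def latest(doc_id):
    """Return the highest-numbered version of a document."""
    versions = [d for d in documents if d["doc_id"] == doc_id]
    return max(versions, key=lambda d: d["version"])
```

&lt;p&gt;Because every version is just another document, auditing and rollback become simple queries rather than a pile of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;report_final_v2_REAL.docx&lt;/code&gt;-style files.&lt;/p&gt;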

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Relying on an application to serve or manage files through a file management system introduces significant security risks and practical drawbacks. While there may be some cases where serving files through an application seems convenient, the disadvantages outweigh the benefits.&lt;/p&gt;

&lt;p&gt;By avoiding the use of file management systems and embracing document-based databases, we can mitigate security risks, enhance data integrity, and streamline data management processes. It’s a step towards a more secure and efficient application architecture.&lt;/p&gt;

&lt;p&gt;Sometimes the best file management system is simply not having files altogether.&lt;/p&gt;

&lt;h4 id=&quot;looking-for-the-comments-feel-free-to-leave-a-comment-on-linkedin-or-github-speech_balloon&quot;&gt;Looking for the comments? Feel free to leave a comment on &lt;a href=&quot;https://www.linkedin.com/pulse/what-best-file-management-system-diego-a-villalpando/&quot;&gt;LinkedIn&lt;/a&gt; or &lt;a href=&quot;https://github.com/Dialvive/dialvive.dev/issues/5&quot;&gt;GitHub&lt;/a&gt;. :speech_balloon:&lt;/h4&gt;
</content>
   <author>
     <name>Diego Villalpando</name>
     <uri>https://dialvive.github.io/about.html</uri>
   </author>
  </entry>
  
  <entry>
   <title>Understanding Pentaho Data Integration: Jobs and Transformations</title>
   <link href="https://dialvive.dev/blog/2023/04/29/Understanding-Pentaho-Data-Integration-Jobs-and-Transformations"/>
   <updated>2023-04-29T00:00:00+00:00</updated>
   <id>https://dialvive.github.io/blog/2023/04/29/Understanding-Pentaho-Data-Integration-Jobs-and-Transformations</id>
   <content type="html">&lt;h2 id=&quot;contents-&quot;&gt;Contents &lt;a name=&quot;Contents&quot;&gt;&lt;/a&gt;&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#contents-&quot;&gt;Contents &lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#introduction&quot;&gt;Introduction&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#differences-between-jobs-and-transformations&quot;&gt;Differences between Jobs and Transformations&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#shared-elements&quot;&gt;Shared Elements&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#references&quot;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;Pentaho Data Integration (PDI) is a powerful open-source ETL tool for extracting, transforming, and loading data from various sources into a target system. In PDI, a typical ETL workflow consists of Jobs and Transformations.&lt;/p&gt;

&lt;p&gt;A Transformation represents the data transformation part of an ETL workflow and performs data manipulation tasks, such as extracting data from various sources, performing calculations, filtering, sorting, joining, and writing data to different destinations. It is designed to be reusable and can be called from a Job or another Transformation.&lt;/p&gt;

&lt;p&gt;On the other hand, a Job represents the orchestration part of an ETL workflow and handles the sequence of tasks to be executed. It can call one or more Transformations or Jobs, execute shell scripts or database operations, check for conditions, define dependencies between tasks, and handle error and recovery scenarios.&lt;/p&gt;

&lt;p&gt;In simpler terms, a Transformation performs data manipulation tasks, while a Job orchestrates the sequence of tasks to be executed in an ETL process.&lt;/p&gt;

&lt;h2 id=&quot;differences-between-jobs-and-transformations&quot;&gt;Differences between Jobs and Transformations&lt;/h2&gt;

&lt;p&gt;A key difference between the two is that a Transformation operates on a single set of data at a time, while a Job can execute many Transformations, each with its own set of data. A Job can also perform tasks outside the scope of a Transformation, such as sending notifications, managing files, or writing logs.&lt;/p&gt;

&lt;p&gt;Both Jobs and Transformations contain various components to create robust ETL workflows:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Job Entries&lt;/strong&gt; represent the different tasks or processes that need to be executed as part of the job. Job entries can include Transformations, shell scripts, database operations, file operations, and more. Each job entry performs a specific task and can be configured with various options and parameters.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;https://media.licdn.com/dms/image/D5612AQE6lCJCelMB1w/article-inline_image-shrink_1500_2232/0/1682795831213?e=1688601600&amp;amp;v=beta&amp;amp;t=VXcszLbWca4ANDz76DkuoG6QAnZERBtgnBu1zyptYdM&quot; alt=&quot;A list of Job Entries&quot; title=&quot;Most of the available Job Entries of PDI 9.3&quot; /&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Transformation Steps&lt;/strong&gt; represent the actions performed on the data during the ETL process. Each step performs a specific operation on the data (whether it is to read, transform, or write) and can be configured with various options and parameters to achieve the desired results.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;https://media.licdn.com/dms/image/D5612AQE6c7qhHvIsqw/article-inline_image-shrink_1500_2232/0/1682795910008?e=1688601600&amp;amp;v=beta&amp;amp;t=hoNahnzOxnHqDnkFUjhWdUOtWQ2eJ-5L8N4y16xt6C0&quot; alt=&quot;A list of Transformation Steps&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;shared-elements&quot;&gt;Shared Elements&lt;/h2&gt;

&lt;p&gt;Both Jobs and Transformations are also composed of:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Hops&lt;/strong&gt;: These are the lines that connect Job Entries or Transformation Steps and define the flow of execution between them. They can be used to connect different components in a linear or non-linear fashion, allowing for complex workflows.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Parameters&lt;/strong&gt;: These are variables that users can pass into a Job or Transformation at runtime, allowing for greater flexibility and reusability. They can configure different aspects, such as database connections or file paths, without the need to define them each time.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Variables&lt;/strong&gt;: These are user-defined variables that can store and manipulate data within the Job or Transformation. Variables can be used to store temporary data, configure options, or perform calculations.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Data&lt;/strong&gt;: These are not present in KJB or KTR files, since they are generated or obtained through input Steps at runtime. They are passed between Entries or Steps through Hops as table rows (like in a relational database).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both Parameters and Variables can be referenced in Entries and Steps using either of these formats: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;${PARAM_NAME}&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;%%PARAM_NAME%%&lt;/code&gt;.&lt;/p&gt;
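&lt;p&gt;As an illustration of how that substitution behaves, here is my own Python sketch of the two placeholder formats (this mimics the syntax, it is not PDI’s actual implementation):&lt;/p&gt;

```python
import re

def substitute(text, params):
    """Replace ${NAME} and %%NAME%% placeholders, mimicking PDI-style syntax."""
    def repl(match):
        # Exactly one of the two groups matches, depending on the format used.
        name = match.group(1) or match.group(2)
        # Unknown names are left untouched, placeholder and all.
        return str(params.get(name, match.group(0)))
    return re.sub(r"\$\{(\w+)\}|%%(\w+)%%", repl, text)
```

&lt;p&gt;For example, with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TABLE_NAME&lt;/code&gt; set to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sales&lt;/code&gt;, the text &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT * FROM ${TABLE_NAME}&lt;/code&gt; becomes &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT * FROM sales&lt;/code&gt;.&lt;/p&gt;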

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Understanding the differences between Jobs and Transformations in Pentaho Data Integration is crucial for building effective and efficient ETL workflows. While both Jobs and Transformations are vital components of a successful ETL process, they have different purposes and are used in different ways. Jobs are used to orchestrate the execution of tasks and can perform tasks outside the scope of Transformations, while Transformations perform data manipulation tasks. By utilizing both Jobs and Transformations effectively, users can extract, transform, and load data from various sources into a target system seamlessly.&lt;/p&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Hitachi Vantara (Mar/2018). &lt;em&gt;Basic Concepts of PDI: Transformations, Jobs and Hops&lt;/em&gt;. &lt;a href=&quot;https://help.hitachivantara.com/Documentation/Pentaho/8.0/Products/Data_Integration/Data_Integration_Perspective/010&quot;&gt;https://help.hitachivantara.com/Documentation/Pentaho/8.0/Products/Data_Integration/Data_Integration_Perspective/010&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Hitachi Vantara (Dec/2017). &lt;em&gt;Parameters&lt;/em&gt;. &lt;a href=&quot;https://help.hitachivantara.com/Documentation/Pentaho/8.0/Products/Data_Integration/Data_Integration_Perspective/050/010&quot;&gt;https://help.hitachivantara.com/Documentation/Pentaho/8.0/Products/Data_Integration/Data_Integration_Perspective/050/010&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Hitachi Vantara (Dec/2017). &lt;em&gt;Variables&lt;/em&gt;. &lt;a href=&quot;https://help.hitachivantara.com/Documentation/Pentaho/8.0/Products/Data_Integration/Data_Integration_Perspective/050/020&quot;&gt;https://help.hitachivantara.com/Documentation/Pentaho/8.0/Products/Data_Integration/Data_Integration_Perspective/050/020&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;looking-for-the-comments-feel-free-to-leave-a-comment-on-linkedin-or-github-speech_balloon&quot;&gt;Looking for the comments? Feel free to leave a comment on &lt;a href=&quot;https://www.linkedin.com/pulse/discover-key-differences-between-jobs-transformations-diego-a-/?trackingId=rMeAzKWkQVWgN0WqMlPd6g%3D%3D&quot;&gt;LinkedIn&lt;/a&gt; or &lt;a href=&quot;https://github.com/Dialvive/dialvive.dev/issues/4&quot;&gt;GitHub&lt;/a&gt;. :speech_balloon:&lt;/h4&gt;
</content>
   <author>
     <name>Diego Villalpando</name>
     <uri>https://dialvive.github.io/about.html</uri>
   </author>
  </entry>
  
  <entry>
   <title>How to Install Pentaho Data Integration</title>
   <link href="https://dialvive.dev/blog/2023/04/23/How-to-Install-Pentaho-Data-Integration"/>
   <updated>2023-04-23T00:00:00+00:00</updated>
   <id>https://dialvive.github.io/blog/2023/04/23/How-to-Install-Pentaho-Data-Integration</id>
   <content type="html">&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#Introduction&quot;&gt;Introduction&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#Version&quot;&gt;Which version should I use?&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#Install&quot;&gt;How to install PDI&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#Conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#References&quot;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;introduction-&quot;&gt;Introduction &lt;a name=&quot;Introduction&quot;&gt;&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;Pentaho Data Integration, or PDI, is a powerful open-source ETL tool. It extracts, transforms, and loads data for many applications. PDI has a visual interface to create data integration jobs. Users define data sources, manipulate data through functions, and specify the load location. In real life, PDI combines data from various sources for analysis, reporting, or warehousing.&lt;/p&gt;

&lt;p&gt;Examples of PDI usages include:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;migrating legacy systems to modern databases,&lt;/li&gt;
  &lt;li&gt;financial reporting automation,&lt;/li&gt;
  &lt;li&gt;creating data feeds for web apps,&lt;/li&gt;
  &lt;li&gt;consuming and generating message queues,&lt;/li&gt;
  &lt;li&gt;creating and managing Big Data jobs,&lt;/li&gt;
  &lt;li&gt;and even machine learning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One important note: there are two main editions of PDI available, the paid Pentaho Enterprise Edition and the free community Pentaho Data Integration. Some key differences between the two include:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Support: The paid PDI version comes with technical support, while the community version relies on community forums.&lt;/li&gt;
  &lt;li&gt;Features: Paid PDI includes advanced features such as big data integration with Hadoop, Spark, and NoSQL databases, job orchestration, and workflow automation, which are not available in the community version.&lt;/li&gt;
  &lt;li&gt;Scalability: Paid PDI is designed to handle larger volumes of data and can scale to meet the needs of larger enterprises, while the community version is suitable for smaller-scale projects.&lt;/li&gt;
  &lt;li&gt;Security: Paid PDI includes enhanced security features such as data encryption, user authentication, and access controls, which are not available in the community version.&lt;/li&gt;
  &lt;li&gt;Licensing: The community version is open-source and free to use, while the paid version requires a license fee.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this blog I will focus only on the PDI community version. It is worth mentioning that the community version is enough for many enterprise uses: you don’t need the enterprise version for most use-cases, and there’s always a workaround for CE limitations.&lt;/p&gt;

&lt;h2 id=&quot;which-version-should-i-use-&quot;&gt;Which version should I use? &lt;a name=&quot;Version&quot;&gt;&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;Like any big application, PDI has Long-Term-Support and Short-Term-Support versions. I would suggest you always use an LTS version, unless you are working with a legacy PDI implementation or want the cutting-edge features.&lt;/p&gt;

&lt;p&gt;You can check the &lt;a href=&quot;https://support.pentaho.com/hc/en-us/articles/205789159-Pentaho-Product-Version-End-of-Life-Policy&quot;&gt;support policy per version here.&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;how-to-install-pdi-&quot;&gt;How to install PDI &lt;a name=&quot;Install&quot;&gt;&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;If you don’t have Java installed, download the required Java version from your preferred JDK provider:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;For PDI version 3.x and 4.x, use Java 1.6.&lt;/li&gt;
  &lt;li&gt;For PDI version 5.x, use Java 1.7.&lt;/li&gt;
  &lt;li&gt;For PDI version 6.x and later, use Java 1.8.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that PDI may not work properly with later versions of Java, so it’s recommended to use the specific Java version required for each version of PDI.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Head over to &lt;a href=&quot;https://sourceforge.net/projects/pentaho/files/&quot;&gt;SourceForge Pentaho Community Edition site&lt;/a&gt;,&lt;/li&gt;
  &lt;li&gt;Select the version you wish to install (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Pentaho-X.X&lt;/code&gt;),&lt;/li&gt;
  &lt;li&gt;Select client-tools,&lt;/li&gt;
  &lt;li&gt;Click on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pdi-ce-X.X.x.x-xxx.zip&lt;/code&gt; to download,&lt;/li&gt;
  &lt;li&gt;Before extracting the zip file, open a file explorer and go to your desired installation path,&lt;/li&gt;
  &lt;li&gt;Create a folder called pentaho, and then a design-tools folder within,&lt;/li&gt;
  &lt;li&gt;Extract the zip into a folder named after the zip file, within the design-tools folder,&lt;/li&gt;
  &lt;li&gt;Your installation path might look like:
    &lt;ul&gt;
      &lt;li&gt;Linux: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;~/pentaho/design-tools/pdi-ce-X.X.x.x-xxx&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;Windows: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;C:\Program Files\pentaho\design-tools\pdi-ce-X.X.x.x-xxx&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;Mac: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/Applications/pentaho/design-tools/pdi-ce-X.X.x.x-xxx&lt;/code&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;[Optional] If you have or plan to have other Java versions, add to your environment variables:
    &lt;ul&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PENTAHO_JAVA_HOME&lt;/code&gt;: put the path to the required Java installation folder.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;[Optional] There will be a data-integration folder within your installation directory. Open it, and create a shortcut to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;spoon.sh&lt;/code&gt; if you are using Linux or Mac, or to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;spoon.bat&lt;/code&gt; if using Windows. Rename it to “PDI CE X.X”, where CE stands for Community Edition and X.X is the version. Drag it to your desired location.
    &lt;ul&gt;
      &lt;li&gt;In Windows, you might want to copy the shortcut to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;C:\ProgramData\Microsoft\Windows\Start Menu\Programs\Pentaho&lt;/code&gt; for it to appear in your startup menu.&lt;/li&gt;
      &lt;li&gt;If you are a terminal enthusiast, you might want to add the path to the appropriate spoon file to your PATH environment variable, or to your bash profile.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ol&gt;
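&lt;p&gt;If you prefer to script the layout-and-extract steps above, here is a minimal Python sketch. The function name is my own, and the paths and zip name are placeholders for whatever you downloaded:&lt;/p&gt;

```python
import pathlib
import zipfile

def install_pdi(zip_path, install_root):
    """Recreate the pentaho/design-tools layout and extract the PDI zip into it."""
    tools_dir = pathlib.Path(install_root) / "pentaho" / "design-tools"
    # The target folder is named after the zip file, per the steps above.
    target = tools_dir / pathlib.Path(zip_path).stem
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target
```

&lt;p&gt;Running it with the downloaded &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pdi-ce-X.X.x.x-xxx.zip&lt;/code&gt; and your home folder (or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;C:\Program Files&lt;/code&gt; on Windows) reproduces the directory layout described earlier.&lt;/p&gt;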

&lt;p&gt;You should be able to open Pentaho Data Integration by executing your shortcut or the spoon file.&lt;/p&gt;

&lt;h2 id=&quot;conclusion-&quot;&gt;Conclusion &lt;a name=&quot;Conclusion&quot;&gt;&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;Not having an official installer and having to install PDI manually might seem daunting, but as you now know, it’ll only take a few minutes. Feel free to install multiple PDI versions following this same post.&lt;/p&gt;

&lt;p&gt;Now you are able to explore PDI and play with it to get to know the interface and components. As you explore, it might seem like there’s a steep learning curve to start using PDI, but that is far from the truth. With Pentaho Data Integration at your disposal, you’ll be well-equipped to handle even the most complex data integration challenges; it’s a valuable tool that can help you streamline your data integration processes and improve your overall ETL workflow.&lt;/p&gt;

&lt;p&gt;Coming up next in this blog series, we will learn about PDI’s inner workings and components. Keep an eye out for the next posts.&lt;/p&gt;

&lt;h2 id=&quot;references-&quot;&gt;References &lt;a name=&quot;References&quot;&gt;&lt;/a&gt;&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Hitachi Vantara (Dec/2022). &lt;em&gt;Installing Pentaho Data Integration CE&lt;/em&gt;. &lt;a href=&quot;https://www.hitachivantara.com/en-us/pdf/implementation-guide/three-steps-to-install-pentaho-data-integration-ce.pdf&quot;&gt;https://www.hitachivantara.com/en-us/pdf/implementation-guide/three-steps-to-install-pentaho-data-integration-ce.pdf&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Hitachi Vantara (No date). &lt;em&gt;Pentaho Community Edition (CE) Installation Guide for Windows - Whitepaper&lt;/em&gt;. &lt;a href=&quot;https://www.hitachinext.com/en-us/pdf/white-paper/pentaho-community-edition-installation-guide-for-windows-whitepaper.pdf&quot;&gt;https://www.hitachinext.com/en-us/pdf/white-paper/pentaho-community-edition-installation-guide-for-windows-whitepaper.pdf&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Hitachi Vantara (Jun/22). &lt;em&gt;Install the PDI tools and plugins&lt;/em&gt;. &lt;a href=&quot;https://help.hitachivantara.com/Documentation/Pentaho/9.3/Setup/Install_the_PDI_tools_and_plugins#cp_pentaho_perform_manual_install_of_pdi_design_tools_and_plugins&quot;&gt;https://help.hitachivantara.com/Documentation/Pentaho/9.3/Setup/Install_the_PDI_tools_and_plugins#cp_pentaho_perform_manual_install_of_pdi_design_tools_and_plugins&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;looking-for-the-comments-feel-free-to-leave-a-comment-on-linkedin-or-github-speech_balloon&quot;&gt;Looking for the comments? Feel free to leave a comment on &lt;a href=&quot;https://www.linkedin.com/pulse/how-install-pentaho-data-integration-diego-a-villalpando-vel%C3%A1zquez&quot;&gt;LinkedIn&lt;/a&gt; or &lt;a href=&quot;https://github.com/Dialvive/dialvive.dev/issues/3&quot;&gt;GitHub&lt;/a&gt;. :speech_balloon:&lt;/h4&gt;
</content>
   <author>
     <name>Diego Villalpando</name>
     <uri>https://dialvive.github.io/about.html</uri>
   </author>
  </entry>
  
</feed>
