NiFi is an easy-to-use, powerful, and reliable system to process and distribute data. So, how can we consolidate our organization’s data team using this one specific tool?
We recently shared our perspective on how to build a data team. Because we should always look for ways to make it simpler for that team to work together, here is one option that relies on a single tool.
To be clear, there are thousands of data tools that can be combined into a data platform, and Matt Turck has already mapped many of them in his Data Landscape. Here we focus on just one tool, to guide you in reducing your data stack.
Data teams
In our previous article we covered the skills you should look for when hiring. Hopefully your team now looks like this:
- Data Engineers
- DataOps
- Data Analysts
- Data Stewards
In our experience, Apache NiFi lets you work with your data at multiple levels and across multiple teams. So, let’s get into it.
What is NiFi?
NiFi describes itself as “an easy to use, powerful, and reliable system to process and distribute data.” Cloudera has also called it “low code streaming data processing”, which means you can build data flows with very little effort. In the era of video, though, you can see it for yourself.
High-level features in NiFi are:
- Web-based user interface
- Highly configurable
- Data provenance
- Designed for extension
- Secure
Features we have enjoyed while working with it:
- Streamlined work
- Easy to support
- Easy to build one-off solutions
- Easy to run locally
- Granular access
- Scalable
- Resilient
- Easy on-prem to cloud integration
How will it help integrate your teams?
Now that we have a glimpse of what NiFi is, we want to focus on two teams that can work together on top of it: Data Engineers and DataOps.
NiFi for Data Engineers
In today’s world, a Data Engineer has a wide range of responsibilities across the data stack, from building the platform to creating custom pipelines for the specific business need that just came up. As a result, the role is very often randomized, that is, pulled into unplanned work.
Nobody wants randomization, but it happens in every organization because of “Time to Market” pressure. That is why we need to speed up development for the moments when we want data right away.
In this sample, we can see a flow that was built in about 2 minutes (and it took that long only because we had to get secret keys from Twitter). These are the steps it follows:
- Listen to Twitter for a specific hashtag
- Send it to S3
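NiFi builds this flow on the canvas, with no code involved. For readers who think in code, the two steps above correspond roughly to the plain-Python sketch below. Everything in it is illustrative: the tweet dictionaries, the `matches_hashtag`, `object_key` and `route` helpers, and the key layout are our own assumptions, and the actual upload to S3 is left out.

```python
from typing import Dict, Iterable, Iterator, Tuple

Tweet = Dict[str, str]  # assumed shape: {"id": ..., "text": ...}


def matches_hashtag(tweet: Tweet, hashtag: str) -> bool:
    """True if the hashtag appears as a whitespace-separated token (case-insensitive)."""
    wanted = "#" + hashtag.lstrip("#").lower()
    return wanted in tweet["text"].lower().split()


def object_key(tweet: Tweet, hashtag: str) -> str:
    """Hypothetical key layout: one JSON object per tweet, grouped by hashtag."""
    return f"tweets/{hashtag.lstrip('#').lower()}/{tweet['id']}.json"


def route(tweets: Iterable[Tweet], hashtag: str) -> Iterator[Tuple[str, Tweet]]:
    """Step 1: keep matching tweets. Step 2: pair each with the key it would be stored under."""
    for tweet in tweets:
        if matches_hashtag(tweet, hashtag):
            yield object_key(tweet, hashtag), tweet
```

In NiFi, each of these functions is a configured processor rather than code, which is what makes the flow quick to build and easy to hand over.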
This is, of course, a very simple flow. It could get far more complex, but we are not trying to cover everything NiFi can do; we are interested in how it lets teams work together.
As you can see, data is already flowing, and we are storing tweets for the hashtag we decided to follow in the bucket we wanted. The important part is that we can now ship this flow to prod and hand it over to DataOps Engineers, who can make sure it keeps working or change it as required.
In summary, this randomization cost the Data Engineer about 2 minutes to deliver the tweets we need to analyze.
NiFi for DataOps
Based on the example above, what can DataOps do? Let’s see some examples:
- Change the hashtag to follow
- Change the bucket to put tweets in
- Re-post tweets in case they are deleted
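These changes are normally made in the processor's configuration dialog, but they can also be scripted against NiFi's REST API. The sketch below only builds the JSON body for a `PUT /nifi-api/processors/{id}` request and does not send it. The processor ids, revision numbers and values are hypothetical, and the property names `Terms to Filter On` and `Bucket` are assumed to match what the GetTwitter and PutS3Object dialogs show in your version.

```python
import json
from typing import Any, Dict


def processor_update(processor_id: str, revision_version: int,
                     properties: Dict[str, str]) -> Dict[str, Any]:
    """Build the JSON body for updating a processor's properties.

    NiFi expects the processor's current revision version alongside the change,
    so the caller has to read the processor first to obtain it.
    """
    return {
        "revision": {"version": revision_version},
        "component": {
            "id": processor_id,
            "config": {"properties": properties},
        },
    }


# Hypothetical ids, revisions and values, for illustration only.
new_hashtag = processor_update("get-twitter-id", 3, {"Terms to Filter On": "#dataops"})
new_bucket = processor_update("put-s3-id", 5, {"Bucket": "my-other-bucket"})

print(json.dumps(new_hashtag, indent=2))
```

Keep in mind that NiFi only accepts configuration changes on a processor that is stopped, so a script like this would stop the processor, apply the update, and start it again.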
So yes, this tool gives DataOps some level of freedom. But what happens when you have hundreds of flows running in your NiFi instance? This is how it looks:
That is impossible to manage; with every flow on it, NiFi’s canvas is simply too cluttered. You can be more organized, add ways to keep things simple and apply every best practice, but the canvas will keep growing every time a planned or randomized flow is created.
Of course, the NiFi team already thought about it, and added a feature for DataOps (or support teams) who need to locate a single flow quickly or simply monitor what’s going on in that instance.
As you can see, in the top right corner there is a menu icon that holds some interesting options:
I just want to emphasize a few of them that will help DataOps teams handle NiFi easily:
- Summary: Shows the contents of the canvas as a list, either per processor (like GetTwitter) or per entire flow (Process Group). It has a filter where you can type a name, then click the arrow icon on the right side to jump into that flow.
- Bulletin board: Shows the error log, so in case of failure you can go there and see what’s going on across the entire NiFi instance.
- Data Provenance: A centralized place where every change applied to the data is recorded. It tracks and stores every event that occurs in the flow, such as fetching a file, selecting a few columns or converting to Parquet. Each entry has the same right arrow icon, which takes you to where the event happened. This feature works at the processor level.
- Flow configuration history: This is a record of everything that a user has done in NiFi (except for logging in). So, if a user starts, stops, disables, changes the configuration of or creates a flow, it will be registered here with the user’s identifier and every step taken.
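Some of these views can also be read through NiFi's REST API, which helps when DataOps wants to feed them into its own monitoring. The sketch below is a minimal example under assumptions: the host name is hypothetical, the `/flow/bulletin-board` and `/flow/history` paths and the response shape are as we understand them from the REST API documentation, and no request is actually sent. Check the paths against the documentation for your NiFi version before relying on them.

```python
from typing import Any, Dict, List
from urllib.parse import urlencode

BASE_URL = "https://nifi.example.com:8443/nifi-api"  # hypothetical host


def bulletin_board_url(base: str = BASE_URL) -> str:
    """URL assumed to return the instance-wide bulletin board."""
    return f"{base}/flow/bulletin-board"


def flow_history_url(offset: int = 0, count: int = 50, base: str = BASE_URL) -> str:
    """URL assumed to return one page of the flow configuration history."""
    return f"{base}/flow/history?{urlencode({'offset': offset, 'count': count})}"


def error_messages(board: Dict[str, Any]) -> List[str]:
    """Extract ERROR-level messages from a bulletin board response (assumed shape)."""
    bulletins = board.get("bulletinBoard", {}).get("bulletins", [])
    return [
        entry["bulletin"]["message"]
        for entry in bulletins
        if entry.get("bulletin", {}).get("level") == "ERROR"
    ]


# Hand-written sample in the assumed response shape, not real NiFi output.
sample_board = {
    "bulletinBoard": {
        "bulletins": [
            {"bulletin": {"level": "ERROR", "message": "PutS3Object failed"}},
            {"bulletin": {"level": "WARNING", "message": "Queue is filling up"}},
        ]
    }
}

print(error_messages(sample_board))
```

A real script would fetch these URLs with an authenticated HTTP client and pass the parsed JSON to `error_messages`.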
Conclusion
At this point, you may be considering at least a NiFi PoC in your stack. And this is just one of the possible applications of NiFi to consolidate your organization’s data team.
Keep in mind NiFi currently (as of version 1.16.2) has 309 pre-built processors that allow you to:
- Use relational databases, such as SQL Server, MySQL, Postgres, etc.
- Use NoSQL databases like MongoDB or Elasticsearch
- Use “Big Data” ecosystems with Hadoop, Hive, Presto, Spark, Kafka, etc.
- Interact with multiple cloud services in AWS, Azure and GCP
- Convert data between multiple formats: CSV, JSON, Parquet, Avro, etc.
- Act as a socket for message handling
- Act as an API handler
- Perform predefined web scraping
- Receive and send emails
So, does this sound good to you? If the answer is yes, then you can download it and try it out from here: https://nifi.apache.org/
Comments? Contact us for more information. We’ll quickly get back to you with the information you need.