Plant Height of Hydroponically Grown Basil

Motivation

I was first introduced to hydroponic farming when I helped an environmental engineer at the Aurobindo Ashram in New Delhi build a rooftop system in Summer 2017. As an environmentalist, I was really impressed by the water and space-conservation benefits that hydroponic farming has over traditional agriculture. As a citizen of a primarily agrarian country, I saw controlled environment agriculture as a strong potential solution to drought, shifting climate belts, food insecurity, and supply-chain management issues. This led me to start my organization, Hydroponic-a, which has been broadening access to hydroponic farming systems at the grassroots level since 2017. This study serves as an opportunity for me to learn more about the inner workings of a hydroponic system with reference to a plant which has great commercial applications: basil.

Introduction

Hydroponic farming is, in simple terms, growing plants without soil and with a minimal amount of water. In most hydroponic systems, plants are placed in small net-pots that are fitted into PVC pipes, through which chemical solutions (called nutrient solutions) are pumped to supply the compounds essential for plant growth. A hydroponic system uses only one-tenth of the water that plants normally do, making it one of the most eco-friendly methods of gardening and farming.

Any system which involves the artificial control of physical and chemical conditions for agriculture is called a Controlled Environment Agriculture (CEA) system. This study will use the two terms ‘hydroponics’ and ‘CEA’ interchangeably.

The Data

MIT compiled this dataset for its Open Ag(riculture) initiative, as part of its broader research on taste patterns for basil. The MIT study, conducted in 2018, gathered data on the physical and chemical regulation of its CEA systems from 08/10/2018 to 13/11/2018.

The key features of their research design are:

The ‘farm’ was divided into 9 bays with 18 plants, with bays marked 201..210.
Each of these bays had different pH, temperature, and nutrient regulation patterns.
Plant height (in cm) and conditions were noted each week between 08/10/2018 and 13/11/2018.

Each bay is essentially an independent hydroponic unit, with different conditions being administered (barring external environmental temperature and pressure conditions). This structure is typical in traditional hydroponic farms. An image to assist with visualizing this ‘bay’ structure is attached below (Note: This is not the farm from the MIT study, but a guiding image for conceptual understanding).

Hypothesis & Research Aim

Null Hypothesis: There is no difference in the plant height between bays and pH regulation patterns have no statistically significant effect on growth.

Alternative Hypothesis: There is a statistically significant difference in plant heights between different bays, and this is in part associated with difference in pH regulation patterns.

Aims of the Study

To determine whether there is a difference in plant height across bays
To establish an association between pH (and potentially other important features) and plant height
To discover any patterns in chemical or physical conditions that have an impact on basil growth

Exploratory Data Analysis

For my initial exploration, I selected three columns: bays, plant height (in cm), and pH of the water in the reservoir, and cleaned the data by taking the following steps:

For NaN (undefined) values in the first week of data collection for each crop cycle, no value is provided because the plants do not yet have a height (shoots haven’t emerged); 0 is therefore imputed.

For NaN values at any later point, there was simply no data collection; the mean for the specific bay in the specific week is imputed.
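The two imputation rules above can be sketched in pandas as follows. This is a minimal illustration on a toy table for a single crop cycle; the column names (bay, week, height_cm) and the values are assumptions, not the dataset's actual schema.

```python
import numpy as np
import pandas as pd

# Toy long-format table; column names and values are hypothetical.
df = pd.DataFrame({
    "bay":       [201, 201, 201, 201, 201, 201],
    "week":      [1, 1, 2, 2, 2, 3],
    "height_cm": [np.nan, np.nan, 10.0, np.nan, 14.0, 20.0],
})

def impute_heights(frame: pd.DataFrame) -> pd.DataFrame:
    """Apply both rules to one crop cycle and return a cleaned copy."""
    out = frame.copy()
    # Rule 1: first-week gaps mean "no shoot yet", so the height is 0.
    first_week = out["week"] == out["week"].min()
    out.loc[first_week & out["height_cm"].isna(), "height_cm"] = 0.0
    # Rule 2: later gaps take the mean of the same bay in the same week.
    bay_week_mean = out.groupby(["bay", "week"])["height_cm"].transform("mean")
    out["height_cm"] = out["height_cm"].fillna(bay_week_mean)
    return out

clean = impute_heights(df)
print(clean)
```

With two crop cycles, the same function would be applied to each cycle separately so that "first week" is resolved per cycle.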

pH Regulation Patterns: The bar graphs below show how pH relates to basil plant heights in the first week (August 13, 2018) and last week (September 11, 2018) of the first crop cycle. From this visualization, we see that the pH is higher (reservoir water is more basic) in the first week, when the crops are initially planted, and lower (more acidic) in the last week of growth, when they are harvested.

Examining One Bay: For the next stage in the exploratory data analysis, I decided to zoom in on one bay, to allow for examination of plant growth throughout the study at a lower level, which otherwise becomes a complex task when considering the entire dataset.

Selecting a bay: The aim is to choose a bay which best represents the entire population of plants in the study. Bay 204 has the overall median plant height, and a roughly median pH level at all stages of growth.
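One way such a selection could be made is to pick the bay whose average height is closest to the median across bays. The sketch below uses made-up per-bay averages, so only the selection logic is meaningful.

```python
import pandas as pd

# Hypothetical per-bay mean heights in cm; not the study's values.
mean_height = pd.Series({201: 30.1, 202: 33.4, 203: 28.7, 204: 31.9, 205: 35.2})

# The representative bay is the one nearest to the median of the bay means.
median_value = mean_height.median()
representative_bay = (mean_height - median_value).abs().idxmin()
print(representative_bay, median_value)
```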

Cleaned Data for Bay 204: Shows the average pH and plant height for every week in which data is analyzed. From this table, we see that the average plant height suddenly drops from 44.5 on 9th September to 4.47 on 16th October. This is because after the 9th, the basil crop is harvested and the first growth cycle comes to an end. Having made this observation, the study limits all analyses and statistical testing to either one of the two crop cycles at any given point in the study.

The Average Plant: pH and Growth: The line graphs below give us a better understanding of the growth trajectory and pH regulation patterns for the average plant in the bay with the median height.

Growth Trajectory of The Average Plant

pH Regulation Pattern for Bay 204

Key Outcomes of Exploratory Data Analysis

There were two harvest cycles
The average plant, at its tallest, is 44 cm
pH levels are changed each week in the bays, likely to optimize for each stage of the growth cycle
Growth curve is steepest between days 18-23 when pH is 6.1 (maximum growth happens here)

Section 1: Difference Between The Heights Of Plants In Different Bays

My initial approach was to conduct systematic pair-wise A/B testing to detect a difference between bays’ heights. This ran into the Multiple Inference Problem.

Multiple Inference Problem: Performing more than one statistical inference procedure on the same data set can lead to false positive associations in the data.
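To see why this matters, the sketch below computes the chance of at least one false positive when many tests are run at the same significance level. It assumes the tests are independent, which pairwise comparisons on shared bays are not, so the figure is only indicative. The count of 45 pairs is the one reported for the pairwise comparison later in this study.

```python
def familywise_error_rate(n_tests: int, alpha: float) -> float:
    """P(at least one false positive) across n_tests independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

n_pairs = 45   # pairwise comparisons between bays, as reported in Section 2
fwer = familywise_error_rate(n_pairs, alpha=0.05)
print(f"{n_pairs} tests at alpha = 0.05 -> family-wise error rate of {fwer:.2f}")
```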

Methodology: Performing a one-way ANOVA test on the cleaned dataset for the first crop cycle (08/13-09/11) to determine whether there is a difference in plant height between bays. This circumvents the multiple inference problem, as it is a single statistical test that handles multiple independent groups.

I used the SciPy library’s statistical functions module, scipy.stats, which has an f_oneway function that performs a simple one-way ANOVA test. A one-way ANOVA test compares the means of multiple independent groups to find evidence that at least one of the included groups has a population mean that is significantly different from the rest. The dataset for this model only contains the cleaned plant height (target value) and bay values for all plants for all dates in the crop cycle (above).
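A minimal sketch of the call, run on synthetic heights for three hypothetical bays rather than on the study's data, looks like this:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic heights (cm) for three hypothetical bays; not the study's data.
heights_by_bay = {
    201: rng.normal(30.0, 5.0, size=40),
    202: rng.normal(30.0, 5.0, size=40),
    203: rng.normal(40.0, 5.0, size=40),  # deliberately shifted mean
}

# f_oneway takes one array of observations per group.
result = stats.f_oneway(*heights_by_bay.values())

alpha = 0.01
reject_null = bool(result.pvalue < alpha)
print(f"F = {result.statistic:.2f}, p = {result.pvalue:.3g}, reject H0: {reject_null}")
```

The returned object exposes the F statistic and the p-value. The p-value is compared against the chosen significance level.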

Satisfying requirements of the ANOVA model in the dataset: An ANOVA test is a parametric test (assumes that modelling from a probability distribution is feasible for the population). It is therefore only effective if:

The dependent variable is continuous (i.e., interval or ratio level) - Heights (in cm)
The independent variable is categorical (i.e., two or more groups) - Bays
Data roughly follows a normal distribution
Populations have similar variance

Given that 1 and 2 are satisfied, I had to ensure that my data satisfies conditions 3 and 4. For condition 4 (similar variance), I checked the spread in each population using np.std on plant height values (the standard deviations of all bays fall in the range of 13.5-16.7). For condition 3 (normal distribution), I generated a histogram for each bay’s heights. Barring the 0-10 cm bin, they all roughly follow a Gaussian distribution. To account for the high frequency of 0-10 cm values (when the plant is just a shoot), data from the first week is excluded for all bays (as this data is also irrelevant to the end-result height).
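The two checks can be sketched as below, on a synthetic stand-in table. The column names (bay, week, height_cm) are hypothetical, and the histogram is computed as bin counts rather than plotted.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic stand-in for the cleaned table; names and values are made up.
df = pd.DataFrame({
    "bay": np.repeat([201, 202, 203], 60),
    "week": np.tile(np.repeat([1, 2, 3, 4], 15), 3),
})
df["height_cm"] = np.where(df["week"] == 1, 0.0,
                           rng.normal(30.0, 5.0, size=len(df)))

# Exclude the first week, when plants are still shoots near 0 cm.
later = df[df["week"] > df["week"].min()]

# Similar-variance check: population standard deviation per bay
# (ddof=0 matches the default of np.std).
std_by_bay = later.groupby("bay")["height_cm"].std(ddof=0)
spread_ratio = std_by_bay.max() / std_by_bay.min()

# Normality check: bin one bay's heights, as a histogram plot would.
bay_201 = later.loc[later["bay"] == 201, "height_cm"]
counts, bin_edges = np.histogram(bay_201, bins=6)
print(std_by_bay.round(2).to_dict(), round(spread_ratio, 2), counts)
```

A spread ratio close to 1 suggests the bays have comparable variability. A formal test such as Levene's could be substituted, though the study itself relied on inspecting the standard deviations.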

Results of the ANOVA test: The result of the ANOVA test is printed in the snippet of code below

Explanation and Analysis: We took our alpha value (the significance level below which the null hypothesis is rejected) to be 0.01, so as to abide by the strictest of measures. Since the p-value of 0.0002 is less than 0.01, we reject the null hypothesis: the ANOVA test indicates that at least one bay has an average plant height significantly different from the others. Conducting a one-way ANOVA test therefore informed me that there is a significant difference between the heights of plants in different bays, which concludes the first section of the research project.

Section 2: To find which bays this difference in mean heights is significant for

Methodology: Performing Tukey’s Range Test to obtain pairs of bays which are significantly different from each other, using the pairwise_tukeyhsd function from the statsmodels.stats.multicomp module.

Tukey’s Test: Tukey’s test calculates the Honestly Significant Difference (HSD) for all pairwise combinations of the populations in your study. It is calculated as shown below:
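For reference, the textbook form of the statistic for equal group sizes is sketched here. The symbols q, k, N, n and MS_within are notation introduced for this sketch and may differ from the formula originally shown.

```latex
\mathrm{HSD} = q_{\alpha,\,k,\,N-k}\,\sqrt{\frac{MS_{\text{within}}}{n}}
```

Here q is the critical value of the studentized range distribution at significance level α for k groups and N−k within-group degrees of freedom, MS_within is the within-group mean square from the ANOVA, and n is the number of observations per group. Two group means are declared significantly different when their absolute difference exceeds the HSD.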

Result of the Tukey’s Test: Out of 45 pairs, 6 pairs showed a statistically significant difference with an alpha value of 0.05. Most significant among these were (Bay 205, Bay 210) and (Bay 206, Bay 210).

The Tukey test was then conducted on the section of the dataset which only consisted of the last week’s (09/11/2018) data to ensure that the pairs with the difference were consistent: there too, the null hypothesis was rejected for the pairs above.
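A minimal sketch of how pairwise_tukeyhsd can be called and how the significant pairs can be read off is given below. It runs on synthetic heights for three hypothetical bays, so the pairs it flags say nothing about the study's bays.

```python
from itertools import combinations

import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(2)
# Synthetic heights (cm) for three hypothetical bays; not the study's data.
heights = np.concatenate([
    rng.normal(30.0, 5.0, size=40),
    rng.normal(30.0, 5.0, size=40),
    rng.normal(42.0, 5.0, size=40),  # deliberately taller bay
])
bays = np.repeat(["201", "202", "203"], 40)

tukey = pairwise_tukeyhsd(endog=heights, groups=bays, alpha=0.05)
print(tukey.summary())

# Pairs are reported in the same order as combinations of the sorted labels.
labels = [str(g) for g in tukey.groupsunique]
significant = [pair for pair, rejected
               in zip(combinations(labels, 2), tukey.reject) if rejected]
print(significant)
```

Repeating the same call on a subset of rows, such as only the final week, corresponds to the consistency check described above.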

Section 3: Cause of Difference in and Predicting Plant Heights

Methodology: To predict heights given a set of features, I decided to use a logistic regression model.

Step 1: Classify the plants into tall and short: Since what matters most is the end height of the plant, we limit the dataset to the plant heights in the last week of the crop cycle (09/11-09/18). Average height across all bays for this period is found to be 41.52 cm. All plants with a final height above this value are classified as ‘tall’, and those below it are classified as ‘short’. For our training dataset, values are assigned to the categories as follows:

Short: 0
Tall: 1
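The labelling step can be sketched as follows. The four heights are hypothetical, so the threshold here is the mean of the toy values rather than the study's 41.52 cm.

```python
import pandas as pd

# Hypothetical final-week heights in cm; not the study's data.
final = pd.DataFrame({
    "plant": ["a", "b", "c", "d"],
    "height_cm": [35.0, 48.0, 41.0, 44.0],
})

# Threshold is the mean final height; above it is tall (1), otherwise short (0).
threshold = final["height_cm"].mean()
final["tall"] = (final["height_cm"] > threshold).astype(int)
print(threshold, final["tall"].tolist())
```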

Shortlisting features: To shortlist features for the model, I wanted to find those that are most strongly correlated with the target variable (height category). I decided to perform a correlation analysis using the Pearson method, as I felt a covariance-based method would be ideal for this dataset. Correlations were derived using the pandas DataFrame ‘.corr’ method. The result of the correlation analysis was as follows:

Outcome: The features with the highest correlation were found to be

Reservoir temperature: h2o_temp_C (positive)
Reservoir pH level: pH (positive)
Calcium nutrient solution: solution_Ca_ppmCA (negative)
Copper nutrient solution: solution_Cu_ppmCu (negative)
Nitrate nutrient solution: solution_NO3N_ppmNO3N (negative)

pH levels have a high positive correlation with the final week’s heights. This is a significant outcome: in our exploratory analysis, we found that pH levels were kept significantly lower in the final week of the growth cycle than in the first week. Given just that data, one would expect to find that more acidic levels encourage plant growth (if it were the aim of the study to have taller plants).

Key observation: All nutrient solutions (which serve a function similar to fertilizers in traditional agriculture, but are liquid chemicals for plant nutrition) with a strong correlation were negatively correlated to the plant height.
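The shortlisting step can be sketched as below. The column names are reused from the list above, but the values and the rule that generates the target are synthetic assumptions, so the resulting coefficients do not reproduce the study's results.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 200
# Synthetic features; values are made up for illustration only.
data = pd.DataFrame({
    "h2o_temp_C": rng.normal(22.0, 1.0, size=n),
    "pH": rng.normal(6.0, 0.3, size=n),
    "solution_Ca_ppmCA": rng.normal(150.0, 20.0, size=n),
})
# Toy target: taller when the water is warmer and calcium is lower.
score = (data["h2o_temp_C"] - 22.0) - (data["solution_Ca_ppmCA"] - 150.0) / 20.0
data["tall"] = (score + rng.normal(0.0, 0.5, size=n) > 0).astype(int)

# Pearson correlation of every feature with the binary target,
# ranked by absolute strength so negative associations are not buried.
corr_with_target = data.corr(method="pearson")["tall"].drop("tall")
ranked = corr_with_target.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked.round(2))
```

Ranking by absolute value keeps strongly negative features, such as the nutrient solutions in this study, near the top of the shortlist.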

Feature Engineering: For nutrient solution values, the average of the two days in the crop cycle duration for which data was entered (08/14, 09/04) was taken for each solution.

Assumption of the Study: Data was entered for nutrient solutions by the researcher every time it was changed for a bay.

Defining a Heuristic: I defined a simple heuristic which used np.random.choice to randomly classify each plant as 0 or 1 with equal probability, and measured its accuracy to get a baseline for the model. The baseline accuracy was 0.396, which means that the dummy model predicts correctly in 39.6% of cases. I then split the final dataset into training and testing datasets (70%-30% split). The logistic regression model I used is the scikit-learn library’s LogisticRegression.
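A coin-flip baseline of this kind can be sketched as follows. The labels are randomly generated placeholders, and the sketch uses NumPy's Generator.choice, which serves the same purpose as np.random.choice.

```python
import numpy as np

rng = np.random.default_rng(4)
# Placeholder labels (1 = tall, 0 = short); the study's labels are not used.
y_true = rng.integers(0, 2, size=1000)

# Coin-flip classifier: 0 or 1 with equal probability, ignoring all features.
y_guess = rng.choice([0, 1], size=y_true.size, p=[0.5, 0.5])
baseline_accuracy = float(np.mean(y_guess == y_true))
print(f"baseline accuracy: {baseline_accuracy:.3f}")
```

Such a baseline has an expected accuracy of 0.5, and a single random draw on a small sample can land noticeably above or below that value.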

Logistic regression: Logistic regression is a machine learning algorithm which, based on a set of cleaned and/or mathematically transformed numerical features, makes a binary classification for a target variable (which in this study is the height).

After fitting the model to my training dataset (which is used by the model to ‘learn’ how to classify the target variable), I obtained a training score of 0.66 and a validation (test dataset) score of 0.54. A training score that is greater than the validation score means that the model was more accurate in its predictions on the data it was trained on than on data that is unseen to it. This indicates overfitting, wherein the model is too specific to its training data and predicts inaccurately for plants it was not trained on.

I decided to remove the 3 solutions from the list of features being tested, and see how a model based only on pH and temperature performs. This model did better, predicting the height classification correctly in 66% of cases for the validation set (and 64% for the training set), which is well above the baseline of 39.6%.
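The fit-and-score workflow can be sketched with scikit-learn as below. The two features stand in for pH and water temperature, but the values and the rule that generates the labels are synthetic assumptions, so the scores do not correspond to the study's 64% and 66%.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
n = 400
# Synthetic pH and water temperature; not the study's data.
X = np.column_stack([
    rng.normal(6.0, 0.3, size=n),    # pH
    rng.normal(22.0, 1.0, size=n),   # water temperature in Celsius
])
# Toy rule: higher pH and warmer water make "tall" (1) more likely.
signal = (X[:, 0] - 6.0) / 0.3 + (X[:, 1] - 22.0)
y = (signal + rng.normal(0.0, 1.0, size=n) > 0).astype(int)

# 70%-30% split, as in the study.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
train_score = model.score(X_train, y_train)   # accuracy on seen data
test_score = model.score(X_test, y_test)      # accuracy on unseen data
print(f"train = {train_score:.2f}, validation = {test_score:.2f}")
```

Comparing the two scores is the overfitting check described above: a training score well above the validation score suggests the model does not generalize.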

Conclusion

Each of the three sections tackles one research aim. To summarize, the key outcomes of the study were:

There were two harvest cycles that took place over the course of the data collection process.
Zooming in on one bay: The average plant, at its last stage of growth, is approx. 44 cm, and the growth trajectory of the average plant shows that the increase in height is steepest between days 18-23 post planting, when pH is 6.1.
There is a statistically significant difference in plant heights between bays.
The difference is most significant between bays 205 & 210 and bays 206 & 210.
Water temp., pH & Ca, Cu, NO3N solutions were most highly correlated to height.
Nutrient solutions were negatively correlated with plant height: higher nutrient solution levels were associated with plants ultimately being shorter, which is commercially significant.
A logistic regression model was developed using features pH and temp, which has an accuracy of 66% on the validation set.

Further Applications of the Study

The ultimate motivation behind the study was to find commercially and academically significant findings pertaining to hydroponically grown basil. Here is how the findings of the study can be applied

Academic: pH is one of the features which is strongly correlated with height. Pending further research, this potentially counters studies like “Influence of nutrient solution pH on hydroponic basil plant growth and nutrient content” (Ohio State University), which state that pH is not a determining factor in plant growth.

Commercial: The characteristic mild, sweet taste of basil leaves is a result of heavily pruning the plant (to 1-2 inches). Flowering makes leaves bitter. It was also found that the three nutrient solutions (calcium, copper, nitrate) in my correlation analysis were negatively associated with plant growth. This could aid a commercial grower wanting to grow shorter, more flavourful basil.

Semester

Fall 2020

Researcher

Dayawanti Punj
