KrishiDrishti

Prototype

2026-03-14 · 18 min

Python · OpenCV · TensorFlow · MobileNetV2

Note: This is an ongoing prototype. The write-up documents the current architecture, experiments, and lessons so far rather than a finished deployed system.

KrishiDrishti — Building Offline-First Aerial Intelligence for Farms That Don't Fit the Textbook

A case study in system design, applied ML, and the gap between a Kaggle notebook and a working product.

1. The problem, stated the way it actually showed up

Agriculture is close to 14% of Bangladesh's GDP and employs roughly 41% of the workforce, but the tooling built for "precision agriculture" (DJI Agras, XAG, Climate FieldView, Granular) is designed for a different geometry of farm entirely: large, contiguous, single-crop fields in the US, Australia, or China. Bangladesh's farmland is the opposite of that. A single hectare can be split across five smallholders, each growing a different crop, with plot boundaries defined by informal earthen ridges instead of GPS-surveyed lines. Roughly 80% of farmers are smallholders who will never buy a ৳300,000+ drone, and the people spraying pesticide are doing it by hand, for two hours per bigha, with no protective equipment.

That's the actual problem. Not "agriculture is inefficient" in the abstract — a specific mismatch between the unit of analysis commercial ag-tech assumes (a field) and the unit of analysis that actually exists on the ground (a fragmented plot). Any system that sprays or scouts "the field" uniformly is solving the wrong problem here. That mismatch became the design constraint that shaped everything downstream: the system had to reason about plots, not fields, and it had to do it without assuming internet connectivity, because rural connectivity in flood season is exactly when you need this most and exactly when you don't have it.

I scoped the system as two cooperating machines: a cheap scout drone that flies low and photographs, and an AI pipeline that turns those photographs into a GPS-tagged prescription — which plots are diseased, which are flood-damaged, and what should happen next. A second, heavier payload drone (or a laser, in the prototype) acts on that prescription. The whole thing had to run for under ৳35,000 in prototype form, which meant every technical decision was also a cost decision.

2. Why "offline-first" wasn't a checkbox, it was the architecture

It would have been easier to build this as a typical SaaS: upload images to a cloud endpoint, run inference on a GPU server, return JSON. I didn't do that, and not out of purism — during the 2022 Sylhet floods, connectivity in the affected areas was one of the first things to fail, at exactly the moment damage assessment mattered most. A system that depends on connectivity to do its core job is a system that fails during the disaster it claims to help with.

So "offline-first" became a hard constraint that ruled out entire categories of otherwise-reasonable design choices:

No calling a hosted vision API for classification — the model has to run locally, on a laptop CPU, in the field.
No relying on cloud storage for the prescription pipeline — everything writes to local disk and gets synced later if connectivity exists.
No assuming the camera has GPS — action cameras in this price bracket don't, so location has to be reconstructed after the fact.

Each of these constraints forced a genuinely harder engineering problem than the cloud-first version would have. That's the pattern I'd underline for anyone starting a similar project: your non-functional constraints (offline, cheap, field-repairable) will dictate your architecture far more than your functional requirements will. I designed the functional requirements first and paid for it later when I had to retrofit the offline constraint into a pipeline that assumed always-available compute.

3. Deep dive: the technical problems that actually took time

The parts of this project that "sound simple" in the architecture diagram were, without exception, the parts that ate the most hours. I'll go through them in the order I hit them, because the order matters — each one exposed an assumption from the previous stage that turned out to be wrong.

3.1 The camera has no idea where it is

The SJCAM SJ4000 Air captures beautiful 16MP time-lapse stills. It has no GPS chip. The flight controller (an APM 2.8) does log GPS, but as a separate stream with its own clock, written to a MAVLink binary log or exported as CSV from Mission Planner. Nothing ties a photo to a coordinate natively.

My first instinct — and this is the mistake I want to flag explicitly because I think most beginners make it — was to assume I could just match photo index to GPS point index by counting: photo 1 goes with the 1st logged position, photo 2 with the 2nd, and so on. That breaks immediately in practice because the two devices sample at different, non-deterministic rates: the camera fires roughly every 3 seconds but with real-world jitter from write latency to the SD card, and the flight controller logs GPS fixes at whatever rate its GPS module updates, which is not synchronized to the camera at all. Index-matching silently produces a coherent-looking but wrong GPS trail.

The actual fix was timestamp correlation: extract DateTimeOriginal from each photo's EXIF block, extract timestamped GPS fixes from the flight log, and for every photo find the GPS fix with the minimum absolute time delta. That's a nearest-neighbor match on a 1D timeline — simple in principle, but it exposed a second, nastier problem: clock drift. The camera's internal clock and the flight controller's system time are two independent, unsynchronized clocks. If nobody remembers to sync the camera's clock before flight, every single photo-to-GPS match is offset by a constant error — sometimes several minutes — which at typical scout-drone speed can put the "diseased plot" GPS coordinate in a completely different field. I ended up logging the time delta for every correlation and flagging anything over 5 seconds as suspect, specifically so that a silent GPS error doesn't become a silent laser-marking error later in the pipeline. That's the kind of bug that doesn't crash anything — it just quietly tells the payload drone to spray the wrong plot, which is worse than crashing.
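A minimal sketch of the nearest-fix matching described above, using only the standard library. It assumes the EXIF timestamps and the GPS fixes have already been parsed into `datetime` objects in the same time zone; the function name `correlate` and the dictionary layout are hypothetical, not taken from the project's code.

```python
from bisect import bisect_left
from datetime import datetime, timedelta

SUSPECT_AFTER = timedelta(seconds=5)  # threshold from the text above


def correlate(photo_times, gps_fixes, suspect_after=SUSPECT_AFTER):
    """Match each photo to the GPS fix with the smallest absolute time delta.

    photo_times: dict of photo name -> datetime (e.g. parsed from DateTimeOriginal)
    gps_fixes:   list of (datetime, lat, lon) tuples, sorted by time
    Returns one dict per photo with the matched position, delta, and a suspect flag.
    """
    if not gps_fixes:
        raise ValueError("no GPS fixes to match against")
    fix_times = [t for t, _, _ in gps_fixes]
    matches = []
    for name, taken in sorted(photo_times.items(), key=lambda kv: kv[1]):
        i = bisect_left(fix_times, taken)
        # The nearest fix is either just before or just after the insertion point.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(gps_fixes)]
        best = min(candidates, key=lambda j: abs(fix_times[j] - taken))
        delta = abs(fix_times[best] - taken)
        _, lat, lon = gps_fixes[best]
        matches.append({
            "photo": name, "lat": lat, "lon": lon,
            "delta_s": delta.total_seconds(),
            "suspect": delta > suspect_after,  # logged, never silently dropped
        })
    return matches
```

Keeping `delta_s` on every record, rather than only on the flagged ones, is what makes a later audit of the GPS trail possible.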

For the demo, when the correlation pipeline wasn't reliable enough to trust live, I built a deliberate fallback: manually record the four GPS corners of the test field before flight, and interpolate photo positions across a known grid pattern. It's an admission that the "real" solution wasn't robust enough yet, and I think that honesty matters in a case study — production readiness and demo readiness are different bars, and pretending otherwise is how demos become embarrassing in front of people who ask hard questions.
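The corner-based fallback can be sketched as a bilinear interpolation over the surveyed corners. This is an illustration under assumptions, not the demo code: the corner keys (`nw`, `ne`, `sw`, `se`), the `grid_positions` name, and the idea that photos map onto a regular rows-by-columns grid are all hypothetical, and mapping flight order onto grid cells (for example a back-and-forth sweep) is left out.

```python
def grid_positions(corners, rows, cols):
    """Bilinearly interpolate (lat, lon) for a rows x cols photo grid.

    corners: dict with keys "nw", "ne", "sw", "se", each a (lat, lon) tuple.
    Returns a list of rows, each a list of (lat, lon) tuples.
    """
    def lerp(a, b, t):
        return (a[0] + (b[0] - a[0]) * t, a[1] + (b[1] - a[1]) * t)

    grid = []
    for r in range(rows):
        v = r / (rows - 1) if rows > 1 else 0.0  # 0 at the north edge, 1 at the south
        west = lerp(corners["nw"], corners["sw"], v)
        east = lerp(corners["ne"], corners["se"], v)
        grid.append([
            lerp(west, east, c / (cols - 1) if cols > 1 else 0.0)
            for c in range(cols)
        ])
    return grid
```

Over a field a few tens of meters across, treating latitude and longitude as a flat plane like this is a reasonable approximation; it would not be over long distances.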

3.2 Why I didn't use deep learning for plot boundaries

This is the decision I'd defend hardest, and it's the core of the "pattern recognition" story in this project. The instinct, coming from an ML-first mindset, is: segmentation is a computer vision problem, computer vision problems get solved with a CNN, therefore train a segmentation model on plot boundaries. I actually started down that road and stopped, for a reason that's more about epistemics than about code: there is no labeled dataset of Bangladeshi fragmented plot boundaries anywhere. Not a small one, not a noisy one — none. Training a segmentation model would have meant hand-labeling my own dataset from scratch, which for a first ML project with a competition deadline was not a realistic bar to clear, and even if I'd cleared it, a model trained on a few dozen hand-labeled images from one location would generalize badly to a different field's soil color, ridge height, and lighting.

So I used classical computer vision instead, and I think this is a more interesting engineering decision than the neural network: plot boundaries in this kind of farmland are visually distinct by construction (earthen ridges, color discontinuities, water channels), which means the pattern is recoverable with hand-engineered features rather than learned ones. The pipeline is a chain of well-understood primitives: HSV color-space vegetation masking, morphological erosion/dilation to denoise, Canny edge detection, probabilistic Hough transform to recover line segments, angle/length filtering to reject noise, then connected-component analysis to turn the surviving lines into enclosed plot regions.

The reason this was the right call, not just the convenient one: classical CV here is deterministic and debuggable. When a boundary detection is wrong, I can look at the HSV mask and see exactly why — a shadow got classified as a ridge, or two nearly-parallel ridges didn't get merged. With a CNN, a wrong segmentation is a black box you can't interrogate without another whole tooling investment (Grad-CAM, saliency maps, and so on) that isn't worth building for a prototype. Pattern recognition doesn't have to mean deep learning — sometimes the pattern is structurally simple enough that a human can specify the recognition rule directly, and when that's true, doing so is more robust, not less sophisticated. Knowing which regime you're in — "the pattern is learnable from data I have" vs. "the pattern is specifiable by a human and I don't have the data anyway" — was the single most useful piece of judgment I developed on this project.

3.3 Training the disease classifier — where deep learning was the right tool

Disease classification is the opposite case: the visual signature of bacterial leaf blight vs. rice blast vs. a healthy leaf is not something I could write an HSV rule for, and unlike plot boundaries, a real labeled dataset exists (PlantVillage, ~54,000 images across 38 classes). This is where transfer learning earns its keep.

I used MobileNetV2 pretrained on ImageNet, not because it's the most accurate architecture available, but because of a constraint that mattered more here: it has to run inference on a laptop CPU in the field, in under 500ms per frame, with no GPU guaranteed. MobileNetV2's depthwise-separable convolutions trade some accuracy for a dramatically smaller parameter count (3.4M) and lower FLOPs than a comparable ResNet or EfficientNet, which is exactly the trade a field-deployed offline system should make. Accuracy that requires a GPU you don't have in the field is not actually accuracy you can use.

I trained in two phases, and the reason for splitting it into two phases is worth explaining rather than just stating, because it's a subtlety that's easy to skip past:

Phase 1: frozen base, train only the classification head. With a small, domain-shifted dataset (leaf photos are not ImageNet photos), if you unfreeze the whole backbone from the start, the large gradient updates driven by a randomly-initialized head propagate back into the pretrained weights and destroy the general-purpose features the backbone already learned. This is the classic catastrophic-forgetting failure mode of transfer learning. Freezing the base and training only the new dense layers first lets the head reach a sane starting point without corrupting the backbone.
Phase 2: unfreeze the top layers only, and fine-tune at a much lower learning rate. Once the head is stable, unfreezing the top of MobileNetV2 (everything from layer index 100 onward, roughly the last 50 of the 154 layers in the Keras implementation) and fine-tuning at 1/10th the original learning rate lets the backbone adapt its higher-level features to leaf-specific texture and color patterns, without the lower-level generic features (edges, gradients, basic shapes) getting wrecked by a learning rate too aggressive for weights that were already well-optimized.
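The two phases can be sketched in Keras as below. This is a minimal outline under assumptions rather than the project's training script: the head (global average pooling, dropout, one dense layer), the learning rates, and the helper names are all illustrative, and inputs are assumed to be already scaled to the [-1, 1] range MobileNetV2 expects.

```python
import tensorflow as tf

IMG_SHAPE = (224, 224, 3)
FINE_TUNE_AT = 100  # layers from this index onward are unfrozen in phase 2


def build_model(num_classes, weights="imagenet"):
    """Return (model, base) with the backbone frozen, ready for phase 1."""
    base = tf.keras.applications.MobileNetV2(
        input_shape=IMG_SHAPE, include_top=False, weights=weights)
    base.trainable = False
    inputs = tf.keras.Input(shape=IMG_SHAPE)
    x = base(inputs, training=False)  # keep BatchNorm layers in inference mode
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    x = tf.keras.layers.Dropout(0.2)(x)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs), base


def compile_for(model, lr):
    """(Re)compile; required after changing which layers are trainable."""
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])


def unfreeze_top(base, fine_tune_at=FINE_TUNE_AT):
    """Phase 2: unfreeze the backbone, then re-freeze everything below the cut."""
    base.trainable = True
    for layer in base.layers[:fine_tune_at]:
        layer.trainable = False


# Intended order of calls (train_ds / val_ds are hypothetical tf.data datasets):
#   model, base = build_model(num_classes=10)
#   compile_for(model, 1e-3); model.fit(train_ds, validation_data=val_ds, epochs=10)
#   unfreeze_top(base)
#   compile_for(model, 1e-4); model.fit(train_ds, validation_data=val_ds, epochs=10)
```

The second `compile_for` call matters: changes to `trainable` only take effect once the model is compiled again.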

The other lesson from training: fewer, better-chosen classes beat more classes, every time, for a small dataset. The full PlantVillage set has 38 classes, but most of those are irrelevant to Bangladeshi agriculture (grapes, blueberries, strawberries) and diluting a small effective sample size across 38 classes tanks per-class accuracy. Cutting down to 10 classes — rice, tomato, and potato, healthy plus their most common diseases — was a direct, deliberate trade of generality for accuracy that a competition demo needs and a research paper doesn't.

I also learned the hard way that data augmentation parameters are not free — I initially used aggressive rotation and shear ranges copied from a generic tutorial, which was fine for the PlantVillage close-up leaf shots the dataset was built from, but produced augmented images that no longer resembled what an aerial camera at 5 meters altitude actually sees. Augmentation should model the deployment distribution, not just "more variation is always better." Vertical flips, wide brightness ranges (simulating harsh midday sun vs. overcast), and moderate zoom made sense for aerial shots; extreme shear did not.
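As a rough illustration of augmentation matched to an overhead camera, the sketch below applies only flips, brightness scaling, and a moderate zoom crop, and deliberately leaves out shear. It uses plain NumPy instead of a framework augmentation layer; the ranges and the `augment_aerial` name are assumptions, not the values used in training.

```python
import numpy as np


def augment_aerial(image, rng, brightness=(0.6, 1.4), max_zoom=1.2):
    """Augment one HxWx3 uint8 frame in ways an overhead camera could produce.

    The result keeps its cropped size; resize it to the model input afterwards.
    """
    out = image
    if rng.random() < 0.5:
        out = out[::-1, :, :]  # vertical flip: a top-down view has no "up"
    if rng.random() < 0.5:
        out = out[:, ::-1, :]  # horizontal flip
    # Brightness scaling: harsh midday sun vs. overcast.
    factor = rng.uniform(*brightness)
    out = np.clip(out.astype(np.float32) * factor, 0, 255).astype(np.uint8)
    # Moderate zoom-in: crop a centered window of 1/zoom the original size.
    zoom = rng.uniform(1.0, max_zoom)
    h, w = out.shape[:2]
    ch, cw = int(round(h / zoom)), int(round(w / zoom))
    top, left = (h - ch) // 2, (w - cw) // 2
    return out[top:top + ch, left:left + cw]
```

Passing in a seeded `np.random.default_rng(seed)` keeps augmented batches reproducible between training runs.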

3.4 Detecting disease at aerial scale — the grid problem

A single aerial frame at 5m altitude covers roughly 8m × 5m of ground. Disease frequently affects a small patch within that frame, not the whole thing. Running the classifier once on the full frame averages the diseased patch into the surrounding healthy majority and the signal disappears — this is the same failure mode as "the average of a spike and a flat line is a slightly-raised flat line."

The fix is to divide each frame into a grid (I used 4×4 = 16 cells), resize each cell independently to the model's 224×224 input, and classify cells independently before reassembling the results into a spatial disease map. This is a small architectural decision with an outsized effect: it converts a single whole-image classification problem into 16 independent local classification problems, which is closer to the actual spatial structure of the phenomenon being detected. It costs 16x the inference calls per frame, which is why the CPU inference time budget (<500ms per frame, not per cell) had to be taken seriously — MobileNetV2's efficiency wasn't a nice-to-have here, it was what made the grid approach computationally feasible at all on commodity hardware.
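A small sketch of the grid step, assuming the classifier is wrapped in a callable that takes one cell and returns a label. The `classify_grid` name and the `classify_cell` callable are hypothetical; resizing each cell to 224×224 is left to that callable.

```python
import numpy as np


def classify_grid(frame, classify_cell, rows=4, cols=4):
    """Split a frame into rows x cols cells and classify each one independently.

    frame: HxWx3 array. classify_cell: callable taking one cell, returning a label.
    Returns a rows x cols nested list of labels (the spatial disease map).
    """
    h, w = frame.shape[:2]
    # Integer cell edges that cover the whole frame even when h or w
    # is not evenly divisible by the grid size.
    ys = np.linspace(0, h, rows + 1).astype(int)
    xs = np.linspace(0, w, cols + 1).astype(int)
    return [
        [classify_cell(frame[ys[r]:ys[r + 1], xs[c]:xs[c + 1]]) for c in range(cols)]
        for r in range(rows)
    ]
```

With 16 cells, a 500 ms frame budget works out to roughly 31 ms per cell if cells are classified one at a time; stacking all 16 resized cells into a single batch is one way to spend that budget more efficiently.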

3.5 Vegetation health without a multispectral sensor

True NDVI needs a near-infrared band, which means a multispectral camera, which the ৳35,000 budget doesn't allow. Instead I used RGB-derived vegetation indices — Excess Green (2G − R − B) and Excess Red (1.4R − G), combined as Health = ExG − ExR and normalized to a 0–100 scale. These are decades-old agronomy formulas, not something I invented, but applying them as a real-time heatmap over drone footage and correlating the result per-plot with the boundary detector from 3.2 was the integration work. The honest caveat, which I kept in the technical docs and repeat here: RGB indices correlate with true NDVI for detecting obvious stress but are not a substitute for it. That's a real limitation of the prototype, and the production roadmap explicitly upgrades to a NIR-capable camera rather than pretending the RGB approximation is good enough forever.
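The index computation can be sketched in NumPy as follows. The ExG and ExR formulas are the ones given above; the normalization is an assumption, since the text does not say how the 0–100 scale is derived. Here the channels are first converted to chromatic coordinates (r + g + b = 1), which bounds Health = ExG − ExR between −2.4 and 3.0, and that range is mapped linearly onto 0–100.

```python
import numpy as np

# With r + g + b = 1, Health = ExG - ExR = 3g - 2.4r - b, which ranges
# from -2.4 (pure red) to 3.0 (pure green). This scaling is an assumption.
HEALTH_MIN, HEALTH_MAX = -2.4, 3.0


def health_map(rgb):
    """Per-pixel 0-100 health score from an HxWx3 array in RGB channel order."""
    rgb = rgb.astype(np.float64)
    total = rgb.sum(axis=-1, keepdims=True)
    # Chromatic coordinates; black pixels (sum 0) are treated as neutral gray.
    chroma = np.divide(rgb, total, out=np.full_like(rgb, 1 / 3), where=total > 0)
    r, g, b = chroma[..., 0], chroma[..., 1], chroma[..., 2]
    exg = 2 * g - r - b      # Excess Green
    exr = 1.4 * r - g        # Excess Red
    health = exg - exr
    return 100 * (health - HEALTH_MIN) / (HEALTH_MAX - HEALTH_MIN)
```

Note the channel order: OpenCV loads images as BGR, so a frame read with `cv2.imread` would need converting to RGB before being passed in.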

3.6 Flood detection — where "it works on my images" almost bit me

Flood/water detection uses the same family of technique as plot boundaries: HSV thresholding for standing water and muddy water, combined with the vegetation mask to compute water percentage and vegetation loss per plot, then a threshold-based severity classification (>70% water = severe, >30% = moderate, else intact).
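The severity rule is simple enough to show directly. This sketch assumes the water mask and the per-plot mask already exist as boolean arrays; the `flood_severity` name and the `"unknown"` result for an empty plot are additions for illustration, not part of the described system.

```python
import numpy as np


def flood_severity(water_mask, plot_mask):
    """Return (water_percent, severity) for one plot.

    water_mask, plot_mask: boolean HxW arrays (water pixels, pixels inside the plot).
    """
    plot_pixels = int(plot_mask.sum())
    if plot_pixels == 0:
        return 0.0, "unknown"  # nothing to measure: say so rather than guess
    water_percent = 100.0 * int((water_mask & plot_mask).sum()) / plot_pixels
    if water_percent > 70:
        severity = "severe"
    elif water_percent > 30:
        severity = "moderate"
    else:
        severity = "intact"
    return water_percent, severity
```

Returning the percentage alongside the label keeps the raw number available for anyone who wants to apply a different cutoff later.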

The trap here — and I want to name it explicitly because it's a general trap with any threshold-based classical CV system — is that HSV thresholds tuned on one set of sample images will silently fail on a different water color, soil type, or lighting condition, and they fail quietly. A misclassified pixel doesn't throw an exception; it just produces a plausible-looking but wrong damage percentage. I don't have a clean solution to this in the prototype — the honest answer is that these thresholds need site-specific calibration, and I flagged that as an open production requirement rather than hiding it. A system that reports "62% flooded" with false confidence is more dangerous during a real disaster-response decision than a system that says "I'm not sure."

3.7 The laser demo — debugging a 1998-era file format

The "AI finds disease, drone flies back and marks it with a laser" demo required generating a Mission Planner-compatible .waypoints file in the QGC WPL 110 format, a plain-text, tab-separated format with positional fields (command ID, frame, autocontinue flags, lat/lon/alt) that has essentially no documentation outside old ArduPilot forum posts. Getting the DO_SET_RELAY command (ID 181) sequenced correctly after the NAV_WAYPOINT command (ID 16) so the laser fires only once the drone has actually arrived and hovered at the diseased plot's coordinates took more trial-and-error against a real flight controller than any of the ML work did. It's a good reminder that in a hardware-integrated system, the "boring" glue format between two pieces of infrastructure can be a bigger time sink than the AI component that gets top billing in the pitch deck.
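For orientation, here is a sketch of a generator for that file format: a `QGC WPL 110` header line followed by twelve tab-separated fields per row (index, current flag, frame, command, four parameters, latitude, longitude, altitude, autocontinue). It covers the file layout only. The function names, altitude, hold time, and relay number are assumptions, and whether a DO command placed after a NAV command runs on arrival or earlier depends on the firmware's mission scheduler, so the ordering should be checked against the real flight controller.

```python
NAV_WAYPOINT = 16
DO_SET_RELAY = 181
FRAME_GLOBAL = 0               # absolute frame, used here for the home row
FRAME_GLOBAL_RELATIVE_ALT = 3  # altitude relative to home


def _row(seq, current, frame, command, params, lat, lon, alt):
    fields = [seq, current, frame, command,
              *(f"{p:.6f}" for p in params),
              f"{lat:.7f}", f"{lon:.7f}", f"{alt:.2f}",
              1]  # trailing 1 = autocontinue
    return "\t".join(str(f) for f in fields)


def build_mission(home, targets, alt_m=5.0, hold_s=5, relay=0):
    """home: (lat, lon); targets: list of (lat, lon) for plots to mark."""
    rows = [_row(0, 1, FRAME_GLOBAL, NAV_WAYPOINT, (0, 0, 0, 0), *home, 0.0)]
    for lat, lon in targets:
        # Fly to the plot; param1 is the hold time in seconds.
        rows.append(_row(len(rows), 0, FRAME_GLOBAL_RELATIVE_ALT, NAV_WAYPOINT,
                         (hold_s, 0, 0, 0), lat, lon, alt_m))
        # Switch the relay on; param1 = relay number, param2 = 1 for on.
        rows.append(_row(len(rows), 0, FRAME_GLOBAL_RELATIVE_ALT, DO_SET_RELAY,
                         (relay, 1, 0, 0), 0.0, 0.0, 0.0))
    return "QGC WPL 110\n" + "\n".join(rows) + "\n"
```

A real mission would also need a matching relay-off command after each target; it is omitted here to keep the format visible.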

4. Security — the part most student projects skip, and that I almost skipped too

I want to be direct about this: a prototype built to win a competition demo and a system that's safe to put in front of real farmers' land data and real flight hardware are different bars, and I did not fully clear the second bar. Writing that down honestly is, I think, more useful to a reader than pretending the prototype is production-secure. Here's the actual threat model I ended up thinking through, and where the prototype stands against it.

Telemetry and command-link security. MAVLink, the protocol the APM 2.8 flight controller speaks, has historically shipped with no encryption and weak-to-no authentication on many common configurations. Anyone with a compatible radio and physical proximity to the drone's telemetry frequency can potentially intercept flight data or, worse, inject commands. MAVLink 2.0 added optional signing, but it has to be deliberately configured — it isn't the default. For the prototype this is an accepted risk because the flight envelope is small, supervised, and VLOS; for a production system flying autonomous missions over farmland, MAVLink signing and a proper key-management story for the ground station are non-negotiable, not an afterthought.

The Streamlit demo app has no authentication. By design, for a competition demo, the app runs locally with streamlit run app.py and anyone on the local network segment it's bound to can reach it. That's fine for a laptop on a table at a competition. It is not fine as a description of how a farmer-facing product should work, and I want to flag that gap rather than let the demo's simplicity imply the eventual product is equally open. A real deployment needs, at minimum: authenticated sessions scoped to a farmer or cooperative's own data, and no shared visibility into another farmer's plot health or GPS coordinates by default.

Uploaded-image handling is an attack surface, not just a UX feature. The file uploader accepts JPEG/PNG from the user with no validation beyond the extension whitelist Streamlit provides. I did not, in the prototype, add magic-byte verification, size limits enforced server-side (not just client-side), or defenses against maliciously crafted image files designed to exploit a decoder vulnerability in OpenCV or Pillow — both of which have had real CVEs in their image-parsing code over the years. For a prototype processing pre-vetted demo images, this risk is low. For any deployment accepting uploads from arbitrary users, it's a concrete, well-understood class of vulnerability (decompression bombs, decoder buffer overflows, path traversal via crafted filenames) that needs real input validation, not an extension check.
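A first layer of that validation, a server-side size limit plus a magic-byte check, fits in a few lines of standard-library Python. The 10 MB limit and the `sniff_image` name are illustrative assumptions.

```python
MAX_BYTES = 10 * 1024 * 1024  # illustrative limit, enforced on the server side
MAGIC = {
    "jpeg": b"\xff\xd8\xff",
    "png": b"\x89PNG\r\n\x1a\n",
}


def sniff_image(data, max_bytes=MAX_BYTES):
    """Return "jpeg" or "png" based on the leading bytes, or raise ValueError."""
    if len(data) > max_bytes:
        raise ValueError(f"upload is {len(data)} bytes, limit is {max_bytes}")
    for kind, magic in MAGIC.items():
        if data.startswith(magic):
            return kind
    raise ValueError("not a JPEG or PNG, whatever the filename says")
```

This only rejects files that are obviously not what they claim to be. It does nothing about decompression bombs (which need a cap on decoded pixel dimensions) or decoder bugs (which need patched libraries), so it is a starting point rather than a complete defense.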

GPS coordinates are more sensitive than they look. It's easy to treat {"lat": 23.9915, "lon": 90.4125} as just a technical parameter. It is also, precisely, the location of a specific farmer's land, correlated with data about that land's crop health, disease status, and — in emergency mode — flood damage used for compensation claims. That combination (precise geolocation + economic/insurance-relevant data + a specific identifiable smallholder) is exactly the kind of dataset that needs an actual data-governance story before it goes anywhere near production: who can query which farmer's plots, how long the prescription JSON is retained, whether compensation-report data is shared with government agencies without explicit consent, and how a farmer's data is deleted on request. None of that existed in the prototype. I'm naming it here because building the AI pipeline is the easy 80% of a system like this, and the governance layer around sensitive agricultural and geolocation data is the harder, less glamorous 20% that a real deployment cannot skip.

Model and artifact integrity. The .h5 model file and class label JSON are loaded from local disk with no integrity check — no hash verification, no signature. On a laptop you control physically for a demo, that's a non-issue. On any device where the model file could be swapped by someone with local access (a shared field laptop, a compromised USB drive used to move files, which the demo-day checklist explicitly relies on), a corrupted or replaced model file could silently change what gets classified as "healthy" vs. "diseased" with no visible failure signal. That's a supply-chain-style risk worth taking seriously the moment this leaves a single trusted laptop.
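A hash check before loading is the smallest version of that integrity step. The sketch below assumes the expected SHA-256 digest is recorded somewhere the attacker cannot also edit (for example, printed on the demo-day checklist); the function names are hypothetical.

```python
import hashlib
import hmac


def sha256_of(path, chunk_size=1 << 20):
    """Stream the file through SHA-256 so large model files need little memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path, expected_hex):
    """Raise if the file's SHA-256 does not match the recorded digest."""
    actual = sha256_of(path)
    if not hmac.compare_digest(actual, expected_hex.lower()):
        raise RuntimeError(f"{path}: digest mismatch, refusing to load")
    return path
```

Calling `verify_artifact` on both the model file and the class-label JSON before loading them turns a silent swap into a loud failure. A plain hash detects corruption and casual tampering; protecting against someone who can also replace the recorded digest would need a signature instead.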

What I got right, mostly by virtue of the offline-first constraint rather than deliberate security design: no cloud API keys embedded anywhere, no third-party data broker in the loop, no telemetry phoning home, and the entire attack surface of "a cloud backend gets breached and every farmer's field data leaks at once" simply doesn't exist because there is no cloud backend. Offline-first bought real security properties as a side effect of a cost and reliability decision, which is a pattern worth remembering: sometimes the constraint that looks like it's making your life harder is also quietly closing off a whole category of risk you'd otherwise have to defend separately.

5. Prototype vs. production — being honest about the gap

| | Prototype (what exists) | Production (what it needs to become) |
|---|---|---|
| Cost | ৳35,000 | ৳80,000–150,000 |
| Flight controller | APM 2.8 | Pixhawk 6X |
| Camera | RGB action cam, no GPS | Multispectral, GPS-tagged natively |
| Processing | Laptop, offline batch | Onboard Raspberry Pi / edge compute |
| GPS accuracy | ~2.5m (NEO-6M) | ~10cm (dual-antenna RTK) |
| Telemetry security | Unsigned MAVLink | MAVLink 2.0 signing + key management |
| App access | No auth, local network | Authenticated, per-farmer data isolation |
| Data governance | None | Retention policy, consent, access control |

Both ends of that table are still roughly 90%+ cheaper than commercial agricultural drone systems, which is the number that actually matters for the target user — but the table is also a deliberate admission that "it works in the demo" and "it's safe to deploy" are not the same milestone, and conflating them is the single most common mistake I see (and made myself) in student and hackathon ML projects.

6. What I'd tell myself before starting

If I were starting this project again, knowing only what I knew at the beginning, here's what I'd want someone to have told me:

Decide, module by module, whether you actually have the data to justify a neural network — and if you don't, a classical CV pipeline you can debug by eye is not a lesser solution, it's the correct engineering choice for that module. I nearly burned a week trying to force a segmentation model onto a problem that didn't have training data, before realizing the pattern was simpler than the tool I was reaching for.
Your non-functional constraints (cost, offline operation, field-repairability) will do more to shape your architecture than your functional requirements will. Design them in from day one; retrofitting "offline" onto a cloud-first design after the fact is much more expensive than the reverse.
A silent wrong answer is worse than a crash. GPS mismatches, HSV misclassifications, and stale model files don't throw exceptions — they produce a plausible, wrong number that flows downstream into a real-world action (spraying the wrong plot, misreporting flood damage). Build the "flag this as suspect" logic in from the start, not as a patch afterward.
Security and data governance for a system touching real people's land and livelihood data is not a "later" problem. I treated it as one, and in hindsight the governance layer deserved design attention from week one, alongside the model architecture, not after the demo worked.
Two-phase transfer learning, small deliberately-chosen class sets, and matching your augmentation strategy to your actual deployment distribution were the three highest-leverage ML decisions on this project — far more than architecture search or hyperparameter tuning would have been, given the size of the dataset I actually had.

This was my first machine learning project, built without a lab, a mentor, or a team of ML engineers to check my work — just documentation, forum posts, a lot of failed training runs, and a hard deadline. I'm sharing the mistakes alongside the parts that worked because I think that's the more useful artifact for anyone reading this who's about to start their own first project: the architecture diagram is easy to copy. The judgment about why each piece is built the way it is — and being honest about what still isn't solved — is the part actually worth writing down.