SD card errors on A64-OLinuXino-2Ge8G-IND

Started by ilario, September 13, 2023, 05:00:11 PM


ilario

Dear all,
I have two A64-OLinuXino-2Ge8G-IND each one with a Kingston SDCIT2/32GB SD card and running A64-OLinuXino-bullseye-base-20230515-130040 image with kernel 5.10.180-olimex.

I tested one of the cards with badblocks -n, after connecting it through the USB SD card reader available on the Olimex website, and found no errors.

On both A64-OLinuXino-2Ge8G-IND units, after some random time using them with disk activity, I get errors in dmesg, like these ones:

mmc_erase: group start error -110, status 0x0
sunxi-mmc 1c0f000.mmc: data error, sending stop command
sunxi-mmc 1c0f000.mmc: send stop command failed
blk_update_request: I/O error, dev mmcblk0, sector 333704 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
EXT4-fs warning (device mmcblk0p1): ext4_end_bio:351: I/O error 10 writing to inode 272917 starting block 41714)
Buffer I/O error on device mmcblk0p1, logical block 39153
blk_update_request: I/O error, dev mmcblk0, sector 47925289 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0

The internal eMMC memory never showed any such messages.
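
To compare the two controllers, the sunxi-mmc error lines can be counted per controller address. This is only a minimal sketch: the function name count_mmc_errors is made up for illustration, and it matches only the two message formats quoted above.

```shell
# Count sunxi-mmc error lines per controller address.
# Reads a kernel log on stdin, prints "<address>.mmc <count>" per controller.
# count_mmc_errors is a hypothetical helper name, not part of any package.
count_mmc_errors() {
    grep -E 'sunxi-mmc [0-9a-f]+\.mmc: (data error|send stop command failed)' \
        | sed -E 's/.*sunxi-mmc ([0-9a-f]+\.mmc):.*/\1/' \
        | sort | uniq -c | awk '{print $2, $1}'
}
```

It could be used as `dmesg | count_mmc_errors`. A controller that never logged one of these messages simply does not appear in the output.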

I tried to reproduce these very nice instructions but cannot find any of the files mentioned (and I am not brave enough to create a dtb overlay by myself).

My setup is a bit fancy, but I believe the errors I am observing are due to the SD card driver, the bus speed, or something like that. The 32 GB SD card is split into a 24 GB and an 8 GB partition. The second partition has exactly the same size in sectors as the internal eMMC memory partition, and the two are joined in a RAID1 (mirror), with the original idea of minimizing the risk of data loss. This md0 device stores the database of a Prometheus instance running on the A64-OLinuXino-2Ge8G-IND.

Thanks!
Ilario

LubOlimex

Do you perform hardware power offs? Maybe just the fs on the EEPROM gets corrupted?

The instructions you found in the forum no longer apply. You are using the latest image that already has the fixes applied. Please ignore the instructions from the forum.

My first recommendation is:

1) Check the cards with F3 or H2testw software beforehand, this will show any errors with the cards.

2) Write the cards with BalenaEtcher or USBImager; both programs are free and allow verification after writing.

Let me know if the issue persists after that.
Technical support and documentation manager at Olimex

ilario

Quote from: LubOlimex on September 14, 2023, 08:53:18 AM
> Do you perform hardware power offs?

A few times I forgot to issue the poweroff command before unplugging, could this cause the issue?

Quote from: LubOlimex on September 14, 2023, 08:53:18 AM
> Maybe just the fs on the EEPROM gets corrupted?

EEPROM? How do I check that?

Quote from: LubOlimex on September 14, 2023, 08:53:18 AM
> My first recommendation is:
>
> 1) Check the cards with F3 or H2testw software beforehand, this will show any errors with the cards.

Ok, I will. Up to now I just checked one of the 2 SD cards using badblocks -n (non-destructive write test). I will test both with F3.

Quote from: LubOlimex on September 14, 2023, 08:53:18 AM
> 2) Write the cards with BalenaEtcher or USBImager; both programs are free and allow verification after writing.

I made one of the two SD cards with BalenaEtcher on Windows 11 and the second one with dd on Arch Linux.

Just one more piece of information, that I forgot to add in my initial post:

From dmesg the speed of the SD card and eMMC memory are:
mmc0: new high speed SDHC card at address 5048
mmcblk0: mmc0:5048 SDCIT 29.9 GiB
 mmcblk0: p1 p2
mmc1: new high speed MMC card at address 0001
mmcblk1: mmc1:0001 Q2J55L 7.09 GiB
mmcblk1boot0: mmc1:0001 Q2J55L partition 1 16.0 MiB
mmcblk1boot1: mmc1:0001 Q2J55L partition 2 16.0 MiB
 mmcblk1: p1

(interestingly, I cannot mount mmcblk1boot0 nor mmcblk1boot1, but this is another topic)

I will update you as soon as I manage to complete the F3 tests on both cards.

LubOlimex

> A few times I forgot to issue the poweroff command before unplugging, could this cause the issue?

Yes. To prevent it, avoid cutting the power completely. If that is hard to achieve, consider using a Li-Po battery as backup.

ilario

Updates:

Using the USB SD card reader, on both SD cards, I did:
  • test the OS partition with f3write and f3read, no errors for either of the cards
  • test the RAID partition with badblocks -n (it was easier than looking for a way to mount the partition), no errors for either of the cards

Then I put the SD cards back into the Olinuxino units and tested in the same way from there:
  • the OS partition with f3write and f3read (write speed 18 MB/s, read speed 23 MB/s), no errors
  • the RAID partition (without the RAID running, as it degraded due to the previous errors) tested with badblocks -n, no errors
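
The test sequence described above can be sketched as a small script. The names run, test_card and DRY_RUN, as well as the example paths, are made up for illustration. The sketch assumes f3 and badblocks are installed, that the first argument is a mounted filesystem and that the second is an unmounted partition.

```shell
# Dry-run helper: with DRY_RUN set, print the command instead of running it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# test_card MOUNTPOINT PARTITION   (hypothetical helper)
#   MOUNTPOINT: where the OS partition is mounted (f3 fills its free space)
#   PARTITION:  an UNMOUNTED partition to scan with badblocks
test_card() {
    mnt=$1
    part=$2
    run f3write "$mnt" || return          # fill the free space with test files
    run f3read "$mnt" || return           # read the files back and verify them
    run badblocks -nsv "$part"            # -n non-destructive read-write test,
                                          # -s show progress, -v verbose
}
```

For example, `DRY_RUN=1 test_card /mnt/sdcard /dev/sdX2` only prints the three commands (both paths are placeholders). The .h2w files left behind by f3write have to be deleted by hand afterwards.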

It is quite surprising that during normal operation I see errors and during these tests I do not see them.
After adding back the SD card partition to the RAID I ran F3 on the RAID (17 MB/s write, 30 MB/s read. Not bad) and observed no error on any of the A64 units.

I will let them run over the weekend, to see if the problem appears again.

Another (very likely unrelated) weird behaviour I observed is that mmcblk0 and mmcblk1 randomly swap at each boot (sometimes mmcblk0 is the SD card and sometimes it is the internal eMMC). Is this expected?
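
Whichever way the numbering falls on a given boot, the kernel exposes the card type of each MMC block device in sysfs, so it is possible to check which one is the SD card. A minimal sketch follows. The function name list_mmc_types and the SYSFS_ROOT variable are made up for illustration, and SYSFS_ROOT exists only so the function can be tried against a fake directory tree.

```shell
# Print "<device> <type>" for every mmcblk block device.
# The kernel reports the card type as e.g. SD or MMC in .../device/type.
# list_mmc_types and SYSFS_ROOT are hypothetical names for this sketch.
list_mmc_types() {
    root="${SYSFS_ROOT:-/sys}"
    for f in "$root"/block/mmcblk*/device/type; do
        [ -r "$f" ] || continue           # skip when the glob matched nothing
        dev=${f#"$root"/block/}           # strip the leading path
        dev=${dev%%/*}                    # keep only the device name
        printf '%s %s\n' "$dev" "$(cat "$f")"
    done
}
```

Referring to partitions by UUID (as printed by blkid) in /etc/fstab, rather than by /dev/mmcblkXpN, avoids depending on the numbering at all.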

ilario

Checking journalctl logs, I saw that the errors started after this line:

systemd[1]: Starting Discard unused blocks on filesystems from /etc/fstab...

This message comes from the fstrim.timer systemd unit, which launches fstrim.service, which in turn runs this command:

/sbin/fstrim --listed-in /etc/fstab

Manually running either of the commands

fstrim -v /

or

fstrim -v /mnt/raid1

triggered the same errors visible in dmesg.
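
The correlation between the trim job and the errors can be checked mechanically in a saved log. The sketch below is only an illustration: the function name first_erase_error is made up, and it matches only the two message texts quoted in this thread.

```shell
# Read a log on stdin. Print the most recent "Starting Discard unused blocks"
# line seen before the first mmc_erase error, then that error, then stop.
# first_erase_error is a hypothetical helper name.
first_erase_error() {
    awk '
        /Starting Discard unused blocks/ { last = $0 }
        /mmc_erase: .* error/ {
            if (last != "") print last
            print
            exit
        }
    '
}
```

It could be fed with `journalctl -b --no-pager | first_erase_error`. If no trim job started before the first error, only the error line is printed.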

So, for now I disabled the fstrim.timer systemd unit with:
systemctl disable fstrim.timer
systemctl stop fstrim.timer

I will let you know if the problem appears again.

Does this problem happen to everyone?
What is the right solution? Is this a driver bug?
I tried adding a "nodiscard" option in /etc/fstab, but it did not work at all (it seems that nodiscard is an option accepted by mount but not recognized in fstab).
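
A possible explanation, offered with caution: on ext4 the discard/nodiscard mount options control whether the filesystem issues discards on its own when blocks are freed, while fstrim requests a trim explicitly, so nodiscard would not be expected to stop the periodic job. To see whether any fstab entry enables online discard in addition to the timer, something like the sketch below could be used. The function name fstab_online_discard is made up for illustration.

```shell
# List the mount points in an fstab file whose options include "discard".
# Usage: fstab_online_discard [FILE]   (defaults to /etc/fstab)
# fstab_online_discard is a hypothetical helper name.
fstab_online_discard() {
    awk '$1 !~ /^#/ && NF >= 4 {
        n = split($4, opts, ",")          # field 4 holds the mount options
        for (i = 1; i <= n; i++)
            if (opts[i] == "discard" || opts[i] ~ /^discard=/) {
                print $2                  # field 2 is the mount point
                break
            }
    }' "${1:-/etc/fstab}"
}
```

An empty result means no entry asks for online discard, which leaves the fstrim.timer unit as the only periodic source of trim requests.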




ilario

Updates:

After disabling the automatic trimming, no other errors appeared.

Now I have set up another board, an A64-OLinuXino-2Ge16G-IND hardware revision G, with a new SD card (Kingston SDCIT2/32GB, 32GB microSDHC Industrial C10 A1 pSLC) and a freshly installed Debian Bullseye following the Olimex guide, with no fancy partitions and no RAID.
Before disabling the automatic trimming, I observed the same issues.


mmc_erase: group start error -110, status 0x0
sunxi-mmc 1c0f000.mmc: data error, sending stop command
sunxi-mmc 1c0f000.mmc: send stop command failed

LubOlimex

So it gets fixed by disabling trimming again?

About the error:

Do you boot from the eMMC or the card?

When does this error appear? Is it a one-time event during or after boot, or does it repeat after a while?

Does the board recover from the problem, or does it hang and need a reboot?


ilario

Quote from: LubOlimex on January 08, 2025, 04:51:20 PM
> So it gets fixed by disabling trimming again?

Yes :)
Do you think it could depend on the SD card I am using?
I bought it from here (and supposedly it is a good one):
https://www.mouser.es/ProductDetail/524-SDCIT2-32GB

Quote from: LubOlimex on January 08, 2025, 04:51:20 PM
> About the error:
>
> Do you boot from the eMMC or the card?

From the card. I find it more convenient: if I break something in the boot, I can plug the card into the laptop and fix it.

Quote from: LubOlimex on January 08, 2025, 04:51:20 PM
> When does this error appear? Is it a one-time event during or after boot, or does it repeat after a while?

After a while. I suppose it happens when systemd decides to run fstrim.

Quote from: LubOlimex on January 08, 2025, 04:51:20 PM
> Does the board recover from the problem, or does it hang and need a reboot?

Usually nothing bad happens. It happened once that I got that error during an apt upgrade, and the upgrade was interrupted. And when I was using RAID (bad idea, I stopped doing that) the RAID was de-synchronizing.

If you are interested in seeing the errors inside a full dmesg log, have a look at this one from November 2023: https://nextcloud.pangea.org/s/LZZTGtsgXPTJF36 You will also see RAID-related messages, as I was still using RAID back then.